Final version published in: Journal of the American Statistical Association
In studies of molecular profiling and biological pathway analysis using DNA microarray gene expression data, we are utilising a broad class of sparse latent factor and regression models for large-scale multivariate analysis and regression prediction. We present examples of these applications in the current paper, along with discussion of key aspects of the modelling and computational methodology. Our case studies are drawn from breast cancer genomics, where we are concerned with predictive/prognostic uses of aggregate patterns in gene expression profiles in clinical contexts, and also with the investigation and characterisation of heterogeneity of structure related to specific oncogenic pathways. Based on the metaphor of statistically derived "factors" as representing biological "subpathway" structure, we explore the decomposition of fitted sparse factor models into pathway subcomponents, and how these components overlay multiple aspects of known biological structure in this network. We discuss the discovery and predictive uses of this approach, and the ability to use such models to enrich existing biological descriptions through identification of interactions between factors and subsequent experimental validation. We further illustrate the coupled use of predictive factor regression models with the high-dimensional sparse factor analysis of expression profiles. Our methodology is based on sparsity modelling of multivariate regression, ANOVA and latent factor models, and a general class of models that combines all of these components. Novel and effective sparsity priors address the inherent questions of dimension reduction and multiple comparisons, as well as scalability of the methodology. The models include practically relevant non-Gaussian/non-parametric components for modelling latent structure underlying the often quite complex non-Gaussianity in multivariate expression patterns related to underlying biology.
Model search and fitting are addressed through stochastic simulation and evolutionary stochastic search methods that are exemplified in oncogenic pathway studies. Supplementary supporting material provides more details of the applications as well as examples of the use of freely available software tools implementing the methodology.
See also the BFRM software link for downloads.