Original: April 2008, Final version: July 2009.
To appear in Frontiers of Statistical Decision Making and Bayesian Analysis (Springer, 2010).
We present Bayesian models and computational methods for the problem of matching predictions from molecular studies with biological databases of reference gene sets related to known pathways: the problem of pathway annotation of summary results of an experiment or observational study. In areas such as cancer genomics, linking quantified, experimentally defined gene expression signatures with known biological pathway gene sets is essential to improving the understanding of the complexity of molecular pathways related to outcome. Our new models address this key challenge. Our focus and examples are on studies using gene expression microarrays, though the theory and methods are quite general. Our models for probabilistic pathway annotation (PROPA) analysis address the problem formally and statistically, and deliver probabilities over pathways for any experimental signature. This allows quantitative assessment and ranking of pathways putatively linked to an experimental or observational phenotype. The models integrate qualitative biological information into the analysis and generate coherent inference on uncertainties about gene pathway membership that can inform the revision of pathway databases.
Our analysis relies on simulation-based computation in high-dimensional models, and introduces a novel extension of variational methods for computation of model evidence, or marginal likelihood functions, that are central to the comparison of multiple biological pathways. Examples highlight the methodology using both simulated and real data, the latter involving the ER and ErbB2/Her2-neu hormonal pathways in breast cancer.
Keywords: biological pathway analysis, cancer genomics, factor regression models, gene expression signatures, gene set enrichment, marginal likelihood computation, Monte Carlo variational approximation, sparse factor analysis.
See also the PROPA software link for software downloads
We are grateful to Ashley Chi, Joe Lucas and Chunlin Ji of Duke University for discussions and important input, and to Quanli Wang for computational contributions. We acknowledge support of the National Science Foundation (grants DMS-0102227 and DMS-0342172) and the National Institutes of Health (NCI U54-CA-112952-01). Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.