DNA MICROARRAY DATA ANALYSIS AND REGRESSION MODELING FOR GENETIC EXPRESSION PROFILING

Mike West, Joseph R Nevins, Jeffrey R Marks, Rainer Spang & Harry Zuzan

Duke University

May 2000, revised August 2000

We report on our studies in large-scale gene expression profiling using DNA microarray data. The problem of molecular phenotyping -- linking observed genetic expression profiles to identified physiological or clinical states and outcomes -- is of critical importance for improved understanding of disease progression and for improved therapies. Our main application here is in breast cancer, where interest lies in identifying characteristics of genetic expression, possibly involving very many genes, that are useful in discriminating between cancer states. In this applied setting, we frame the problem as one of predictive discrimination, and approach its solution in a binary regression modeling framework. Breast cancers are classified clinically or histologically into one of two outcome groups, and the analysis aims to produce binary regression models for outcomes based on genetic expression data measured via high-density DNA microarrays using RNA extracted from the tumors. The formal statistical problem is ill-posed, since tumor sample sizes are substantially smaller than the number of available and potentially interesting explanatory variables -- i.e., we are in the "large p, small n" paradigm. We address this in a Bayesian framework using singular-value regression ideas and novel classes of informative and structured prior distributions for the very high-dimensional regression parameters. The singular-value decomposition analysis of design matrices of expression measures is also valuable in exploratory analysis of large-scale expression array data sets. In the context of our breast cancer study we discuss the methodology, the implementation of our Bayesian analysis, aspects of model specification, aspects of posterior and predictive analysis, and model assessment and validation issues.
We develop detailed studies of the breast data in connection with estrogen receptor status as the defined outcome of interest, and highlight significant scientific findings as well as aspects of predictive discrimination performance using the regression approach. We also discuss similar issues in a benchmark leukemia study.
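
As a rough illustration of how singular-value regression can make the "large p, small n" problem tractable, the following sketch writes out a generic reduction. The notation (X, U, D, V, F, gamma), the row/column orientation of X, and the probit link are assumptions chosen for illustration; they are not taken from the manuscript, which may use different conventions and priors.

```latex
% Sketch under assumed notation: X is the n x p matrix of expression
% measures (n tumors, p genes, n << p), with rank r <= n.
\begin{align*}
  X &= U D V^{\top}, \qquad
      U^{\top}U = V^{\top}V = I_r, \quad
      D = \mathrm{diag}(d_1,\dots,d_r),\ d_j > 0, \\
  X\beta &= U D \,(V^{\top}\beta) = F\gamma, \qquad
      F = U D \ (n \times r), \quad \gamma = V^{\top}\beta \ (r \times 1), \\
  \Pr(y_i = 1 \mid \gamma) &= \Phi\!\left(f_i^{\top}\gamma\right)
      \qquad \text{(probit link assumed for illustration)}, \\
  \beta &= V\gamma \qquad \text{(minimum-norm solution in the row space of } X\text{)}.
\end{align*}
```

Under these assumptions the p-dimensional regression on genes reduces to an r-dimensional regression on the factors f_i (the rows of F), with r no larger than the number of tumors, and a prior placed on gamma induces a prior on beta through the last line.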

Keywords: Bayesian bioinformatics, Bayesian regression analysis, binary regression, breast cancer, estrogen receptor status, gene expression profiles, DNA microarrays, molecular phenotyping, singular value decompositions


The manuscript is available in PostScript and PDF formats.