DNA MICROARRAY DATA ANALYSIS AND REGRESSION MODELING FOR GENETIC EXPRESSION PROFILING
Mike West, Joseph R Nevins, Jeffrey R Marks, Rainer Spang & Harry Zuzan
Duke University
May 2000, revised August 2000
We report on our studies in large-scale gene expression profiling using DNA microarray
data. The problem of molecular phenotyping -- linking observed genetic expression
profiles to identified physiological or clinical states and outcomes -- is one of
critical importance for improved understanding of disease progression and for
improved therapies. Our main application here is in breast cancer, where interest
lies in identifying characteristics of genetic expression, involving possibly very
many genes, that are useful in predictive discrimination between cancer states.
In this applied setting, we frame the problem as one of {\sl predictive
discrimination}, and approach its solution in a binary regression modeling
framework. Breast cancers are classified, clinically or histologically, into one
of two outcome groups, and analysis aims to produce binary regression models for
outcomes based on observed genetic expression data measured via high-density DNA
microarrays using RNA extracted from the tumors. The formal
statistical problem is ill-posed, since tumor sample sizes are substantially
smaller than the number of available and potentially interesting explanatory
variables -- i.e., we are in the Large p, Small n paradigm. We address
this in a Bayesian framework using singular-value regression ideas and
novel classes of informative and structured prior distributions for the
very high-dimensional regression parameters. The singular-value decomposition
analysis of design matrices of expression measures is also valuable in exploratory
analysis of large-scale expression array data sets. In the context of our breast
cancer study we discuss the methodology, implementation of our Bayesian analysis,
aspects of model specification, aspects of posterior and predictive analysis,
and model assessment and validation issues.
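As a rough illustration of the singular-value regression idea in the Large p, Small n setting, the following minimal sketch reduces a design matrix with far more columns (genes) than rows (tumors) to at most n factors and fits a binary regression on those factors. It is a sketch under stated assumptions, not the analysis reported in the manuscript: the data are synthetic, the names `svd_reduce` and `fit_ridge_logistic` are hypothetical, and a logistic model with a simple independent Gaussian prior (a ridge penalty), fitted by Newton's method, stands in for the structured priors and full posterior analysis described above.

```python
import numpy as np

def svd_reduce(X):
    """Return factors Z (n x k) and loadings Vt (k x p) such that X = Z @ Vt."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U * s, Vt

def fit_ridge_logistic(Z, y, penalty=1.0, n_iter=50):
    """Posterior mode of a logistic regression on Z with an independent
    N(0, 1/penalty) prior on each coefficient, found by Newton's method.
    (Assumption: a stand-in for the priors discussed in the text.)"""
    gamma = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(Z @ gamma)))
        grad = Z.T @ (y - prob) - penalty * gamma          # gradient of log posterior
        weights = prob * (1.0 - prob)
        hess = (Z.T * weights) @ Z + penalty * np.eye(Z.shape[1])  # negative Hessian
        gamma = gamma + np.linalg.solve(hess, grad)
    return gamma

# Synthetic stand-in data: 20 samples, 500 expression measures (hypothetical sizes).
rng = np.random.default_rng(0)
n, p = 20, 500
X = rng.standard_normal((n, p))
y = (np.arange(n) % 2).astype(float)   # synthetic binary outcome labels

Z, Vt = svd_reduce(X)                  # the p-dimensional problem becomes n-dimensional
gamma = fit_ridge_logistic(Z, y)       # coefficients on the n factors
beta = Vt.T @ gamma                    # implied coefficients on all p original variables
```

The point of the design is that `X @ beta` equals `Z @ gamma`, so the fit on the n factors determines a coefficient vector on all p original variables without ever solving a p-dimensional problem directly.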
We develop detailed studies of the breast data in connection with estrogen receptor
status as the defined outcome of interest, and highlight significant scientific
findings as well as aspects of predictive discrimination performance using the
regression approach. We also discuss similar issues in a benchmark leukemia study.
Keywords: Bayesian bioinformatics, Bayesian regression analysis, binary regression,
breast cancer, estrogen receptor status, gene expression profiles, DNA microarrays,
molecular phenotyping, singular value decompositions
The manuscript is available in PostScript and
PDF formats.