Paper to appear in the Journal of the American Statistical Association, 2007
Model search in regression with very large numbers of candidate predictors raises challenges for both model specification and computation. When faced with hundreds or thousands of potential predictor variables, standard approaches such as Markov chain Monte Carlo (MCMC) and step-wise variable selection are often infeasible or simply ineffective. New methods of searching for ``interesting'' regions of the resulting very high-dimensional model spaces should involve strategies that explore local collinearity structure and also quickly identify regions of high posterior probability. We describe a shotgun stochastic search (SSS) approach with such goals. The development includes both algorithmic and modelling aspects, with discussion of priors over the model space that induce sparsity and parsimony over and above the traditional dimension penalisation implicit in Bayesian and likelihood analyses. An increasing focus on model constraints, whether through sparsity-inducing priors or other devices, is emerging as key as dimension scales in applications such as those arising in genomics. Our method takes advantage of parallel computation using cluster computers, and we present an example arising from a gene expression survival study in brain cancer. We also evaluate theoretical and simulation-based aspects of the performance characteristics of SSS in large-scale regression model search, and provide links to a computer code archive for access to the software implementing the methods.
Key Words: Bayesian model averaging, gene expression, parallel computing, predictive modelling, regression model uncertainty, stochastic search, variable selection
Research partially supported by grants from the National Science Foundation and National Institutes of Health. Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NSF or NIH.