Discriminative Variable Subsets in Bayesian Classification with Mixture Models

Lynn Lin, Cliburn Chan and Mike West

The current manuscript version represents a major revision completed in July 2013.

We discuss the evaluation of subsets of variables for the discriminative evidence they provide in multivariate mixture modeling for classification. A novel development of Bayesian classification analysis uses a natural measure of concordance between mixture component densities and defines an effective, computationally feasible method for assessing and prioritizing subsets of variables according to their roles in discriminating one or more mixture components. We relate the new discriminative information measures to Bayesian classification probabilities and error rates, and exemplify their use in Bayesian analysis of Dirichlet process mixture models fitted via Markov chain Monte Carlo methods as well as via a novel Bayesian expectation-maximization algorithm. We present a series of theoretical and simulated data examples to fix concepts and exhibit the utility of the approach, and we compare it with prior approaches. We demonstrate its application to automatic classification and discriminative variable selection in high-throughput systems biology using large flow cytometry data sets.

Keywords: Bayesian expectation-maximization, Bayesian mixture models, Classification error rates, Concordance of densities, Dirichlet process mixtures, Discriminative information measure, Discriminative threshold probabilities, Flow cytometry data, Non-Gaussian component mixtures, Variable subset selection


The authors are grateful to David Murdoch, Janet Staats and Kent Weinhold for providing the flow cytometry data set and for discussions of the biological context of that motivating study (Section 5). Aspects of this research were supported by grants from the U.S. National Institutes of Health (P50-GM081883 and RC1-AI086032) and the National Science Foundation (DMS-1106516). Any opinions, findings and conclusions or recommendations expressed in this work are those of the authors and do not necessarily reflect the views of the NIH or NSF.