August 2000
In a simple model, the evolution of proteins can be viewed as the accumulation of amino acid substitutions. Amino acids with similar chemical and physical properties replace each other more often than dissimilar ones. M. Dayhoff et al. (Atlas of Protein Sequences and Structure, 1978, 5, 345–352) suggested inverting this relation and describing the similarity of amino acids by their replacement frequencies. The problem with this statistical concept, however, is that replacement frequencies depend on the degree of divergence between the sequences. A dynamical model is needed that describes protein evolution on a time scale.
This can be done by using a Markov chain to model the evolution of each position in the protein. The Markov chain has to be estimated from aligned sequences. Inhomogeneity in the degree of divergence within the alignment data is the major obstacle to this goal.
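To illustrate why replacement frequencies depend on divergence, the following minimal sketch assumes a continuous-time Markov chain with a rate matrix Q, so that the substitution probabilities after time t are P(t) = exp(tQ). The three-state matrix Q and the function name transition_matrix are hypothetical and chosen only for illustration; a real amino acid model has 20 states, and this is not the estimation procedure of any of the three methods compared here.

```python
import numpy as np

# Hypothetical toy rate matrix on 3 states (illustration only).
# Off-diagonal entries are substitution rates; each row sums to zero.
Q = np.array([
    [-0.3,  0.2,  0.1],
    [ 0.2, -0.3,  0.1],
    [ 0.1,  0.1, -0.2],
])

def transition_matrix(Q, t, terms=30):
    """Approximate P(t) = exp(t * Q) by scaling and squaring
    combined with a truncated Taylor series."""
    A = np.asarray(Q, dtype=float) * t
    norm = np.linalg.norm(A, ord=np.inf)
    # Halve A repeatedly until its norm is below 1/2, then square back.
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    A = A / (2 ** squarings)
    P = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms):
        term = term @ A / k      # term now holds A^k / k!
        P = P + term
    for _ in range(squarings):
        P = P @ P
    return P

# The same chain yields different replacement probabilities at
# different degrees of divergence (different values of t).
for t in (0.1, 1.0, 5.0):
    P = transition_matrix(Q, t)
    print(f"t = {t}: probability of no net change per state =",
          np.round(np.diag(P), 3))
```

Replacement counts taken from closely related and from distant sequence pairs therefore correspond to different matrices P(t) of one underlying chain, which is why pooling them without regard to t is problematic.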
In this paper we compare three approaches to this estimation problem: first, the original method by M. Dayhoff; second, the resolvent method (Mueller & Vingron, Journal of Computational Biology, 2000); and third, a maximum likelihood approach. We briefly review all three procedures and evaluate their performance by means of extensive simulations. The focus is on the ability of the methods to accurately recover the Markov chain underlying the simulations.
Both the maximum likelihood method and the resolvent method clearly outperform Dayhoff's method. Maximum likelihood is the method of choice for small sets of input data. The resolvent method is computationally much less demanding and performs only slightly worse than maximum likelihood. It is therefore well suited to large-scale applications.
Keywords: amino acid substitution models, molecular evolution, bioinformatics
The manuscript is available in PostScript and PDF formats.