Abstract. Using a maximum-likelihood formalism, we have developed a method for reconstructing the sequences of ancestral proteins. Our approach allows the calculation not only of the most probable ancestral sequence but also of the probability of any amino acid at any given node in the evolutionary tree. Because we consider evolution at the amino acid level, we are better able to include the effects of evolutionary pressure and to take advantage of structural information about the protein through the use of mutation matrices that depend on secondary structure and surface accessibility. The computational complexity of this method scales linearly with the number of homologous proteins used to reconstruct the ancestral sequence.
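As a rough illustration of the kind of calculation this abstract describes, the sketch below computes the posterior probability of each residue at the root of a small tree for a single alignment column, using the standard pruning recursion. It is not the authors' implementation. The three-letter alphabet, the single transition matrix shared by all branches, the uniform prior, the nested-tuple tree encoding, and all function names are assumptions made for the example.

```python
# Minimal sketch of marginal ancestral reconstruction at the root of a tree.
# Everything here (alphabet, matrix, names) is a toy assumption, not the paper's model.

ALPHABET = ["A", "G", "V"]  # toy alphabet; a real model would use 20 amino acids


def transition_matrix(p_same):
    """Toy per-branch matrix: keep the residue with p_same, else change uniformly."""
    n = len(ALPHABET)
    p_other = (1.0 - p_same) / (n - 1)
    return [[p_same if i == j else p_other for j in range(n)] for i in range(n)]


def conditional_likelihoods(node, P):
    """Return L where L[i] = P(observed leaves below node | node has residue i)."""
    n = len(ALPHABET)
    if isinstance(node, str):  # leaf: the observed residue
        return [1.0 if a == node else 0.0 for a in ALPHABET]
    result = [1.0] * n
    for child in node:  # internal node: a tuple of child subtrees
        child_l = conditional_likelihoods(child, P)
        for i in range(n):
            result[i] *= sum(P[i][j] * child_l[j] for j in range(n))
    return result


def root_posterior(tree, P, prior=None):
    """Posterior probability of each residue at the root, given the leaves."""
    n = len(ALPHABET)
    prior = prior or [1.0 / n] * n  # assumed uniform prior over residues
    likelihoods = conditional_likelihoods(tree, P)
    joint = [prior[i] * likelihoods[i] for i in range(n)]
    total = sum(joint)
    return {a: joint[i] / total for i, a in enumerate(ALPHABET)}


# One alignment column observed in four homologous sequences.
example_tree = (("A", "A"), ("A", "G"))
example_posterior = root_posterior(example_tree, transition_matrix(0.8))
most_probable = max(example_posterior, key=example_posterior.get)
```

Each node is visited once, so for a fixed alphabet the cost of this recursion grows linearly with the number of leaves, which is consistent with the scaling stated in the abstract. A structure-dependent variant would pick a different matrix `P` for each site class (for example by secondary structure or surface accessibility).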
Substitution matrices are a key tool in important applications such as identifying sequence homologies, creating sequence alignments, and, more recently, using evolutionary patterns to predict protein structure. We have developed a novel approach to deriving these matrices that utilizes not only multiple sequence alignments but also the associated evolutionary trees. The key to our method is the use of a Bayesian formalism to calculate the probability that a given substitution matrix fits the tree structures and multiple sequence alignment data. Using this procedure, we can determine optimal substitution matrices for various local environments, depending on parameters such as secondary structure and surface accessibility.
HIV-1 subtype phylogeny is investigated using a previously developed computational model of natural amino acid site substitutions. This model, based on Boltzmann statistics and Metropolis kinetics, involves an order of magnitude fewer adjustable parameters than traditional substitution matrices and deals more effectively with the issue of protein site heterogeneity. When optimized for sequences of HIV-1 envelope (env) proteins from a few specific subtypes, our model is more likely to describe the evolutionary record for other subtypes than are methods using a single substitution matrix, even a matrix optimized over the same data. Pairwise distances are calculated between various probabilistic ancestral subtype sequences, and a distance matrix approach is used to find the optimal phylogenetic tree. Our results indicate that the relationships between subtypes B, C, and D and those between subtypes A and H may be closer than previously thought.
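To make "Boltzmann statistics and Metropolis kinetics" concrete, the sketch below shows one common way such a pair can define a substitution model: equilibrium residue frequencies follow a Boltzmann-like weight of a per-residue fitness, and substitution rates follow a Metropolis rule. This is a generic illustration and may differ from the model actually used in the paper. The fitness values are arbitrary toy numbers, and the function names and the attempt rate `nu` are assumptions.

```python
import math

# Generic sketch of a Boltzmann/Metropolis substitution model.
# The fitness values below are arbitrary toy numbers, not fitted parameters.


def boltzmann_frequencies(fitness):
    """Equilibrium frequency of each residue, proportional to exp(fitness)."""
    weights = {a: math.exp(f) for a, f in fitness.items()}
    z = sum(weights.values())  # normalizing constant (partition function)
    return {a: w / z for a, w in weights.items()}


def metropolis_rate(f_from, f_to, nu=1.0):
    """Metropolis rule: favorable moves occur at rate nu, unfavorable ones are damped."""
    if f_to >= f_from:
        return nu
    return nu * math.exp(f_to - f_from)


def rate_table(fitness, nu=1.0):
    """Substitution rate for every ordered pair of distinct residues."""
    return {
        (a, b): metropolis_rate(fitness[a], fitness[b], nu)
        for a in fitness
        for b in fitness
        if a != b
    }


toy_fitness = {"L": 1.2, "S": 0.3, "K": -0.5}  # hypothetical site-specific values
toy_freqs = boltzmann_frequencies(toy_fitness)
toy_rates = rate_table(toy_fitness)
```

With this construction the rates satisfy detailed balance with respect to the Boltzmann frequencies, so a full pairwise table is determined by one fitness value per residue plus a rate scale. When the fitness is itself expressed through a few physicochemical properties, the parameter count falls well below that of a general substitution matrix, which is the kind of reduction the abstract refers to.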
New computational models of natural site mutations are developed that account for the different selective pressures acting on different locations in the protein. The number of adjustable parameters is greatly reduced by basing the models on the underlying physical-chemical properties of the amino acids. This allows us to use our method on small data sets built of specific protein types. We demonstrate that with this approach we can represent the evolutionary patterns in HIV envelope proteins far better than with more traditional methods.