Many applications require the calculation of site-specific evolutionary rates from alignments of amino-acid sequences. For example, catalytic residues in enzymes and interface regions in protein complexes can be inferred from observed relative rates. While numerous approaches exist to calculate amino-acid rates, it is not entirely clear what physical quantities the inferred rates represent or how these rates relate to the underlying fitness landscape of the evolving protein. Further, amino-acid rates can be calculated in the context of different amino-acid exchangeability matrices, such as JTT, LG, or WAG, and again it is not known how the choice of matrix influences the physical interpretation of the inferred rates. Here, we develop a theory of measurement for site-specific evolutionary rates by analytically solving the maximum-likelihood equations for rate inference performed on sequences evolved under a mutation-selection model. We demonstrate that the measurement process can only recover the true expected rates of the mutation-selection model if rates are measured relative to a naïve exchangeability matrix, in which all exchangeabilities are equal to one. Rate measurements using other matrices are quantitatively close but not mathematically correct. Our results demonstrate that insights obtained from phylogenetic-tree inference do not necessarily apply to rate inference, and best practices for the former may be deleterious for the latter.
Expanded summary

Different sites in a protein evolve at different rates [1,2]. The heterogeneity in rates within a protein sequence is caused by the interplay of functional and structural constraints [3]. For instance, active sites are generally very conserved [4,5]. The protein core tends to be more conserved than the surface, presumably because mutations in the core are more likely to disturb the protein structure [6,7]. Because evolutionary rates reflect which sites are structurally and functionally important, a reliable and accurate method for inferring site-wise rates of evolution is essential. For example, in most viral populations, proteins evolve very rapidly. In influenza, mutations at one site of the surface protein hemagglutinin allow the virus to escape host antibodies and propagate. Thus, detecting site-wise rates of evolution can be crucial to efforts in viral surveillance and control.

Many methods to infer site-wise rates have been developed over the years. These methods employ a substitution matrix, which captures exchangeabilities between all pairs of amino acids. Substitution matrices are constructed by analyzing large protein data sets, and some are derived from protein sequences specific to an organism or an organelle. However, it remains an open question which substitution matrix is the most suitable for site-wise rate inference. When we measure a site-wise rate, the rate is defined as a scalar that multiplies the substitution matrix. The substitution matrix thus serves as a ruler by which we measure the rate of evolution at a site. Depending on what ruler or...