The rise in the number of protein sequences in the post-genomic era has led to major breakthroughs in fitting generative sequence models for contact prediction, protein design, alignment, and homology search. Despite this success, the interpretability of the modeled pairwise parameters remains limited because coevolution, phylogeny, and entropy are entangled. For contact prediction, post-correction methods have been developed to remove the contribution of entropy from the predicted contact maps. However, all remaining applications that rely on the raw parameters lack a direct method to correct for entropy. In this paper, we investigate the origins of the entropy signal and propose a new spectral regularizer to down-weight it during model fitting. We find that adding the regularizer to GREMLIN, a Markov Random Field or Potts model, allows for the inference of a sparse contact map without loss in precision, while improving interpretability and resolving overfitting issues important for sequence evaluation and design.
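To make the post-correction idea concrete, the sketch below applies the average product correction (APC), a commonly used correction for coevolution score matrices. Choosing APC is an assumption made for illustration: the abstract does not name the post-correction methods it refers to, and the function name `average_product_correction` is hypothetical. This is not the spectral regularizer proposed in the paper, which acts during model fitting rather than afterwards.

```python
import numpy as np


def average_product_correction(scores):
    """Subtract the rank-one background from a symmetric score matrix.

    Illustrative assumption: `scores` is an L x L matrix of pairwise
    coupling strengths (for example, norms of the pairwise parameters).
    The background for pair (i, j) is row_mean_i * row_mean_j / overall_mean,
    computed here from sums, which gives the same ratio.
    """
    s = np.asarray(scores, dtype=float)
    row_sums = s.sum(axis=1, keepdims=True)   # shape (L, 1)
    col_sums = s.sum(axis=0, keepdims=True)   # shape (1, L)
    background = row_sums * col_sums / s.sum()
    return s - background


# A purely rank-one matrix (per-position signal only) is removed entirely,
# while an extra pair-specific coupling survives the correction.
per_position = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
raw = np.outer(per_position, per_position)
raw[1, 4] += 10.0
raw[4, 1] += 10.0
corrected = average_product_correction(raw)
```

In this toy input the per-position vector stands in for a position-dependent background such as entropy. The correction only changes the derived score matrix, which is why applications that use the raw parameters directly are not helped by it.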
Recent methods have shown promise in using pairwise sequence coevolution predictions to illuminate physical interactions and functional relationships between pairs of protein residues. As a result, there has been increased interest in identifying higher-order correlations between sequence positions in an effort to further understand how the multiple sequence alignment (MSA) encodes the conserved biological properties of a protein family. To this end, we propose a robust and generalizable spectral clustering model that can extract interconnected networks of coevolving residues, termed “protein sectors”, using pairwise sequence coevolution predicted by a global statistical model. We assess the statistical and evolutionary origins of protein sectors extracted from the MSA for 120 protein families. We show that protein sectors are extracted from a subset of densely connected components in the sequence coevolution matrix, many of which are not present in the pairwise residue contact graph constructed from the protein crystal structure. This reveals the existence of networks of paired residues that are not necessarily in direct physical contact but are nonetheless evolutionarily coupled. We found that protein sectors form structurally connected entities in three-dimensional space, despite sector identification being independent of protein crystal structure. Interestingly, protein families with high structural similarity do not share similar protein sectors, suggesting that nuances in the sequence coevolution matrix can differentiate between the evolutionary histories of structurally related protein families.
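As a rough illustration of how spectral methods can pull densely connected groups out of a coevolution matrix, the sketch below splits positions into two groups using the second eigenvector of a normalized graph Laplacian. It is a generic two-way spectral partition under stated assumptions, not the clustering model proposed in the paper: the function name `spectral_bipartition`, the use of the symmetric normalized Laplacian, and the toy coupling matrix are all hypothetical choices.

```python
import numpy as np


def spectral_bipartition(weights):
    """Split nodes into two groups from a symmetric, non-negative weight matrix.

    Illustrative assumption: `weights[i, j]` is a coevolution strength
    between positions i and j, and the graph is connected.
    """
    w = np.asarray(weights, dtype=float)
    w = (w + w.T) / 2.0                # enforce symmetry
    np.fill_diagonal(w, 0.0)           # ignore self-coupling
    degree = w.sum(axis=1)
    inv_sqrt_degree = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    laplacian = np.eye(len(w)) - inv_sqrt_degree[:, None] * w * inv_sqrt_degree[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the
    # eigenvector of the second-smallest eigenvalue (the Fiedler vector).
    _, eigenvectors = np.linalg.eigh(laplacian)
    fiedler = eigenvectors[:, 1]
    return fiedler >= 0.0              # boolean group label per position


# Toy coupling matrix: two tightly coupled groups of three positions,
# with weak coupling between the groups.
toy = np.full((6, 6), 0.01)
toy[:3, :3] = 1.0
toy[3:, 3:] = 1.0
labels = spectral_bipartition(toy)
```

Which group receives `True` is arbitrary, because the sign of an eigenvector is not determined. Extracting more than two groups would typically use several eigenvectors followed by a clustering step, which this sketch leaves out.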