AlphaFold2 (AF2) has created a breakthrough in biology by providing three-dimensional structure models for whole-proteome sequences, with unprecedented levels of accuracy. In addition, the AF2 pLDDT score, related to the model confidence, has been shown to provide a good measure of residue-wise disorder. Here, we combined AF2 predictions with pyHCA, a tool we previously developed to identify foldable segments and estimate their order/disorder ratio, from a single protein sequence. We focused our analysis on the AF2 predictions available for 21 reference proteomes (AFDB v1), in particular on their long foldable segments (>30 amino acids) that exhibit characteristics of soluble domains, as estimated by pyHCA. Among these segments, we provided a global analysis of those with very low pLDDT values along their entire length and compared their characteristics to those of segments with very high pLDDT values. We highlighted cases containing conditional order, as well as cases that could form well-folded structures but escape the AF2 prediction due to a shallow multiple sequence alignment and/or undocumented structure or fold. AF2 and pyHCA can therefore be advantageously combined to unravel cryptic structural features in whole proteomes and to refine predictions for different flavors of disorder.
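The segment selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' pipeline. The thresholds of 50 and 90 follow the "very low" and "very high" pLDDT bands shown in the AlphaFold database, the requirement that every residue fall in the band is one possible reading of "along their entire length", and `Segment` and `classify_segment` are hypothetical names. Segment boundaries are taken as given (for example, from pyHCA output parsed elsewhere).

```python
from dataclasses import dataclass
from typing import Sequence

VERY_LOW = 50.0   # assumption: upper bound of the AFDB "very low" confidence band
VERY_HIGH = 90.0  # assumption: lower bound of the AFDB "very high" confidence band
MIN_LENGTH = 30   # the text keeps foldable segments longer than 30 amino acids


@dataclass
class Segment:
    """Hypothetical container for a foldable segment (0-based, end exclusive)."""
    start: int
    end: int


def classify_segment(plddt: Sequence[float], seg: Segment) -> str:
    """Label a segment by the pLDDT values of all of its residues."""
    values = plddt[seg.start:seg.end]
    if len(values) <= MIN_LENGTH:
        return "too_short"
    if all(v < VERY_LOW for v in values):
        return "very_low"
    if all(v > VERY_HIGH for v in values):
        return "very_high"
    return "mixed"
```

A mean-based criterion would be an equally plausible reading; swapping `all(...)` for a comparison on the segment average changes only two lines.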
Order and disorder govern protein functions, but disorder is highly diverse, ranging from regions that are, and stay, fully disordered to regions that display conditional order. This diversity remains difficult to decipher even though it is encoded in amino acid sequences. Here, we developed an analytic Python package, named pyHCA, to estimate the foldability of a protein segment from its amino acid sequence alone, based on a measure of its density in regular secondary structures associated with hydrophobic clusters, as defined by the hydrophobic cluster analysis (HCA) approach. The tool was designed by optimizing the separation between foldable segments from databases of disorder (DisProt) and order (SCOPe [soluble domains] and OPM [transmembrane domains]). It makes it possible to specify the ratio between order, embodied by regular secondary structures (either participating in the hydrophobic core of well-folded 3D structures or conditionally formed in intrinsically disordered regions), and disorder. We illustrated the relevance of pyHCA with several examples and applied it to the proteomes of 21 species, ranging from prokaryotes and archaea to unicellular and multicellular eukaryotes, for which structure models are provided in the AlphaFold protein structure database. Cases in which low confidence scores reflect disorder were distinguished from cases in which sequences that we identified as foldable are still not accurately modeled by AlphaFold2, owing to a lack of sequence homologs or to compositional biases. Overall, our approach is complementary to AlphaFold2, providing guides to map structural innovations through evolutionary processes, at proteome and gene scales.
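To make the notion of hydrophobic clusters more concrete, here is a deliberately simplified, one-dimensional sketch. It is not pyHCA's algorithm or scoring, and it does not use the pyHCA API. It assumes the conventions commonly described for HCA: seven strong hydrophobic residues (V, I, L, F, M, Y, W), proline as a cluster breaker, and a connectivity distance of four residues. The function names `hydrophobic_clusters` and `cluster_fraction` are hypothetical, and the returned fraction is a toy quantity rather than the foldability measure described above.

```python
from typing import List, Tuple

HYDROPHOBIC = set("VILFMYW")  # assumption: the strong hydrophobic alphabet of HCA
BREAKER = "P"                 # assumption: proline always ends a cluster
CONNECT = 4                   # assumption: 4 or more non-hydrophobic residues split clusters


def hydrophobic_clusters(seq: str) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, bounded by hydrophobic residues."""
    clusters: List[Tuple[int, int]] = []
    start = last = None
    for i, aa in enumerate(seq.upper()):
        if aa in HYDROPHOBIC:
            if start is None:
                start = i
            elif i - last > CONNECT:  # gap of at least CONNECT residues
                clusters.append((start, last + 1))
                start = i
            last = i
        elif aa == BREAKER and start is not None:
            clusters.append((start, last + 1))
            start = last = None
    if start is not None:
        clusters.append((start, last + 1))
    return clusters


def cluster_fraction(seq: str) -> float:
    """Fraction of positions covered by clusters (a toy density, not pyHCA's score)."""
    if not seq:
        return 0.0
    covered = sum(end - start for start, end in hydrophobic_clusters(seq))
    return covered / len(seq)
```

The real HCA approach works on a two-dimensional helical representation of the sequence, so cluster shapes, and not only their extent along the chain, carry information that this sketch ignores.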