2021
DOI: 10.1101/2021.12.03.470944
Preprint

Protein length distribution is remarkably consistent across Life

Abstract: In every living species, the function of a protein depends on its organisation of structural domains, and the length of a protein is a direct reflection of this. Because every species evolved under different evolutionary pressures, the protein length distribution, much like other genomic features, is expected to vary across species. Here we evaluated this diversity by comparing protein length distribution across 2,326 species (1,688 bacteria, 153 archaea and 485 eukaryotes). We found that proteins tend to be o…

Cited by 8 publications
(7 citation statements)
references
References 60 publications
“…While natural proteins often span between 200 to 400 residues [33], ESMtherm is fine-tuned on sequences no longer than 72 residues in length. To explore its performance under this limitation, we benchmarked our model against six stability-related datasets on larger proteins and compared our results with state-of-the-art covering different methodologies.…”
Section: Results
Citation type: mentioning
confidence: 99%
“…Our findings from mammalian data strongly point to longer insertion lengths than deletion lengths. Further, given the higher prevalence of deletions and the remarkable uniformity of protein length distribution across the tree of life (Nevers et al, 2023), it is conceivable that the two distributions differ, with deletions lengths having a smaller mode than insertions. Recent work from Tal Pupko’s lab is a notable step in the direction of inferring indel length distributions based on event reconstruction (Wygoda et al, 2024).…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Once these regions were identified for the FST and ZHp analyses, we extracted all protein‐coding genes located within the genomic segment encompassing these regions plus 50 kb upstream and downstream from the mid‐position marker (where a region comprised a single marker) or from the flanking markers of the region (where a region comprised multiple markers). The selection of 50 kb was based on the known distribution of gene lengths in the O. niloticus genome, shown to be <50 kb for over 90% of genes (Nevers et al, 2021). There was no justification for extending the candidate regions beyond this size to account for linkage disequilibrium (LD), as often done for terrestrial domesticated species, as LD has been shown to be very low in Nile tilapia (Peñaloza et al, 2020).…”
Section: Methods
Citation type: mentioning
confidence: 99%