2024
DOI: 10.1101/2024.02.28.582465
Preprint

Illuminating the functional landscape of the dark proteome across the Animal Tree of Life through natural language processing models

Gemma I. Martínez-Redondo,
Israel Barrios-Núñez,
Marçal Vázquez-Valls
et al.

Abstract: Understanding how coding genes and their functions evolve over time is a key aspect of evolutionary biology. Protein coding genes poorly understood or characterized at the functional level may be related to important evolutionary innovations, potentially leading to incomplete or inaccurate models of evolutionary change, and limiting the ability to identify conserved or lineage-specific features. Homology-based methodologies often fail to transfer functional annotations in a large fraction of the coding gene re…

Citations: cited by 6 publications (5 citation statements)
References: 77 publications
“…These results may reflect the focus of different scientific communities on particular genes or gene families, with different groups of genes having been studied more intensely, while others remain relatively overlooked ( 2 ). Overall, the LMs (GOPredSim-T5 and GOPredSim-SeqVec) provided a larger and more consistent coverage of the predicted functions for proteins across organisms, regardless of their annotation stage, making them promising tools to address functional studies in non-model organisms as already indicated in the animal phyla ( 40 ).…”
Section: Results | Citation type: mentioning
confidence: 93%
“…Our study demonstrates that LMs may be particularly useful to examine poorly annotated genomes, as exemplified here in C. elegans , both at the genome level and in terms of the functional retrieval of DEGs. We also performed large-scale testing of LMs, in particular applying the GOPredSim-T5 annotation of 2.4 million to genes in various non-model organisms in the animal phyla, where we found interesting biological insights [see ( 40 )].…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Our study demonstrates that LMs may be particularly useful to examine poorly annotated genomes, as exemplified here in C. elegans, both at the genome level and in terms of the functional retrieval of DEGs. We also performed large-scale testing of LMs, in particular applying the GOPredSim-T5 annotation of 2.4 million to genes in various non-model organisms in the animal Phyla, where we found interesting biological insights (see 40).…”
Section: Discussion | Citation type: mentioning
confidence: 99%
“…Gene repertoire evolutionary dynamics across nodes were estimated with pyHam 83 and curated with custom scripts following the workflow available at the GitHub repository (https://github.com/MetazoaPhylogenomicsLab/Vargas-Chavez_et_al_2024_Chromosome_shattering_clitellate_origins). Longest isoforms in protein mode were functionally annotated with both homology-based methods (eggNOG-mapper v2 84 ) and natural language processing ones (FANTASIA 85 ). Using the Gene Ontology (GO) annotations inferred with FANTASIA, GO enrichment analysis of genes loss in each internode were calculated using the Fisher test and the elim algorithm as implemented in the topGO package 86 .…”
Section: Gene Repertoire Evolutionary Dynamics | Citation type: mentioning
confidence: 99%