2024
DOI: 10.1101/2024.02.14.580341
Preprint

Decoding functional proteome information in model organisms using protein language models

Israel Barrios-Núñez,
Gemma I. Martínez-Redondo,
Patricia Medina-Burgos
et al.

Abstract: The gap between sequence information and experimental determination of function in proteins is currently unsolvable. The computational and automatic implementation of function prediction is not significantly improving function assignment, so there is a pressing need to find alternative methods, since standard approaches are not able to bridge the gap. Modern machine learning methods have been recently developed to predict function, being deep neural convolutional networks a popular choice. Protein language mod…

Cited by 2 publications (6 citation statements)
References: 49 publications
“…This is why faster sequence-based functional inference is a good alternative for research questions involving a high number of species (proteomes) with diverse types of proteins. In a previous study, we assessed the performance of several AI-based functional annotation methods when inferring functions in model organisms and found that protein language models provided a good alternative in terms of coverage, informativeness, and precision when compared to deep learning-based and profile-based tools [17]. Here, we extended that comparison to non-model organisms across the animal kingdom and assessed how AI-based methods behaved when compared to the widely used homology-based eggNOG-mapper.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…For the following analyses, we decided to proceed with the unfiltered annotated set of GOs inferred by GOPredSim-ProtT5. This decision was based on 4 factors: (1) reliability index thresholds were estimated based on a very reduced dataset strongly biased towards model organisms, and their application may impact considerably the amount of information we retrieve for non-model species, (2) reducing the reliability index values in model organisms by removing those species from the lookup dataset did not change the functional annotation, (3) information content of GO terms predicted by GOPredSim-ProtT5 is independent of their reliability index [17], and (4) we identified genes with GO terms with low reliability index aligning well with the results of GOPredSim-ProtT5 and showing experimentally confirmed functions. Firstly, the human pseudogene LOFF_G0001716 (or AL627230.1), found to be differentially expressed in a bone mechanical response experiment [20], was inferred to have a role in wound healing after annotation with GOPredSim-ProtT5 (GO:0042060), a term related to angiogenesis, a prerequisite for bone formation.…”
Section: GOPredSim With the Protein Language Model ProtT5 Annotates V... (citation type: mentioning)
confidence: 99%
“…To help alleviate these issues, we previously published the Metazoan Assemblies from Transcriptomic Ensembles database (MATEdb) containing high-quality transcriptome assemblies for 335 arthropods and mollusks (Fernández et al, 2022). Here, we present its second version, MATEdb2, that differs from the previous one in three main aspects: (1) we have increased the taxonomic sampling to all animal phyla with high-quality data publicly available, and provide the first transcriptomic sequences for some animal taxa; (2) we include a standardized pipeline for obtaining the protein sequences from GFF genomic files instead of adding the precomputed protein files publicly available, making it easier to replicate and combine with the associated genomic sequence; (3) we provide the functional annotation of all proteins using a language-based new methodology that outperforms traditional methods (Barrios-Núñez et al, 2024). We hope that this newer version of MATEdb accelerates research on animal evolution by providing a wider taxonomic resource of high-quality proteomes across the Animal Tree of Life.…”
Section: Introduction (citation type: mentioning)
confidence: 99%