Preprint | 2023
DOI: 10.1101/2023.01.11.523679

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

Abstract: Closing the gap between measurable genetic information and observable traits is a longstanding challenge in genomics. Yet, the prediction of molecular phenotypes from DNA sequences alone remains limited and inaccurate, often driven by the scarcity of annotated data and the inability to transfer learnings between prediction tasks. Here, we present an extensive study of foundation models pre-trained on DNA sequences, named the Nucleotide Transformer, integrating information from 3,202 diverse human genomes, as w…
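As a concrete illustration of how a pre-trained DNA foundation model like this can be queried, the sketch below extracts per-sequence embeddings through the Hugging Face transformers API. The checkpoint identifier and the mean-pooling recipe are assumptions for illustration, not details taken from the preprint.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint name (based on InstaDeepAI's public release naming);
# swap in whichever Nucleotide Transformer checkpoint you actually use.
checkpoint = "InstaDeepAI/nucleotide-transformer-500m-human-ref"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)
model.eval()

sequence = "ATGGCGTACGATCGATCGATCGATCGATCG"  # toy DNA sequence
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Mean-pool the final hidden layer over tokens to obtain one embedding
# per sequence, a common recipe for downstream probing tasks.
embedding = outputs.hidden_states[-1].mean(dim=1)
print(embedding.shape)  # (1, hidden_dim)

Mean pooling is only one of several aggregation choices; the paper's own probing setup may differ.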

Cited by 99 publications (182 citation statements)
References 71 publications
“…Parallel to our work, another study was released which also trains on genomes of multiple species and focuses on human for downstream tasks (Dalla-Torre et al., 2023). The proposed model is species-agnostic but more than one thousand times larger than ours.…”
Section: Discussion (mentioning)
Confidence: 99%
“…This is where the integration of the protein language model ProtTrans has significantly enhanced the utility and effectiveness of the reelGene framework and opened up new avenues for protein functionality research. DNA language models are rapidly advancing, but currently, they are mostly pre-trained on human [54] and vertebrate genomes [55]. Models pre-trained on plant genomes may improve our mRNA and junction boundary models.…”
Section: Discussion (mentioning)
Confidence: 99%
“…Although effective, these architectures have been outperformed by transformer architectures, as evidenced by their adoption in both the computer vision and natural language processing fields [15,16,17]. To address these limitations and exploit the transformer architecture, large language models (LLMs) have gained considerable popularity in the field of biology [18,19,20,21,22,23,24,25,26,27], offering the ability to be trained on unlabeled data and generate general-purpose representations capable of solving specific tasks. Furthermore, LLMs overcome a current limitation of other deep learning-based models, as they are not reliant on single reference genomes, which often provide an incomplete and biased depiction of genomic diversity from a limited number of individuals.…”
Section: Introduction (mentioning)
Confidence: 99%
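The excerpt above highlights the property that lets these models learn from unlabeled genomes: a masked-language-modelling objective, in which a fraction of k-mer tokens is hidden and the model is trained to reconstruct them. The toy sketch below shows how that objective is wired up; the 15% masking rate, the tiny vocabulary, and the random logits are illustrative assumptions, not the Nucleotide Transformer's exact recipe.

import torch
import torch.nn.functional as F

# Toy k-mer vocabulary; real models use all 4^k k-mers plus special tokens.
VOCAB = {"[PAD]": 0, "[MASK]": 1, "AAA": 2, "ACG": 3, "TTT": 4, "GCA": 5}
MASK_ID, MASK_PROB = VOCAB["[MASK]"], 0.15

def mask_tokens(token_ids: torch.Tensor):
    """Randomly mask tokens; labels are -100 (ignored) except at masked positions."""
    labels = token_ids.clone()
    mask = torch.rand(token_ids.shape) < MASK_PROB
    if not mask.any():
        mask[..., 0] = True  # guarantee at least one masked position
    labels[~mask] = -100     # loss is computed only where tokens were masked
    corrupted = token_ids.clone()
    corrupted[mask] = MASK_ID
    return corrupted, labels

# Toy batch of k-mer token IDs.
batch = torch.tensor([[2, 3, 4, 5, 3, 2]])
corrupted, labels = mask_tokens(batch)

# `logits` would come from the transformer; random here only to show
# how the cross-entropy objective is wired up.
logits = torch.randn(1, batch.shape[1], len(VOCAB))
loss = F.cross_entropy(logits.view(-1, len(VOCAB)), labels.view(-1), ignore_index=-100)
print(loss.item())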
“…Furthermore, LLMs overcome a current limitation of other deep learning-based models, as they are not reliant on single reference genomes, which often provide an incomplete and biased depiction of genomic diversity from a limited number of individuals. LLMs can leverage multiple reference genomes, including those from genetically distant species, thereby increasing overall diversity, which has been shown to significantly enhance prediction performance [24]. This diversity is particularly relevant in plant species due to the structural complexities of their genomes, which hinder accurate mapping of polymorphisms across whole-genome alignments.…”
Section: Introduction (mentioning)
Confidence: 99%