2022
DOI: 10.1093/bib/bbac204
ViBE: a hierarchical BERT model to identify eukaryotic viruses using metagenome sequencing data

Abstract: Viruses are ubiquitous in humans and various environments and continually mutate themselves. Identifying viruses in an environment without cultivation is challenging; however, promoting the screening of novel viruses and expanding the knowledge of viral space is essential. Homology-based methods that identify viruses using known viral genomes rely on sequence alignments, making it difficult to capture remote homologs of the known viruses. To accurately capture viral signals from metagenomic samples, models are…

Cited by 16 publications (12 citation statements)
References 38 publications
“…We set up a masked language modeling task [19], in which 15% of the tokens in a nucleotide sequence were masked and had to be predicted from their context. In contrast with most DNA language models that tokenize sequences into overlapping k-mers [24, 26, 28] or use byte-pair encoding [23], we used bare nucleotides as tokens. While a thorough benchmark of different tokenization strategies is lacking, using single-nucleotide tokens makes interpretation easier, in particular for zero-shot variant effect prediction.…”
Section: Methods
confidence: 99%
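The masking scheme described above — selecting 15% of single-nucleotide tokens and asking the model to predict them from context — can be sketched as follows. This is an illustrative outline, not the cited paper's implementation; the function name `mask_nucleotides` and the `[MASK]` placeholder are assumptions.

```python
import random

def mask_nucleotides(seq, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Mask a fraction of single-nucleotide tokens for an MLM objective.

    Each nucleotide is its own token (no overlapping k-mers, no
    byte-pair merging), mirroring the single-nucleotide tokenization
    described above. Returns the masked token list and the indices the
    model would have to predict.
    """
    rng = random.Random(seed)
    tokens = list(seq)  # one token per nucleotide
    n_mask = max(1, round(len(tokens) * mask_rate))
    masked_idx = sorted(rng.sample(range(len(tokens)), n_mask))
    for i in masked_idx:
        tokens[i] = mask_token
    return tokens, masked_idx

tokens, idx = mask_nucleotides("ACGTACGTACGTACGTACGT")
# 20 tokens at a 15% rate -> 3 masked positions
```

Because every token is a single base, each masked position corresponds to exactly one nucleotide, which is what makes per-position interpretation (and zero-shot variant scoring) straightforward.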
“…Large language models are capable of predicting missense variant effects in a zero-shot manner, i.e., without any further training on labeled data. However, language models trained on non-coding DNA sequences [23, 24, 25, 26, 27, 28, 29] have not yet demonstrated the ability to make accurate zero-shot variant effect predictions.…”
Section: Introduction
confidence: 99%
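Zero-shot variant effect prediction with a masked language model is typically scored as a log-likelihood ratio between the alternate and reference nucleotide at the masked position. The sketch below assumes a hypothetical `probs_at_site` dictionary standing in for a real model's output distribution; it illustrates the scoring step only, not any particular cited model.

```python
import math

def zero_shot_variant_effect(probs_at_site, ref, alt):
    """Score a variant as log P(alt) / P(ref) under a masked LM.

    `probs_at_site` maps each nucleotide to the model's predicted
    probability at the masked variant position (hypothetical input; a
    real model would produce it by masking that position in the
    sequence). A negative score means the model disfavors the variant.
    """
    return math.log(probs_at_site[alt] / probs_at_site[ref])

# Toy distribution, not real model output:
score = zero_shot_variant_effect(
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}, ref="A", alt="T"
)
# score < 0: the A->T substitution is disfavored here
```

No labeled training data enters this computation, which is what makes the prediction "zero-shot": the score comes entirely from the pretrained model's context-conditioned distribution.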
“…New powerful Transformer-based models like BERT ([103]) are tested on metagenomic datasets, although adaptations of these approaches to metagenomics are still very recent and the field remains young and very active. Experimental approaches such as BERTax ([105]) and ViBE ([168]) aim to use these powerful models. BERTax, for instance, uses the power of BERT-like models to "learn a representation of the DNA language" and design several taxonomic classification models, at different taxonomic levels (superkingdom, phylum and genus).…”
Section: Discussion
confidence: 99%
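The multi-level classification idea mentioned above (superkingdom, then phylum, and so on) can be sketched as a routing scheme in which the prediction at one level selects the classifier for the next. This is a structural illustration only; the toy lambda classifiers and the `classify_hierarchically` helper are assumptions, not the published BERTax or ViBE models, which use BERT-style networks at each level.

```python
def classify_hierarchically(read, superkingdom_clf, phylum_clfs):
    """Route a read through per-level classifiers.

    `superkingdom_clf` returns a superkingdom label for the read;
    `phylum_clfs` maps each superkingdom to its own phylum-level
    classifier, so lower levels only discriminate within the branch
    chosen above them.
    """
    sk = superkingdom_clf(read)
    phylum = phylum_clfs[sk](read)
    return sk, phylum

# Toy stand-in classifiers (illustrative assumptions only):
sk_clf = lambda r: "Viruses" if r.startswith("ACG") else "Bacteria"
ph_clfs = {
    "Viruses": lambda r: "Cossaviricota",
    "Bacteria": lambda r: "Proteobacteria",
}
label = classify_hierarchically("ACGTTT", sk_clf, ph_clfs)
# -> ("Viruses", "Cossaviricota")
```

Conditioning each level on the previous prediction keeps every classifier's label space small, at the cost of propagating any error made at a higher level down the hierarchy.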
“…Although effective, these architectures have been outperformed by transformer architectures, as evidenced by their adoption in both the computer vision and natural language processing fields [15,16,17]. To address these limitations and exploit the transformer architecture, large language models (LLMs) have gained considerable popularity in the field of biology [18,19,20,21,22,23,24,25,26,27], offering the ability to be trained on unlabeled data and generate general-purpose representations capable of solving specific tasks. Furthermore, LLMs overcome a current limitation of other deep learning-based models, as they are not reliant on single reference genomes, which often provide an incomplete and biased depiction of genomic diversity drawn from a limited number of individuals.…”
Section: Introduction
confidence: 99%