Harnessing machine translation methods for sequence alignment

Dotan, Edo; Belinkov, Yonatan; Avram, Oren; Wygoda, Elya; Ecker, Noa; Alburquerque, Michael; Keren, Omri; Loewenthal, Gil; Pupko, Tal

doi:10.1101/2022.07.22.501063

Cited by 4 publications

(8 citation statements)

References 50 publications

“…Relaxing the indel size assumption in PIP and introducing Zipf into the model may further improve its accuracy. In addition, we recently developed a novel method for aligning sequences based on natural language processing deep-learning architectures (Dotan et al 2023: https://openreview.net/forum?id=8efJYMBrNb). The strength of this approach is that it is often time easier to simulate complex evolution phenomena rather than model them or calculate their corresponding penalty.…”

Section: Discussion (mentioning)

confidence: 99%

Statistical framework to determine indel-length distribution

Wygoda, Loewenthal, Moshe et al. 2024

Bioinformatics

Motivation: Insertions and deletions (indels) of short DNA segments, along with substitutions, are the most frequent molecular evolutionary events. Indels were shown to affect numerous macro-evolutionary processes. Because indels may span multiple positions, their impact is a product of both their rate and their length distribution. An accurate inference of indel length distribution is important for multiple evolutionary and bioinformatics applications, most notably for alignment software. Previous studies counted the number of continuous gap characters in alignments to determine the best fitting length distribution. However, gap-counting methods are not statistically rigorous, as gap blocks are not synonymous with indels. Furthermore, such methods rely on alignments that regularly contain errors and are biased due to the assumption of alignment methods that indel lengths follow a geometric distribution. Results: We aimed to determine which indel length distribution best characterizes alignments using statistically rigorous methodologies. To this end, we reduced the alignment bias using a machine-learning algorithm and applied an Approximate Bayesian Computation methodology for model selection. Moreover, we developed a novel method to test if current indel models provide an adequate representation of the evolutionary process. We found that the best fitting model varies among alignments, with a Zipf length distribution fitting the vast majority of them. Supplementary information: Supplementary data are available at Bioinformatics online.

“…The third dataset contains pairwise homologous nucleotide sequences, and the task is to correctly align them. We have previously developed a deep-learning-based algorithm for such an alignment task, in which we train transformers to map pairs of unaligned sequences, i.e., source sentences, into a valid alignment, i.e., target sentences (Dotan et al 2023). The average number of nucleotides is 429 and 434 for the source and target sentences, respectively.…”

Section: Dataset2 (mentioning)

confidence: 99%

“…In addition, each alignment row should be identical to the original corresponding (unaligned) sequence after removing all of its gaps. In rare cases, this is not the case, and these alignments are also considered invalid (Dotan et al 2023). Of note, all alignments in the training data are valid alignments.…”

Section: Coverage = VA/TA (mentioning)

confidence: 99%
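The validity condition in the statement quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the cited authors' code: the gap character `-`, the equal-row-length check, the function names, and the reading of a coverage score as the fraction of valid alignments among all produced alignments are all assumptions introduced here.

```python
from typing import Iterable, Tuple

GAP = "-"  # assumed gap character; the quoted text does not specify one

# (unaligned_a, unaligned_b, aligned_row_a, aligned_row_b)
AlignmentCase = Tuple[str, str, str, str]


def is_valid_alignment(seq_a: str, seq_b: str, row_a: str, row_b: str) -> bool:
    """Return True if a predicted pairwise alignment passes the checks below."""
    # Assumption: both rows of a pairwise alignment must have the same length.
    if len(row_a) != len(row_b):
        return False
    # Condition from the quoted statement: removing all gaps from each row
    # must give back the corresponding unaligned sequence.
    return row_a.replace(GAP, "") == seq_a and row_b.replace(GAP, "") == seq_b


def coverage(cases: Iterable[AlignmentCase]) -> float:
    """Fraction of valid alignments among all cases (assumed definition)."""
    cases = list(cases)
    if not cases:
        return 0.0
    valid = sum(is_valid_alignment(*case) for case in cases)
    return valid / len(cases)
```

For example, `is_valid_alignment("ACGT", "AGT", "ACGT", "A-GT")` holds, while a row that drops or alters a nucleotide of its source sequence fails the check.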

“…These characters are the building blocks of sophisticated structures, i.e., text and genomes, which include elements such as sentences and genes, respectively. Although NLP architectures can be adapted to biological problems, considerable differences remain between human language and genomic data (Yu et al 2019; List et al 2016; Dotan et al 2023). Among the major differences are the sequence length and the size of the dictionary, i.e., the entire set of tokens used in that language.…”

Section: Introductionmentioning

confidence: 99%

“…Long sequences raise memory consumption and run-time challenges when analyzed using deep-learning networks. Different approaches to tackle these issues have emerged, including: (1) Developing specific architectures for long sequences (Lin et al 2021; Rao et al 2021); (2) Splitting the data into smaller segments (Dotan et al 2023); (3) K-mer representation of all possible nucleotides (Ji et al 2021).…”

Section: Introductionmentioning

confidence: 99%
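Approaches (2) and (3) in the statement quoted above can be illustrated with a short Python sketch. It is a generic illustration, not the procedure of any cited work: the function names, the fixed segment size, and the default overlapping stride of 1 are assumptions made here.

```python
from typing import List


def split_segments(seq: str, size: int) -> List[str]:
    """Split a sequence into consecutive, non-overlapping segments.

    The last segment may be shorter than `size`.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    return [seq[i:i + size] for i in range(0, len(seq), size)]


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Represent a sequence as k-mers taken every `stride` positions.

    stride=1 gives overlapping k-mers; stride=k gives non-overlapping ones.
    Trailing characters that do not fill a whole k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
```

With `stride=k`, a sequence of length n yields roughly n/k tokens instead of n single-character tokens, which is the kind of input-length reduction the long-sequence discussion is concerned with.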

Effect of Tokenization on Transformers for Biological Sequences

Dotan, Jaschek, Pupko et al. 2023

Preprint

Deep learning models are transforming biological research. Many bioinformatics and comparative genomics algorithms analyze genomic data, either DNA or protein sequences. Examples include sequence alignments, phylogenetic tree inference and automatic classification of protein functions. Among these deep learning algorithms, models for processing natural languages, developed in the natural language processing (NLP) community, were recently applied to biological sequences. However, biological sequences are different than natural languages, such as English, and French, in which segmentation of the text to separate words is relatively straightforward. Moreover, biological sequences are characterized by extremely long sentences, which hamper their processing by current machine-learning models, notably the transformer architecture. In NLP, one of the first processing steps is to transform the raw text to a list of tokens. Deep-learning applications to biological sequence data mostly segment proteins and DNA to single characters. In this work, we study the effect of alternative tokenization algorithms on eight different tasks in biology, from predicting the function of proteins and their stability, through nucleotide sequence alignment, to classifying proteins to specific families. We demonstrate that applying alternative tokenization algorithms can increase accuracy and at the same time, substantially reduce the input length compared to the trivial tokenizer in which each character is a token. Furthermore, applying these tokenization algorithms allows interpreting trained models, taking into account dependencies among positions. Finally, we trained these tokenizers on a large dataset of protein sequences containing more than 400 billion amino acids, which resulted in over a three-fold decrease in the number of tokens. 
We then tested these tokenizers trained on large-scale data on the above specific tasks and showed that for some tasks it is highly beneficial to train database-specific tokenizers. Our study suggests that tokenizers are likely to be a critical component in future deep-network analysis of biological sequence data.

OM2Seq: learning retrieval embeddings for optical genome mapping

Nogin, Sapir, Zur et al. 2024

Bioinformatics Advances

Motivation: Genomics-based diagnostic methods that are quick, precise, and economical are essential for the advancement of precision medicine, with applications spanning the diagnosis of infectious diseases, cancer, and rare diseases. One technology that holds potential in this field is optical genome mapping (OGM), which is capable of detecting structural variations, epigenomic profiling, and microbial species identification. It is based on imaging of linearized DNA molecules that are stained with fluorescent labels, that are then aligned to a reference genome. However, the computational methods currently available for OGM fall short in terms of accuracy and computational speed. Results: This work introduces OM2Seq, a new approach for the rapid and accurate mapping of DNA fragment images to a reference genome. Based on a Transformer-encoder architecture, OM2Seq is trained on acquired OGM data to efficiently encode DNA fragment images and reference genome segments to a common embedding space, which can be indexed and efficiently queried using a vector database. We show that OM2Seq significantly outperforms the baseline methods in both computational speed (by two orders of magnitude) and accuracy. Availability and implementation: https://github.com/yevgenin/om2seq
