2022
DOI: 10.1101/2022.07.22.501063
Preprint

Harnessing machine translation methods for sequence alignment

Abstract: The sequence alignment problem is one of the most fundamental problems in bioinformatics and a plethora of methods were devised to tackle it. Here we introduce BetaAlign, a novel methodology for aligning sequences using a natural language processing (NLP) approach. BetaAlign accounts for the possible variability of the evolutionary process among different datasets by using an ensemble of transformers, each trained on millions of samples generated from a different evolutionary model. Our approach leads to outst…

Cited by 4 publications (8 citation statements)
References: 50 publications
“…Relaxing the indel size assumption in PIP and introducing Zipf into the model may further improve its accuracy. In addition, we recently developed a novel method for aligning sequences based on natural language processing deep-learning architectures ( Dotan et al 2023 : https://openreview.net/forum?id=8efJYMBrNb ). The strength of this approach is that it is oftentimes easier to simulate complex evolution phenomena rather than model them or calculate their corresponding penalty.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…The third dataset contains pairwise homologous nucleotide sequences, and the task is to correctly align them. We have previously developed a deep-learning-based algorithm for such an alignment task, in which we train transformers to map pairs of unaligned sequences, i.e., source sentences, into a valid alignment, i.e., target sentences (Dotan et al 2023). The average number of nucleotides is 429 and 434 for the source and target sentences, respectively.…”
Section: Dataset2
Citation type: mentioning
confidence: 99%
“…In addition, each alignment row should be identical to the original corresponding (unaligned) sequence after removing all of its gaps. In rare cases, this is not the case, and these alignments are also considered invalid (Dotan et al 2023). Of note, all alignments in the training data are valid alignments.…”
Section: Coverage = VA / TA
Citation type: mentioning
confidence: 99%
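
The validity criterion quoted above (each alignment row must equal its original unaligned sequence once gaps are removed) can be illustrated with a minimal Python sketch. This is not code from the cited work: the function name `is_valid_alignment`, the `-` gap character, and the list-of-strings representation are assumptions made for illustration, and the excerpt implies further validity criteria that are not shown here.

```python
GAP = "-"  # assumed gap symbol; the excerpt does not specify one


def is_valid_alignment(unaligned, aligned_rows, gap=GAP):
    """Check the single criterion quoted above.

    unaligned    -- list of original (ungapped) sequences
    aligned_rows -- list of alignment rows, in the same order
    Returns True only if every row, with gaps stripped, equals
    its corresponding original sequence.
    """
    # A row count mismatch means some sequence has no corresponding row.
    if len(unaligned) != len(aligned_rows):
        return False
    return all(
        row.replace(gap, "") == seq
        for seq, row in zip(unaligned, aligned_rows)
    )


if __name__ == "__main__":
    originals = ["ACGT", "AGT"]
    print(is_valid_alignment(originals, ["ACGT", "A-GT"]))  # row 2 de-gaps to AGT
    print(is_valid_alignment(originals, ["ACGT", "A-GA"]))  # row 2 de-gaps to AGA
```

A check of this kind is useful for a generative aligner because the model emits the alignment as free text, so nothing guarantees that its output preserves the input sequences.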