Biological pairwise sequence alignment is a method for arranging the characters of two biological sequences to identify regions of similarity. This operation has attracted considerable interest because of its significant influence on various critical aspects of life (e.g., identifying mutations in coronaviruses). However, sequence alignment over large databases cannot yield results within reasonable time, power, and cost. Although heuristic methods, such as FASTA and the BLAST family, have been demonstrated to perform 40 times faster than dynamic programming (DP)-based techniques (e.g., Needleman-Wunsch), they cannot guarantee an optimal alignment result. This study describes an optimized, lookup-table-based software platform for a widely used DNA sequence alignment algorithm, the Needleman-Wunsch (NW) algorithm. This global alignment algorithm is the best approach for identifying similar regions between sequences. This study presents a new application of classical machine learning (ML) to global sequence alignment, in which customized ML models are used to implement NW global alignment. An accuracy of 99.7% is achieved when using a multilayer perceptron with the ADAM optimizer, and up to 2912 Giga cell updates per second are realized on two real DNA sequences with a length of 4.1 M nucleotides. Our implementation is valid for RNA/DNA sequences. This study aims to parallelize the computation steps involved in the algorithm and thereby accelerate its performance by using ML algorithms. All datasets used in this study are available from https://ieeedataport.org/documents/dna-sequence-alignment-datasets-based-nw-algorithm.
INDEX TERMS Bioinformatics, DNA, RNA, Pairwise sequence alignment (PWSA), Needleman-Wunsch (NW) algorithm, Machine learning (ML) algorithms, Multilayer perceptron (MLP), XGBoost algorithm.
CONTRIBUTION: This study presented six DNA/RNA sequence alignment datasets for one of the most common alignment algorithms, namely, the Needleman-Wunsch (NW) algorithm. It proposed a fast and parallel implementation of the NW algorithm by using machine learning techniques. This research is an extended and improved version of our previous work [1]. The current implementation achieved 99.7% accuracy by using a multilayer perceptron with the ADAM optimizer and up to 2912 Giga cell updates per second on two real DNA sequences with a length of 4.1 M nucleotides. Our implementation is valid for extremely long sequences by using the divide-and-conquer strategy.
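
For readers unfamiliar with the baseline, the following minimal Python sketch shows the classical DP form of NW global alignment scoring and how a cell-updates-per-second figure is conventionally computed. It is not the lookup-table or ML-based implementation described in this study. The scoring values (match = 1, mismatch = -1, gap = -1) and the function names `nw_score` and `gcups` are assumptions chosen for illustration only.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score of sequences a and b (classical NW).

    The scoring values are illustrative assumptions, not the study's settings.
    Only two rows of the DP matrix are kept, so memory is O(len(b)).
    """
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Each cell depends on its diagonal, upper, and left neighbours.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def gcups(len_a, len_b, seconds):
    """Giga cell updates per second for a len_a x len_b DP matrix."""
    return len_a * len_b / seconds / 1e9


if __name__ == "__main__":
    print(nw_score("GATTACA", "GATTACA"))
    print(nw_score("ACGT", "AGT"))
    print(gcups(1000, 1000, 1.0))
```

Each cell depends only on its three neighbours, so all cells on the same anti-diagonal are independent of one another; this is the property that parallel implementations of NW typically exploit.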