Biological sequence alignment is one of the most important operations in Bioinformatics, executing thousands of times every day around the world. The exact algorithms for this purpose have quadratic time complexity. So when the comparison involves very long sequences, such as in the human genome, matrices with petabytes must be calculated, and this is still considered unfeasible by most researchers. The main objective of this Thesis is to propose and evaluate algorithms and optimizations that produce the optimal alignment of very long DNA sequences in a short time using high-performance computing platforms. The proposed algorithms use parallel divide-and-conquer techniques with reduced memory complexity, whilst with quadratic time complexity. CUDAlign, in its versions 2.0, 2.1, 3.0 and 4.0, is the main contribution of this Thesis. The proposed algorithms are integrated into the same tool, allowing efficient retrieval of the optimal alignment between two long DNA sequences using multiple GPUs (Graphics Processing Unit) from NVIDIA. The proposed optimizations maintain the maximum parallelism during most of the processing time. To accelerate the matrix calculation in a single GPU, the Orthogonal Execution, Balanced Partition and Block Pruning optimizations were proposed, increasing the performance of the matrix computation and discarding areas that do not contribute to the optimal alignment. The formal analysis of Block Pruning shows that its effectiveness depends on factors such as the sequences similarity and the matrix processing order. During the alignment computation with multiple GPUs, the Incremental Speculative Traceback optimization is proposed to accelerate the alignment retrieval, using speculated values with high accuracy rate. A dynamic load balancing method has also been proposed and its effectiveness has been shown in simulated environments. Finally, the software architecture called Multi-Platform Architecture for Sequence aligners (MASA) was proposed to simplify the portability of CUDAlign to different hardware and software platforms. With this architecture, it was possible to port CUDAlign to hardware platforms such as CPU and Intel Phi, and using software platforms such as OpenMP and OmpSs. In this Thesis, real sequences are used to validate the effectiveness of the proposed algorithms and optimizations in several supported architectures. Our proposed tools were able to advance the state-of-the-art of sequence alignment algorithms, allowing a fast retrieval of all human and chimpanzee homologous chromosomes, using exact algorithms at an unprecedented rate of up to 10.35 TCUPS (Trillions of Cells Updated Per Second). As far as we know, this was the first time that this type of comparison was carried out with exact sequence comparison algorithms.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.