In this paper, we present GSWABE, a graphics processing unit (GPU)-accelerated pairwise sequence alignment algorithm for collections of short DNA sequences. The algorithm supports all-to-all pairwise global, semi-global and local alignment, and retrieves optimal alignments on Compute Unified Device Architecture (CUDA)-enabled GPUs. All three alignment types are based on dynamic programming and share almost the same computational pattern. We have therefore investigated a general tile-based approach that facilitates fast alignment by fully exploiting the compute capability of CUDA-enabled GPUs. The performance of GSWABE has been evaluated on a Kepler-based Tesla K40 GPU using a variety of short DNA sequence datasets. The results show that our algorithm can yield a performance of up to 59.1 billion cell updates per second (GCUPS), 58.5 GCUPS and 50.3 GCUPS for global, semi-global and local alignment, respectively. Furthermore, on the same system, GSWABE runs up to 156.0 times faster than the Streaming SIMD Extensions (SSE)-based SSW library and up to 102.4 times faster than the CUDA-based MSA-CUDA (first stage) for local alignment. Compared with the CUDA-based gpu-pairAlign, GSWABE demonstrates stable and consistent speedups, with maximum speedups of 11.2, 10.7 and 10.6 for global, semi-global and local alignment, respectively.

Pairwise alignment is computationally demanding, especially for large-scale datasets. This has therefore driven a substantial amount of research into parallelizing pairwise alignment on high-performance computing architectures, ranging from loosely coupled to tightly coupled ones, including clouds [12], clusters [13,14] and accelerators [15,16]. Among these architectures, accelerators, including the single instruction multiple data (SIMD) vector processing units (VPUs) affiliated with CPUs, field-programmable gate arrays (FPGAs) and general-purpose GPUs, have recently become the predominant techniques.

The SIMD VPUs affiliated with CPUs are the most widely used of these techniques. Two general approaches have been investigated to exploit the computational features of SIMD vectors: the inter-task (or inter-sequence) parallelization model and the intra-task (or intra-sequence) model. The inter-task model performs multiple alignments in individual SIMD vectors, with one vector lane computing one alignment (e.g. [17]); this model is illustrated by the sketch at the end of this section. The intra-task model computes the alignment of a single sequence pair in parallel within vectors, following one of two computational patterns: vectorized computation parallel to the minor diagonals of the alignment matrix [18], or vectorized computation parallel to the query sequence in a sequential [19] or striped [20] layout. The two models provide a general framework for other accelerators with SIMD VPUs, including the Cell Broadband Engine and general-purpose GPUs. A few implementations [21,22] have been proposed for the Cell Broadband Engine, all of which are based on the intra-task model with the striped layout. On general-purpose GPUs, the open graphics library was initially...
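To make the preceding description concrete, the sketch below illustrates two points from the text: the inter-task parallelization model mapped to a CUDA GPU (one thread computes one independent alignment), and the observation that global, semi-global and local alignment share almost the same dynamic-programming recurrence, differing only in initialization, clamping and where the optimum is read. This is a minimal, score-only sketch under stated assumptions, not the GSWABE implementation: the kernel and symbol names (alignPairKernel, MAX_LEN, etc.) are illustrative, a linear gap penalty is used for brevity, and the semi-global variant shown (free leading and trailing gaps in the target) is only one common convention.

```cuda
// Minimal sketch (not the GSWABE code) of the inter-task model:
// each CUDA thread computes the alignment score of one sequence pair,
// so thousands of pairwise alignments proceed concurrently.
// All names and scoring parameters below are illustrative assumptions.

#include <cuda_runtime.h>

enum AlignType { GLOBAL, SEMI_GLOBAL, LOCAL };

#define MAX_LEN  256   // assumed upper bound on short-read length
#define MATCH      2   // illustrative substitution scores
#define MISMATCH  -1
#define GAP       -2   // linear gap penalty, used here for brevity

__global__ void alignPairKernel(const char* queries, const int* qLens,
                                const char* targets, const int* tLens,
                                int numPairs, AlignType type, int* scores)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;     // one thread = one pair
    if (p >= numPairs) return;

    const char* q = queries + (size_t)p * MAX_LEN;
    const char* t = targets + (size_t)p * MAX_LEN;
    const int m = qLens[p], n = tLens[p];

    int row[MAX_LEN + 1];                  // rolling row of the DP matrix H

    // First row: the only per-type difference is whether leading target gaps cost.
    for (int j = 0; j <= n; ++j)
        row[j] = (type == GLOBAL) ? j * GAP : 0;

    int best = 0;
    for (int i = 1; i <= m; ++i) {
        int diag = row[0];                             // H[i-1][0]
        row[0] = (type == LOCAL) ? 0 : i * GAP;        // first column
        for (int j = 1; j <= n; ++j) {
            int up   = row[j];                         // H[i-1][j]
            int left = row[j - 1];                     // H[i][j-1]
            int sub  = (q[i - 1] == t[j - 1]) ? MATCH : MISMATCH;
            int h = diag + sub;
            if (up   + GAP > h) h = up   + GAP;
            if (left + GAP > h) h = left + GAP;
            if (type == LOCAL && h < 0) h = 0;         // clamp only for local
            diag = up;                                 // H[i-1][j-1] for next j
            row[j] = h;
            if (type == LOCAL && h > best) best = h;   // local: best cell anywhere
        }
        if (type == SEMI_GLOBAL && i == m) {
            best = row[0];                             // semi-global: best in last row
            for (int j = 1; j <= n; ++j)
                if (row[j] > best) best = row[j];
        }
    }
    scores[p] = (type == GLOBAL) ? row[n] : best;      // global: bottom-right cell
}
```

Because each thread owns an independent DP row, this inter-task layout needs no inter-thread communication. The intra-task model would instead spread a single alignment matrix across the lanes of a vector or warp, either along the minor diagonals or along the query sequence in a sequential or striped layout.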