2020
DOI: 10.1007/978-3-030-57675-2_37

cuDTW++: Ultra-Fast Dynamic Time Warping on CUDA-Enabled GPUs

Cited by 9 publications (7 citation statements)
References 22 publications
“…sDTW, on the other hand, is a data-reusing version of the approach, and our work exploits the fine-grain parallelism that computes the whole O(M) dimension in parallel, leaving O(M + N) computational time and O(M) space. Furthermore, there is prior work that accelerates DTW using nonvolatile memories [51] and using GPU acceleration [52, 53].…”
Section: Discussion
Confidence: 99%
“…For intra-sub-matrix communication, we exploit warp shuffles for efficient register-to-register transfers within the same warp. This is an idea demonstrated by Schmidt et al. 27 but not completely explored. Threads in a warp use warp shuffles to transfer the query sample, the minimum score of the segment, and the score of the last cell in the segment to the thread on its right.…”
Section: DTWax: Architecture (Figure 3: Efficient Intra- and Inter-Matrix ...)
Confidence: 93%
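The fragment below illustrates the shuffle pattern the quoted text describes: moving three per-segment registers one lane to the right without touching shared or global memory. It is not the DTWax implementation; the function name pass_segment_right and its parameter names are assumptions chosen to mirror the three values named in the quote.

```cuda
// Illustrative sketch (assumed names): each lane hands its query sample, the
// running minimum of its segment, and the score of its segment's last cell to
// the lane on its right, using register-to-register warp shuffles.
__device__ void pass_segment_right(float& query_sample,
                                   float& segment_min,
                                   float& last_cell_score)
{
    const unsigned FULL = 0xffffffffu;
    // __shfl_up_sync(mask, v, 1): lane i receives lane i-1's v, so every value
    // moves one lane toward higher lane IDs ("to the right"); lane 0 keeps its
    // own value and is typically overwritten by the caller's boundary handling.
    query_sample    = __shfl_up_sync(FULL, query_sample,    1);
    segment_min     = __shfl_up_sync(FULL, segment_min,     1);
    last_cell_score = __shfl_up_sync(FULL, last_cell_score, 1);
}
```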
“…DTWax can be reprogrammed to test for any target reference of interest. Unlike some of the prior works 4, 27, DTWax can be reprogrammed to test for longer target references. Further, one may easily try and scale DTWax across multiple GPUs for higher throughput on longer or multiple target references.…”
Section: Methods
Confidence: 99%