StreamAligner: a streaming based sequence aligner on Apache Spark

Rathee, Sanjay; Kashyap, Arti

doi:10.1186/s40537-018-0114-y

Cited by 7 publications

(5 citation statements)

References 31 publications


“…Various DSPFs have been proposed for special purposes, such as multimedia streaming framework [12], P2P live framework [13], and fraud detection framework [14]. To process genomics data in a fast and efficient way, a novel sequence aligner was implemented on Apache Spark [15]. The multiquery component of Apache Flink was optimized for big data [16].…”

Section: Review On Streaming Framework (mentioning)

confidence: 99%

Probabilistic Hesitant Fuzzy Methods for Prioritizing Distributed Stream Processing Frameworks for IoT Applications

Lin, Huang, Lin

2021

Mathematical Problems in Engineering


Distributed stream processing frameworks (DSPFs) are the vital engine, which can handle real-time data processing and analytics for IoT applications. How to prioritize DSPFs and select the most suitable one for special IoT applications is an open issue. To help developers of IoT applications to solve this complex issue, a novel probabilistic hesitant fuzzy multicriteria decision making (MCDM) model is put forward in this paper. To characterize the requirements for large-scale IoT data stream processing, a novel evaluation criteria system including qualitative and quantitative criteria is established. To accurately model the collective opinions from skilled developers and consider their psychological distance, the definition of probabilistic hesitant fuzzy sets (PHFSs) is used. To derive the importance degrees of criteria, a novel probabilistic hesitant fuzzy best-worst (PHFBW) method is proposed based on the score value. To prioritize the DSPFs and choose the most suitable one, a novel probabilistic hesitant fuzzy MULTIMOORA method is put forward. Finally, a practical case composed of four Apache stream processing frameworks, namely, Storm, Flink, Spark, and Samza, is studied. The obtained results indicate that throughput, latency, and reliability are considered to be the three most important criteria, and Flink is the most suitable stream framework.


“…For instance, we can find in the literature several solutions for estimating the number of k-mers in genomic datasets, such as KmerStream [28], ntCard [29], KmerEstimate [30] and Khmer [31]. Other tools are focused on sequence alignment (StreamAligner [32], StreamBWA [33]), metagenomics profiling (Flint [34]) and DNA analysis (SparkGA2 [35]). These latter examples are all implemented on top of the legacy Spark Streaming API instead of using Spark Structured Streaming as in our approach.…”

Section: Related Work (mentioning)

confidence: 99%

SeQual-Stream: approaching stream processing to quality control of NGS datasets

Castellanos-Rodríguez, Expósito, Touriño

2023

BMC Bioinformatics


Background: Quality control of DNA sequences is an important data preprocessing step in many genomic analyses. However, all existing parallel tools for this purpose are based on a batch processing model, needing to have the complete genetic dataset before processing can even begin. This limitation clearly hinders quality control performance in those scenarios where the dataset must be downloaded from a remote repository and/or copied to a distributed file system for its parallel processing.

Results: In this paper we present SeQual-Stream, a streaming tool that allows performing multiple quality control operations on genomic datasets in a fast, distributed and scalable way. To do so, our approach relies on the Apache Spark framework and the Hadoop Distributed File System (HDFS) to fully exploit the stream paradigm and accelerate the preprocessing of large datasets as they are being downloaded and/or copied to HDFS. The experimental results have shown significant improvements in the execution times of SeQual-Stream when compared to a batch processing tool with similar quality control features, providing a maximum speedup of 2.7× when processing a dataset with more than 250 million DNA sequences, while also demonstrating good scalability features.

Conclusion: Our solution provides a more scalable and higher performance way to carry out quality control of large genomic datasets by taking advantage of stream processing features. The tool is distributed as free open-source software released under the GNU AGPLv3 license and is publicly available to download at https://github.com/UDC-GAC/SeQual-Stream.


“…Therefore the raw data (containing reads of any length) produced by a sequencing machine can be considered a static data set. Additionally, because the reads are generated individually, it would be possible to design an indexing algorithm that is built incrementally in real time [35]. Once built, a set of indexed reads can be rapidly queried for sequences of interest, such as structural variations, pathogenic variants, or viruses.…”

Section: Aiding Computation (mentioning)

confidence: 99%

“…However, if we take a broader look at the data sets involved in WGS analysis, we can see that a read set generated for a genome is unchanged during analysis, with the exception of preprocessing and error correction. Reads are reported sequentially and, thus, it is entirely possible to design an indexing algorithm that is built incrementally in real time as the reads are outputted by the sequencing machine [35].…”

Section: Box 2 Indexing a Set Of Reads (mentioning)

confidence: 99%
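
The idea in the two statements above, an index over reads that is built incrementally as reads arrive and can then be queried for sequences of interest, can be illustrated with a minimal sketch. This is not the method of StreamAligner or of the citing paper: the class name `IncrementalReadIndex`, the k-mer seed-and-verify scheme, and the choice of `k` are all assumptions made only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class IncrementalReadIndex:
    """Toy k-mer index that grows one read at a time.

    Hypothetical illustration only; not taken from any cited tool.
    """

    def __init__(self, k: int) -> None:
        if k <= 0:
            raise ValueError("k must be positive")
        self.k = k
        self.reads: List[str] = []
        # Maps each k-mer to the (read_id, offset) positions where it occurs.
        self.positions: Dict[str, List[Tuple[int, int]]] = defaultdict(list)

    def add_read(self, read: str) -> int:
        """Index a single read as it arrives and return its id."""
        read_id = len(self.reads)
        self.reads.append(read)
        for offset in range(len(read) - self.k + 1):
            kmer = read[offset:offset + self.k]
            self.positions[kmer].append((read_id, offset))
        return read_id

    def find(self, query: str) -> Set[int]:
        """Return ids of indexed reads that contain `query` exactly."""
        if len(query) < self.k:
            raise ValueError("query must be at least k characters long")
        seed = query[:self.k]
        hits: Set[int] = set()
        # Look up the first k-mer of the query, then verify the full match.
        for read_id, offset in self.positions.get(seed, []):
            if self.reads[read_id].startswith(query, offset):
                hits.add(read_id)
        return hits


if __name__ == "__main__":
    index = IncrementalReadIndex(k=4)
    # Simulate reads being reported one by one by a sequencing machine.
    for read in ["ACGTACGT", "TTGACGTA", "GGGGCCCC"]:
        index.add_read(read)
        # The index is usable after every insertion, not only at the end.
        print(sorted(index.find("ACGTA")))
```

Because each read only appends entries to the index, queries can be answered at any point while reads are still arriving. A practical tool would use a compressed index and tolerate mismatches; this sketch handles exact matches only.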