2021
DOI: 10.1093/gigascience/giab057

VC@Scale: Scalable and high-performance variant calling on cluster environments

Abstract: Background: Recently, many new deep learning–based variant-calling methods like DeepVariant have emerged as more accurate than conventional variant-calling algorithms such as GATK HaplotypeCaller, Strelka2, and Freebayes, albeit at higher computational cost. Therefore, there is a need for more scalable and higher-performance workflows for these deep learning methods. Almost all existing cluster-scaled variant-calling workflows that use Apache Spark/Hadoop as big data frameworks loosely …

citations
Cited by 5 publications
(6 citation statements)
references
References 24 publications
“…The table shows that apart from some tools that reports tests only on a multi-core workstation ([16], [17], [18], [19]), Spark has been widely used to implement tools aimed at parallelizing the computation on a distributed computing environment. Most of these tools have been specifically devised for, or tested on, a cloud environment ([20], [21], [22], [23], [24], [25], [26], [27], [28] [29], [30], [31], [32] [33], [34], [35], [36], [37]). Being the increasing availability of IaaS (Infrastructure as a Service) cloud computing services, it is desirable that the released tools are commonly designed to be supported also by such infrastructures.…”
Section: Apache Spark In Life Sciences (mentioning)
confidence: 99%
“…CMAN Ext. tools/frameworks Genomics genome assembly SORA [20] de novo genome assembly GraphX - - - - variant calling DECA [21] copy number variantion discovery MLlib - - - ADAM ADS-HCSpark [48] SNPs and indels calling - - - - - - SparkGA2 [22] variant calling - - - - SparkRA [49] GATK best-practices pipeline - - - - - - DeepVariant on Spark [23] SNPs and indels calling - - Apache Parquet VC@Scale [24] SNPs and indels calling - - - Apache Arrow Halvade Somatic [25] somatic variant calling - - - - - …”
Section: Apache Spark In Life Sciences (mentioning)
confidence: 99%
“…Data formats like Apache Parquet, Apache Arrow, Apache Avro have been explored extensively in conjunction with these frameworks to store and process genomic data efficiently. These frameworks include ADAM (Massie et al, 2013), SparkGA2 (Mushtaq et al, 2019), VC@Scale (Ahmad et al, 2021) and Halvade (Decap et al, 2015). Due to many underlying dependencies, inefficient memory usage, issues related to scalability, cluster deployment challenges as well as incompatible data formats, solutions based on these frameworks are still not widely used in the mainstream Bioinformatics community.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the recent Genome Analysis Toolkit (GATK, McKenna et al., 2010) version, several programs (including pileup calculations) have been implemented in a distributed manner ready to be run on the Apache Spark cluster. Other research studies confirm that big data programming paradigms can be successfully applied to many genomics analyses (Guo et al., 2018, Capuccini et al., 2020, Wiewiórka et al., 2018, Wiewiórka et al., 2017) including variant calling (Ahmad et al., 2021). The analysis of the ever-increasing genomic data sets involves significant financial investments and administrative efforts to maintain secure and fault-tolerant storage solutions as well as fast and scalable processing units.…”
Section: Introductionmentioning
confidence: 99%