Serverless computing in omics data analysis and integration

Grzesik, Piotr; Augustyn, Dariusz R; Wyciślik, Łukasz; Mrozek, Dariusz

doi:10.1093/bib/bbab349

Cited by 29 publications

(18 citation statements)

References 47 publications


“…In this way, we can build and run applications and services without provisioning or managing servers. This architecture has also been adopted in some other studies (35, 36), showing its high reliability, robustness and scalability. AutoESD can support hundreds of end-users simultaneously submitting design jobs with over 2000 genetic manipulation targets per job, and all the jobs can be processed in parallel in minutes.…”

Section: Discussion (mentioning)

confidence: 99%

AutoESD: a web tool for automatic editing sequence design for genetic manipulation of microorganisms

Yang, Mao, Wang, et al. 2022

Nucleic Acids Research


Advances in genetic manipulation and genome engineering techniques have enabled on-demand targeted deletion, insertion, and substitution of DNA sequences. One important step in these techniques is the design of editing sequences (e.g. primers, homologous arms) to precisely target and manipulate DNA sequences of interest. Experimental biologists can employ multiple tools in a stepwise manner to assist editing sequence design (ESD), but this requires various software involving non-standardized data exchange and input/output formats. Moreover, necessary quality control steps might be overlooked by non-expert users. This approach is low-throughput and can be error-prone, which illustrates the need for an automated ESD system. In this paper, we introduce AutoESD (https://autoesd.biodesign.ac.cn/), which designs editing sequences for all steps of genetic manipulation of many common homologous-recombination techniques based on screening-markers. Notably, multiple types of manipulations for different targets (CDS or intergenic region) can be processed in one submission. Moreover, AutoESD has an entirely cloud-based serverless architecture, offering high reliability, robustness and scalability; it can process hundreds of design tasks in parallel, each with thousands of targets, in minutes. To our knowledge, AutoESD is the first cloud platform enabling precise, automated, and high-throughput ESD across species, at any genomic locus for all manipulation types.


“…In some cases, communities compare the products in terms of cost efficiency and availability to make informed choices about the workflow they will propose to users. An example is offered by a recent overview of the serverless computing scenario in bioinformatics by Grzesik et al [21].…”

Section: State of the Art (mentioning)

confidence: 99%

Leveraging an open source serverless framework for high energy physics computing

et al. 2023


CERN (Centre Européen pour la Recherche Nucléaire) is the largest research centre for high energy physics (HEP). It offers unique computational challenges as a result of the large amount of data generated by the Large Hadron Collider. CERN has developed and supports a software framework called ROOT, which is the de facto standard for HEP data analysis. This framework offers a high-level and easy-to-use interface called RDataFrame, which allows managing and processing large data sets. In recent years, its functionality has been extended to take advantage of distributed computing capabilities. Thanks to its declarative programming model, the user-facing API can be decoupled from the actual execution backend. This decoupling allows physics analyses to scale automatically to thousands of computational cores over various types of distributed resources. In fact, the distributed RDataFrame module already supports the use of established general industry engines such as Apache Spark or Dask. Nevertheless, these current solutions will not be sufficient to meet future requirements in terms of the amount of data that the new projected accelerators will generate. For this reason, it is of interest to investigate a different approach, the one offered by serverless computing. Based on a first prototype using AWS Lambda, this work presents the creation of a new backend for RDataFrame distributed over the OSCAR tool, an open source framework that supports serverless computing. The implementation introduces new ways, relative to the AWS Lambda-based prototype, to synchronize the work of functions.


“…Cloud computing addresses many of the challenges associated with large whole genome sequencing projects, which can suffer from siloed data, long download times, and slow workflow runtimes (Tanjo et al, 2021). Several papers have reviewed the potential of cloud platforms for sequence data storage, sharing, and analysis (Augustyn et al, 2021; Cole & Moore, 2018; Grossman, 2019; Grzesik et al, 2021; Koppad et al, 2021; Langmead & Nellore, 2018; Leonard et al, 2019), thus here we focus on one cloud computing challenge, how to select the right compute configuration to optimize both cost and performance (Krissaane et al, 2020; Ray et al, 2021).…”

Section: Introduction (mentioning)

confidence: 99%

Accelerating genomic workflows using NVIDIA Parabricks

O’Connell, Yosufzai, Campbell, et al. 2022

Preprint


Background: As genome sequencing becomes a more integral part of scientific research, government policy, and personalized medicine, the primary challenge for researchers is shifting from generating raw data to analyzing these vast datasets. Although much work has been done to reduce compute times using various configurations of traditional CPU computing infrastructures, Graphics Processing Units (GPUs) offer the opportunity to accelerate genomic workflows by several orders of magnitude. Here we benchmark one GPU-accelerated software suite called NVIDIA Parabricks on Amazon Web Services (AWS), Google Cloud Platform (GCP), and an NVIDIA DGX cluster. We benchmarked six variant calling pipelines, including two germline callers (HaplotypeCaller and DeepVariant) and four somatic callers (Mutect2, Muse, LoFreq, SomaticSniper).

Results: For germline callers, we achieved up to 65x acceleration, bringing HaplotypeCaller runtime down from 36 hours to 33 minutes on AWS, 35 minutes on GCP, and 24 minutes on the NVIDIA DGX. Somatic callers exhibited more variation between the number of GPUs and computing platforms. On cloud platforms, GPU-accelerated germline callers resulted in cost savings compared with CPU runs, whereas somatic callers were often more expensive than CPU runs because their GPU acceleration was not sufficient to overcome the increased GPU cost.

Conclusions: Germline variant callers scaled with the number of GPUs across platforms, whereas somatic variant callers exhibited more variation in the number of GPUs with the fastest runtimes, suggesting that these workflows are less GPU optimized and require benchmarking on the platform of choice before being deployed at production scales. Our study demonstrates that GPUs can be used to greatly accelerate genomic workflows, thus bringing closer to grasp urgent societal advances in the areas of biosurveillance and personalized medicine.
