2021
DOI: 10.1093/gigascience/giab018

Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit

Abstract: Background: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “data hierarchy”. We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited com…

Cited by 4 publications (2 citation statements)
References: 25 publications
“…See, for example, Simmhan et al [601] discuss pipeline designs for reliable, scalable data ingestion in a distributed environment in order to support the Pan-STARRS repository, one of the largest digital surveys that accumulates 100TB of data annually to support 300 astronomers. Blamey et al [602] also provide an informative walk through the challenges and implementations of cloud-native intelligent data pipelines for scientific data streams in life sciences. Deelman et al [603] advocate for co-located (or in situ) computational workflows, to minimize inefficiencies with distributed computing.…”
Section: Data-intensive Science and Computing
Citation type: mentioning
confidence: 99%
“…Faster access to the most relevant data would significantly improve the efficiency of analysis. For example, HASTE project (Blamey et al, 2021) is a unique effort to address challenges related to scientific datasets. The HASTE project takes a hierarchical approach to acquisition, analysis, and interpretation of image data.…”
Section: Introduction
Citation type: mentioning
confidence: 99%