Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit

Blamey, Ben; Toor, Salman; Dahlö, Martin; Wieslander, Håkan; Harrison, P J; Sintorn, Ida‐Maria; Sabirsh, Alan; Wählby, Carolina; Spjuth, Ola; Hellander, Andreas

doi:10.1101/2020.09.13.274779

2020


Preprint


Abstract: This paper introduces the HASTE Toolkit, a cloud-native software toolkit capable of partitioning data streams in order to prioritize usage of limited resources. This in turn enables more efficient data-intensive experiments. We propose a model that introduces automated, autonomous decision making in data pipelines, such that a stream of data can be partitioned into a tiered or ordered data hierarchy. Importantly, the partitioning is online and based on data content rather than a priori metadata. At the core of…


Cited by 1 publication

References 23 publications

(26 reference statements)


Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit

et al. 2021


Background: Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered “data hierarchy”. We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources.

Findings: In our pipeline model, an “interestingness function” assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a “policy” guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools to adopt this approach. We evaluate with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope.

Conclusions: Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority, and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be “bolted on” to new and existing systems – and is intended for use with a range of technologies in different deployment scenarios.
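The pipeline model in this abstract (an interestingness function that scores each object, and a policy that maps the score to a resource decision) can be illustrated with a minimal Python sketch. This is not the HASTE Toolkit API. The names `DataObject`, `interestingness`, `threshold_policy`, and `prioritize`, the byte-counting score, and the threshold values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class DataObject:
    """Hypothetical stand-in for one item in the stream (e.g. an image)."""
    name: str
    payload: bytes


def interestingness(obj: DataObject) -> float:
    """Toy content-based score in [0, 1]: the fraction of non-zero bytes.

    Assumption: a real function would be domain-specific (e.g. an image
    quality metric); this one only shows that scoring uses data content.
    """
    if not obj.payload:
        return 0.0
    return sum(1 for b in obj.payload if b != 0) / len(obj.payload)


def threshold_policy(score: float) -> str:
    """Toy policy mapping a score to a tier; thresholds are arbitrary."""
    if score >= 0.75:
        return "tier-0"  # highest priority, e.g. fastest storage
    if score >= 0.25:
        return "tier-1"
    return "tier-2"      # lowest priority, e.g. archive or discard


def prioritize(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float],
    policy: Callable[[float], str],
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives (online) and apply the policy."""
    for obj in stream:
        score = score_fn(obj)
        yield obj.name, score, policy(score)


if __name__ == "__main__":
    stream = [
        DataObject("a", bytes([1, 1, 1, 1])),
        DataObject("b", bytes([0, 0, 0, 1])),
        DataObject("c", bytes([0, 0, 0, 0])),
    ]
    for name, score, tier in prioritize(stream, interestingness, threshold_policy):
        print(name, score, tier)
```

Keeping the scoring function and the policy as separate callables mirrors the separation the abstract notes between the scientific question of data priority and how that priority is acted on in a given deployment.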


