Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit.
HASTE
image analysis
interestingness functions
stream processing
tiered storage
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date:
19 03 2021
History:
received: 14 09 2020
revised: 26 01 2021
accepted: 23 02 2021
entrez: 19 03 2021
pubmed: 20 03 2021
medline: 17 11 2021
Status:
ppublish
Abstract
Abstract sections
BACKGROUND
Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources.
FINDINGS
In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate the toolkit with 2 microscopy imaging case studies. The first is a high content screening experiment, where images are analyzed in an on-premises container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope.
CONCLUSIONS
Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems, and is intended for use with a range of technologies in different deployment scenarios.
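Illustrative sketch (not part of the published abstract)
The pipeline model summarized in the findings, where an interestingness function scores each object and a policy turns that score into a resource decision, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and not the HASTE Toolkit's API: the names `DataObject`, `interestingness`, `tier_policy`, and `prioritize`, the scoring rule (fraction of non-zero bytes), the thresholds, and the tier names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class DataObject:
    """A single item in the incoming stream (hypothetical type)."""
    name: str
    payload: bytes


def interestingness(obj: DataObject) -> float:
    """Toy interestingness function: fraction of non-zero bytes, in [0, 1].

    Assumption for illustration only; a real function would encode
    domain knowledge (for example, an image quality measure).
    """
    if not obj.payload:
        return 0.0
    return sum(1 for b in obj.payload if b != 0) / len(obj.payload)


def tier_policy(score: float) -> str:
    """Toy policy: map a score to a storage tier (thresholds are arbitrary)."""
    if score >= 0.75:
        return "fast"
    if score >= 0.25:
        return "standard"
    return "archive"


def prioritize(
    stream: Iterable[DataObject],
    score_fn: Callable[[DataObject], float] = interestingness,
    policy: Callable[[float], str] = tier_policy,
) -> Iterator[Tuple[str, float, str]]:
    """Score each object as it arrives and let the policy pick a tier."""
    for obj in stream:
        score = score_fn(obj)
        yield obj.name, score, policy(score)


# Three toy objects with increasing non-zero content.
stream = [
    DataObject("img_a", bytes([0, 0, 0, 0])),
    DataObject("img_b", bytes([1, 0, 0, 0])),
    DataObject("img_c", bytes([1, 2, 3, 4])),
]
results = list(prioritize(stream))

for name, score, tier in results:
    print(f"{name}: score={score:.2f} tier={tier}")
```

Keeping the scoring function and the policy as separate callables mirrors the separation the conclusions describe: the scientific notion of priority can change without touching how a given deployment acts on it.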
Identifiers
pubmed: 33739401
pii: 6178703
doi: 10.1093/gigascience/giab018
pmc: PMC7976223
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2021. Published by Oxford University Press GigaScience.
References
Gigascience. 2021 Mar 19;10(3):
pubmed: 33739401
Bioinformatics. 2019 Mar 1;35(5):839-846
pubmed: 30101309
PLoS Biol. 2018 Jul 3;16(7):e2005970
pubmed: 29969450
J Struct Biol. 2016 Jul;195(1):93-9
pubmed: 27108186
J Struct Biol. 2017 Oct;200(1):20-27
pubmed: 28658599
Sci Rep. 2017 Feb 23;7:43167
pubmed: 28230161
IEEE J Biomed Health Inform. 2021 Feb;25(2):371-380
pubmed: 32750907
Methods Mol Biol. 2018;1683:89-112
pubmed: 29082489
PLoS Biol. 2015 Jul 07;13(7):e1002195
pubmed: 26151137
Nat Biotechnol. 2014 Jun;32(6):524-7
pubmed: 24911492
J Biomol Screen. 2012 Feb;17(2):266-74
pubmed: 21956170
Bioinformatics. 2019 Jan 1;35(1):119-121
pubmed: 29931085
Proteomics. 2015 Apr;15(8):1419-27
pubmed: 25663356