Rapid development of cloud-native intelligent data pipelines for scientific data streams using the HASTE Toolkit.

Biological Science Disciplines; Diagnostic Imaging; Software

HASTE; image analysis; interestingness functions; stream processing; tiered storage

Journal

GigaScience

ISSN: 2047-217X

Abbreviated title: Gigascience

Country: United States

NLM ID: 101596872

Publication information

Publication date:
19 03 2021

History:

received: 14 09 2020

revised: 26 01 2021

accepted: 23 02 2021

entrez: 19 3 2021

pubmed: 20 3 2021

medline: 17 11 2021

Status: ppublish

Abstract

Large streamed datasets, characteristic of life science applications, are often resource-intensive to process, transport, and store. We propose a pipeline model, a design pattern for scientific pipelines, where an incoming stream of scientific data is organized into a tiered or ordered "data hierarchy". We introduce the HASTE Toolkit, a proof-of-concept cloud-native software toolkit based on this pipeline model, to partition and prioritize data streams to optimize use of limited computing resources. In our pipeline model, an "interestingness function" assigns an interestingness score to data objects in the stream, inducing a data hierarchy. From this score, a "policy" guides decisions on how to prioritize computational resource use for a given object. The HASTE Toolkit is a collection of tools for adopting this approach. We evaluate the toolkit with 2 microscopy imaging case studies. The first is a high-content screening experiment, where images are analyzed in an on-premise container cloud to prioritize storage and subsequent computation. The second considers edge processing of images for upload into the public cloud for real-time control of a transmission electron microscope. Through our evaluation, we created smart data pipelines capable of effective use of storage, compute, and network resources, enabling more efficient data-intensive experiments. We note a beneficial separation between scientific concerns of data priority and the implementation of this behaviour for different resources in different deployment contexts. The toolkit allows intelligent prioritization to be "bolted on" to new and existing systems and is intended for use with a range of technologies in different deployment scenarios.
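
The pipeline model described in the abstract can be illustrated with a minimal Python sketch. This is not the HASTE Toolkit API: every name here (`DataObject`, `variance_interestingness`, `threshold_policy`, `prioritize`) is hypothetical, and the variance-based score and the tier boundaries are assumptions chosen only to make the example concrete. It shows the separation the abstract notes: one function scores each object, and a separate policy turns the score into a resource decision.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass
class DataObject:
    """One item in the stream (hypothetical stand-in for, e.g., an image)."""
    name: str
    pixels: List[float]


def variance_interestingness(obj: DataObject) -> float:
    """Toy interestingness function mapping pixel variance into [0, 1).

    Assumption for illustration only: more variance means more interesting.
    """
    n = len(obj.pixels)
    if n == 0:
        return 0.0
    mean = sum(obj.pixels) / n
    var = sum((p - mean) ** 2 for p in obj.pixels) / n
    return var / (1.0 + var)


def threshold_policy(score: float,
                     boundaries: Tuple[float, ...] = (0.5, 0.1)) -> int:
    """Toy policy: return the first tier whose lower bound the score meets.

    Tier 0 is the highest priority; scores below every bound fall into
    the last tier. The boundary values are arbitrary assumptions.
    """
    for tier, lower in enumerate(boundaries):
        if score >= lower:
            return tier
    return len(boundaries)


def prioritize(
    stream: Iterable[DataObject],
    interestingness: Callable[[DataObject], float],
    policy: Callable[[float], int],
) -> Iterator[Tuple[DataObject, float, int]]:
    """Score each object as it arrives and attach the tier the policy picks."""
    for obj in stream:
        score = interestingness(obj)
        yield obj, score, policy(score)
```

Because the scoring function and the policy are passed in separately, the same scores could drive a storage-tiering policy in one deployment and an upload-ordering policy in another without changing the scoring code.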

Identifiers

DOI: 10.1093/gigascience/giab018 PMID: 33739401 PMC: PMC7976223

pii: 6178703

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Authors

Ben Blamey (B)

Salman Toor (S)

Martin Dahlö (M)

Håkan Wieslander (H)

Philip J Harrison (PJ)

Ida-Maria Sintorn (IM)

Alan Sabirsh (A)

Carolina Wählby (C)

Ola Spjuth (O)

Andreas Hellander (A)
