SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines.

Go; Golang; flow-based programming; machine learning; pipelines; reproducibility; scientific workflow management systems

Journal

GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872

Publication information

Publication date:
2019-05-01
History:
received: 2018-10-06
revised: 2019-03-03
accepted: 2019-03-28
entrez: 2019-04-28
pubmed: 2019-04-28
medline: 2019-12-31
Status: ppublish

Abstract

The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as those with nested loops, dynamic scheduling, and parametrization, which are common in, e.g., machine learning. SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics pipeline, and a transcriptomics pipeline. SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.
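
The flow-based programming idea mentioned in the abstract (self-contained processes connected through ports, with a parameter sweep feeding downstream tasks) can be illustrated with a minimal sketch using only the Go standard library. This is not SciPipe's API: the names task, generate, train, and runSweep, the parameter name cost, and the toy scoring function are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// task is a hypothetical unit of data flowing between processes:
// one swept parameter value plus the score computed for it.
type task struct {
	cost  float64 // swept parameter (illustrative name)
	score float64 // filled in by the train process
}

// generate is a source process: it emits one task per parameter
// value on its out-port (a channel) and then closes the port.
func generate(costs []float64) <-chan task {
	out := make(chan task)
	go func() {
		defer close(out)
		for _, c := range costs {
			out <- task{cost: c}
		}
	}()
	return out
}

// train is a downstream process: it reads tasks from its in-port,
// scores each one concurrently, and forwards the result.
// The scoring function is a toy stand-in for a real model evaluation.
func train(in <-chan task) <-chan task {
	out := make(chan task)
	go func() {
		defer close(out)
		var wg sync.WaitGroup
		for t := range in {
			wg.Add(1)
			go func(t task) {
				defer wg.Done()
				t.score = 1 / (1 + (t.cost-1)*(t.cost-1))
				out <- t
			}(t)
		}
		wg.Wait() // all scored tasks are sent before the port closes
	}()
	return out
}

// runSweep wires the processes into a small network, drains the final
// out-port, and returns the best-scoring task plus all results sorted
// by parameter value.
func runSweep(costs []float64) (task, []task) {
	var best task
	var all []task
	for t := range train(generate(costs)) {
		all = append(all, t)
		if t.score > best.score {
			best = t
		}
	}
	sort.Slice(all, func(i, j int) bool { return all[i].cost < all[j].cost })
	return best, all
}

func main() {
	best, all := runSweep([]float64{0.1, 1, 10})
	fmt.Printf("evaluated %d parameter values\n", len(all))
	fmt.Printf("best cost=%g score=%g\n", best.cost, best.score)
}
```

Because each process only knows its own in-ports and out-ports, a process can be replaced or rewired without touching the others, which is the property the abstract attributes to reusable components.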

Identifiers

pubmed: 31029061
pii: 5480570
doi: 10.1093/gigascience/giz044
pmc: PMC6486472

Databases

figshare
10.6084/m9.figshare.3985674

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2019. Published by Oxford University Press.

Authors

Samuel Lampa (S)

Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Box 591, 751 24, Uppsala, Sweden.
Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Svante Arrhenius väg 16C, 106 91, Solna, Sweden.

Martin Dahlö (M)

Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Box 591, 751 24, Uppsala, Sweden.

Jonathan Alvarsson (J)

Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Box 591, 751 24, Uppsala, Sweden.

Ola Spjuth (O)

Department of Pharmaceutical Biosciences and Science for Life Laboratory, Uppsala University, Box 591, 751 24, Uppsala, Sweden.
