SciPipe: A workflow library for agile development of complex and dynamic bioinformatics pipelines.
Go
Golang
flow-based programming
machine learning
pipelines
reproducibility
scientific workflow management systems
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date:
01 05 2019
History:
received: 06 10 2018
revised: 03 03 2019
accepted: 28 03 2019
entrez: 28 4 2019
pubmed: 28 4 2019
medline: 31 12 2019
Status:
ppublish
Abstract
The complex nature of biological data has driven the development of specialized software tools. Scientific workflow management systems simplify the assembly of such tools into pipelines, assist with job automation, and aid reproducibility of analyses. Many contemporary workflow tools are specialized or not designed for highly complex workflows, such as with nested loops, dynamic scheduling, and parametrization, which is common in, e.g., machine learning. SciPipe is a workflow programming library implemented in the programming language Go, for managing complex and dynamic pipelines in bioinformatics, cheminformatics, and other fields. SciPipe helps in particular with workflow constructs common in machine learning, such as extensive branching, parameter sweeps, and dynamic scheduling and parametrization of downstream tasks. SciPipe builds on flow-based programming principles to support agile development of workflows based on a library of self-contained, reusable components. It supports running subsets of workflows for improved iterative development and provides a data-centric audit logging feature that saves a full audit trace for every output file of a workflow, which can be converted to other formats such as HTML, TeX, and PDF on demand. The utility of SciPipe is demonstrated with a machine learning pipeline, a genomics, and a transcriptomics pipeline. SciPipe provides a solution for agile development of complex and dynamic pipelines, especially in machine learning, through a flexible application programming interface suitable for scientists used to programming or scripting.
Identifiers
pubmed: 31029061
pii: 5480570
doi: 10.1093/gigascience/giz044
pmc: PMC6486472
Databanks
figshare: 10.6084/m9.figshare.3985674
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2019. Published by Oxford University Press.