Scalable Workflows and Reproducible Data Analysis for Genomics.

Big data; Bioconda; Bioinformatics; CWL; Cloud computing; Cluster computing; Common Workflow Language; Debian Linux; Evolutionary biology; GNU Guix; Guix Workflow Language; MPI; MrBayes; Nextflow; Parallelization; Snakemake; Virtual machine

Journal

Methods in molecular biology (Clifton, N.J.)
ISSN: 1940-6029
Abbreviated title: Methods Mol Biol
Country: United States
NLM ID: 9214969

Publication information

Publication date:
2019
History:
entrez: 7 7 2019
pubmed: 7 7 2019
medline: 9 1 2020
Status: ppublish

Abstract

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large data volumes involved, analyzing and integrating the data generated by such high-throughput technologies has become computationally intensive and can no longer happen on a typical desktop computer.

In this chapter we show how to describe and execute the same analysis using a number of workflow systems, and how these systems take different approaches to execution and reproducibility. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow, each of which can be run in parallel.

We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions represent the majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Bundled in lightweight containers, this software can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.

By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.

Identifiers

pubmed: 31278683
doi: 10.1007/978-1-4939-9074-0_24
pmc: PMC7613310
mid: EMS152519

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

723-745

Grants

Agency: Wellcome Trust
ID: 203077
Country: United Kingdom

Authors

Francesco Strozzi (F)

Enterome Bioscience, Paris, France.

Roel Janssen (R)

Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands.

Ricardo Wurmus (R)

BIMSB Scientific Bioinformatics Platform, Max Delbrück Center for Molecular Medicine, Berlin, Germany.

Michael R Crusoe (MR)

Common Workflow Language Project, Vilnius, Lithuania.

George Githinji (G)

KEMRI Wellcome Trust Research Programme, Kilifi, Kenya.

Paolo Di Tommaso (P)

Centre for Genomic Regulation (CRG), The Barcelona Institute for Science and Technology, Barcelona, Spain.

Dominique Belhachemi (D)

Life Technologies, Waltham, MA, USA.

Steffen Möller (S)

Institute for Biostatistics and Informatics in Medicine and Ageing Research (IBIMA), Rostock University Medical Center, Rostock, Germany.

Geert Smant (G)

Laboratory of Nematology, Department of Plant Science, Wageningen University, Wageningen, The Netherlands.

Joep de Ligt (J)

Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands.

Pjotr Prins (P)

Department of Genetics, Center for Molecular Medicine, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands. pjotr2018@thebird.nl.
Laboratory of Nematology, Department of Plant Science, Wageningen University, Wageningen, The Netherlands. pjotr2018@thebird.nl.
