Scalable Workflows and Reproducible Data Analysis for Genomics.
Big data
Bioconda
Bioinformatics
CWL
Cloud computing
Cluster computing
Common Workflow Language
Debian Linux
Evolutionary biology
GNU Guix
Guix Workflow Language
MPI
MrBayes
Nextflow
Parallelization
Snakemake
Virtual machine
Journal
Methods in molecular biology (Clifton, N.J.)
ISSN: 1940-6029
Abbreviated title: Methods Mol Biol
Country: United States
NLM ID: 9214969
Publication information
Publication date: 2019
History:
entrez: 7 7 2019
pubmed: 7 7 2019
medline: 9 1 2020
Status:
ppublish
Abstract
Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large volumes involved, analyzing and integrating the data generated by such high-throughput technologies has become computationally intensive, and analysis can no longer happen on a typical desktop computer.
In this chapter we show how to describe and execute the same analysis using a number of workflow systems, and how these systems take different approaches to execution and reproducibility. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow, each of which can be run in parallel.
We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions cover the great majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.
By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also move closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.
Identifiers
pubmed: 31278683
doi: 10.1007/978-1-4939-9074-0_24
pmc: PMC7613310
mid: EMS152519
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
723-745
Grants
Agency: Wellcome Trust
ID: 203077
Country: United Kingdom