Scalable Workflows and Reproducible Data Analysis for Genomics.

Big Data; Biological Evolution; Cloud Computing; Computational Biology / methods; Data Analysis; Genomics / methods; Humans; Reproducibility of Results; Software; Workflow

Big data; Bioconda; Bioinformatics; CWL; Cloud computing; Cluster computing; Common Workflow Language; Debian Linux; Evolutionary biology; GNU Guix; Guix Workflow Language; MPI; MrBayes; Nextflow; Parallelization; Snakemake; Virtual machine

Journal

Methods in molecular biology (Clifton, N.J.)

ISSN: 1940-6029

Abbreviated title: Methods Mol Biol

Country: United States

NLM ID: 9214969

Publication information

Publication date:
2019

History:

entrez: 7 7 2019

pubmed: 7 7 2019

medline: 9 1 2020

Status: ppublish

Abstract

Biological, clinical, and pharmacological research now often involves analyses of genomes, transcriptomes, proteomes, and interactomes, within and between individuals and across species. Because of the large volumes involved, the analysis and integration of data generated by such high-throughput technologies have become computationally intensive, and analysis can no longer happen on a typical desktop computer.

In this chapter we show how to describe and execute the same analysis using a number of workflow systems and how these follow different approaches to tackle execution and reproducibility issues. We show how any researcher can create a reusable and reproducible bioinformatics pipeline that can be deployed and run anywhere. We show how to create a scalable, reusable, and shareable workflow using four different workflow engines: the Common Workflow Language (CWL), Guix Workflow Language (GWL), Snakemake, and Nextflow, each of which can be run in parallel.

We show how to bundle a number of tools used in evolutionary biology by using the Debian, GNU Guix, and Bioconda software distributions, along with container systems such as Docker, GNU Guix, and Singularity. Together these distributions represent the great majority of software packages relevant for biology, including PAML, Muscle, MAFFT, MrBayes, and BLAST. Software bundled in lightweight containers can be deployed on a desktop, in the cloud, and, increasingly, on compute clusters.

By bundling software through these public software distributions, and by creating reproducible and shareable pipelines using these workflow engines, bioinformaticians not only spend less time reinventing the wheel but also get closer to the ideal of making science reproducible. The examples in this chapter allow a quick comparison of different solutions.

Identifiers

DOI: 10.1007/978-1-4939-9074-0_24; PMID: 31278683; PMC: PMC7613310

mid: EMS152519

Publication types

Journal Article

Languages

eng

Pagination

723-745

Grants

Agency: Wellcome Trust

ID: 203077

Country: United Kingdom

Authors

Francesco Strozzi (F)

Roel Janssen (R)

Ricardo Wurmus (R)

Michael R Crusoe (MR)

George Githinji (G)

Paolo Di Tommaso (P)

Dominique Belhachemi (D)

Steffen Möller (S)

Geert Smant (G)

Joep de Ligt (J)

Pjotr Prins (P)
