MaRe: Processing Big Data with application containers on Apache Spark.

MeSH terms: Algorithms; Big Data; Computational Biology / methods; Databases, Factual; Polymorphism, Single Nucleotide; Software; Workflow

Keywords: Apache Spark; Big Data; MapReduce; application containers; workflows

Journal

GigaScience

ISSN: 2047-217X

Abbreviated title: Gigascience

Country: United States

NLM ID: 101596872

Publication information

Publication date: 01 05 2020

Citation: Gigascience. 2020 May 1;9(5)

History:

received: 09 05 2019

revised: 10 02 2020

accepted: 07 04 2020

entrez: 6 5 2020

pubmed: 6 5 2020

medline: 5 10 2021

Status: ppublish

Abstract

BACKGROUND

Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks lack native support for application containers, which are becoming popular in scientific data processing.

RESULTS

Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have attracted the largest open source communities; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on two data-intensive applications in life science, showing ease of use and scalability.

CONCLUSIONS

MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. Compared with current best practices, which rely on workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
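To make the idea of reusing existing command-line tools inside a MapReduce model more concrete, the following is a minimal sketch in plain Python. It is not MaRe's API and it uses neither Spark nor Docker: the helper names (`run_tool`, `map_partitions`, `reduce_partitions`) and the toy GC-counting task are assumptions made for illustration only. Each "tool" is an external process that reads a partition on standard input; in a containerized setting, the command list could instead be a `docker run -i --rm <image> ...` invocation.

```python
import subprocess
import sys
from functools import reduce


def run_tool(command, data):
    """Pipe `data` through an external command and return its stdout."""
    result = subprocess.run(
        command, input=data, capture_output=True, text=True, check=True
    )
    return result.stdout


def map_partitions(partitions, command):
    """Map phase: run the same external tool independently on every partition."""
    return [run_tool(command, part) for part in partitions]


def reduce_partitions(partitions, command):
    """Reduce phase: merge partial results pairwise through an external tool."""
    return reduce(lambda a, b: run_tool(command, a + b), partitions)


# Hypothetical "tools": small Python one-liners standing in for containerized
# programs. The first counts G and C bases, the second sums integers.
COUNT_GC = [
    sys.executable, "-c",
    "import sys; s = sys.stdin.read(); print(s.count('G') + s.count('C'))",
]
SUM = [
    sys.executable, "-c",
    "import sys; print(sum(int(x) for x in sys.stdin.read().split()))",
]

# Toy input: two partitions of DNA sequences (illustrative data only).
partitions = ["ATGC\nGGCC\n", "TTAA\nCGCG\n"]

mapped = map_partitions(partitions, COUNT_GC)   # one partial count per partition
total = int(reduce_partitions(mapped, SUM))     # combined GC count
print(total)
```

In a distributed framework the map step would run on the node that already holds each partition, which is where the data locality mentioned in the conclusions would come from; this single-process sketch only illustrates the programming model.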

Identifiers

DOI: 10.1093/gigascience/giaa042

PMID: 32369166

PMCID: PMC7199472

pii: 5829834

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Authors

Marco Capuccini

Martin Dahlö

Salman Toor

Ola Spjuth