MaRe: Processing Big Data with application containers on Apache Spark.
Apache Spark
Big Data
MapReduce
application containers
workflows
Journal
GigaScience
ISSN: 2047-217X
Abbreviated title: Gigascience
Country: United States
NLM ID: 101596872
Publication information
Publication date: 1 May 2020
History:
received: 9 May 2019
revised: 10 February 2020
accepted: 7 April 2020
entrez: 6 May 2020
pubmed: 6 May 2020
medline: 5 October 2021
Status:
ppublish
Abstract
BACKGROUND
Life science is increasingly driven by Big Data analytics, and the MapReduce programming model has been proven successful for data-intensive analyses. However, current MapReduce frameworks offer poor support for reusing existing processing tools in bioinformatics pipelines. Furthermore, these frameworks do not have native support for application containers, which are becoming popular in scientific data processing.
RESULTS
Here we present MaRe, an open source programming library that introduces support for Docker containers in Apache Spark. Apache Spark and Docker are the MapReduce framework and container engine that have collected the largest open source community; thus, MaRe provides interoperability with the cutting-edge software ecosystem. We demonstrate MaRe on 2 data-intensive applications in life science, showing ease of use and scalability.
CONCLUSIONS
MaRe enables scalable data-intensive processing in life science with Apache Spark and application containers. When compared with current best practices, which involve the use of workflow systems, MaRe has the advantage of providing data locality, ingestion from heterogeneous storage systems, and interactive processing. MaRe is generally applicable and available as open source software.
Identifiers
pubmed: 32369166
pii: 5829834
doi: 10.1093/gigascience/giaa042
pmc: PMC7199472
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2020. Published by Oxford University Press.