PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.


Journal

BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
08 Nov 2019
History:
received: 13 Jun 2019
accepted: 14 Oct 2019
entrez: 10 Nov 2019
pubmed: 11 Nov 2019
medline: 11 Jan 2020
Status: epublish

Abstract

With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists face the challenge of building efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation. We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine. PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
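
The set-oriented model described in the abstract, where one operation implicitly applies to every sample of a dataset and region data stay paired with their metadata, can be illustrated with a minimal pure-Python sketch. This is a toy under stated assumptions, not PyGMQL itself: the names `Region`, `Sample`, `select_samples_by_metadata` and `filter_regions`, as well as the metadata keys and values, are hypothetical and chosen only for illustration. The sketch runs in memory and does not involve Spark or a remote GMQL service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A genomic region with one numeric attribute (hypothetical schema)."""
    chrom: str
    start: int
    stop: int
    score: float


@dataclass
class Sample:
    """One experiment: a list of regions plus its descriptive metadata."""
    regions: list
    meta: dict


def select_samples_by_metadata(dataset, predicate):
    """Keep only the samples whose metadata satisfy the predicate."""
    return [s for s in dataset if predicate(s.meta)]


def filter_regions(dataset, predicate):
    """Filter the regions of every sample; metadata are carried along unchanged."""
    return [
        Sample([r for r in s.regions if predicate(r)], dict(s.meta))
        for s in dataset
    ]


# Illustrative data only: three tiny samples standing in for thousands of files.
dataset = [
    Sample([Region("chr1", 100, 200, 0.9), Region("chr2", 50, 80, 0.2)],
           {"assay": "ChIP-seq", "cell": "K562"}),
    Sample([Region("chr1", 150, 300, 0.7)],
           {"assay": "RNA-seq", "cell": "K562"}),
    Sample([Region("chr3", 10, 60, 0.95), Region("chr1", 400, 450, 0.4)],
           {"assay": "ChIP-seq", "cell": "HepG2"}),
]

# One metadata query and one region query, each applied to the whole dataset.
chip = select_samples_by_metadata(dataset, lambda m: m["assay"] == "ChIP-seq")
strong = filter_regions(chip, lambda r: r.score >= 0.5)

# Per-sample summary: (cell line, number of surviving regions).
summary = [(s.meta["cell"], len(s.regions)) for s in strong]
print(summary)
```

Each query is written once against the dataset rather than once per file, which is the property the abstract refers to as implicit application to many files. In the real system the same kind of query would be executed by the GMQL engine rather than by Python list comprehensions.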

Identifiers

pubmed: 31703553
doi: 10.1186/s12859-019-3159-9
pii: 10.1186/s12859-019-3159-9
pmc: PMC6842186

Chemical substances

Transcription Factors 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

560

Grants

Agency: H2020 European Research Council
ID: 693174

Authors

Luca Nanni (L)

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy. luca.nanni@polimi.it.

Pietro Pinoli (P)

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy.

Arif Canakoglu (A)

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy.

Stefano Ceri (S)

Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milan, Italy.
