PyGMQL: scalable data extraction and analysis for heterogeneous genomic datasets.

MeSH terms: Data Analysis; Databases, Genetic; Enhancer Elements, Genetic / genetics; Genome; Genome-Wide Association Study; Genomics; Humans; Reproducibility of Results; Software; Transcription Factors / metabolism

Keywords: Data scalability; Distribution transparency; Genomic data; Python; Tertiary data analysis

Journal

BMC bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date:
08 Nov 2019

History:

received: 13 Jun 2019

accepted: 14 Oct 2019

entrez: 10 Nov 2019

pubmed: 11 Nov 2019

medline: 11 Jan 2020

Status: epublish

Abstract

BACKGROUND

With the growth of available sequenced datasets, analysis of heterogeneous processed data can answer increasingly relevant biological and clinical questions. Scientists are challenged in performing efficient and reproducible data extraction and analysis pipelines over heterogeneously processed datasets. Available software packages are suitable for analyzing experimental files from such datasets one by one, but do not scale to thousands of experiments. Moreover, they lack proper support for metadata manipulation.

RESULTS

We present PyGMQL, a novel software package for the manipulation of region-based genomic files and their associated metadata, built on top of the GMQL genomic big data management system. PyGMQL provides a set of expressive functions for the manipulation of region data and their metadata that can scale to arbitrary clusters and implicitly apply to thousands of files, producing millions of regions. PyGMQL provides data interoperability, distribution transparency and query outsourcing. The PyGMQL package integrates scalable data extraction over the Apache Spark engine underlying the GMQL implementation with native Python support for interactive data analysis and visualization. It supports data interoperability, solving the impedance mismatch between executing set-oriented queries and programming in Python. PyGMQL provides distribution transparency (the ability to address a remote dataset) and query outsourcing (the ability to assign processing to a remote service) in an orthogonal way. Outsourced processing can address cloud-based installations of the GMQL engine.

CONCLUSIONS

PyGMQL is an effective and innovative tool for supporting tertiary data extraction and analysis pipelines. We demonstrate the expressiveness and performance of PyGMQL through a sequence of biological data analysis scenarios of increasing complexity, which highlight reproducibility, expressive power and scalability.
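The abstract's notion of a set-oriented query that implicitly applies to every file in a dataset, acting on both region data and metadata, can be illustrated with a minimal plain-Python sketch. This is not the PyGMQL API: the names `Region`, `Sample` and `select`, and all sample values, are hypothetical and chosen only for illustration. A real GMQL engine would distribute this work over a cluster rather than loop in memory.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Region:
    # One genomic interval with a single numeric attribute (hypothetical schema).
    chrom: str
    start: int
    stop: int
    score: float


@dataclass
class Sample:
    # One experiment file: free-form metadata plus its list of regions.
    metadata: Dict[str, str]
    regions: List[Region] = field(default_factory=list)


def select(
    dataset: List[Sample],
    meta_pred: Callable[[Dict[str, str]], bool] = lambda m: True,
    region_pred: Callable[[Region], bool] = lambda r: True,
) -> List[Sample]:
    """Keep samples whose metadata match, then filter the regions of each kept sample."""
    return [
        Sample(s.metadata, [r for r in s.regions if region_pred(r)])
        for s in dataset
        if meta_pred(s.metadata)
    ]


# Illustrative toy dataset; the assay and cell values are made up for the example.
dataset = [
    Sample({"assay": "ChIP-seq", "cell": "K562"},
           [Region("chr1", 100, 200, 0.9), Region("chr1", 500, 650, 0.2)]),
    Sample({"assay": "ChIP-seq", "cell": "HepG2"},
           [Region("chr2", 10, 90, 0.7)]),
    Sample({"assay": "RNA-seq", "cell": "K562"},
           [Region("chr1", 120, 180, 0.8)]),
]

# A single call addresses the whole dataset; no per-file loop appears in user code.
chip_strong = select(
    dataset,
    meta_pred=lambda m: m["assay"] == "ChIP-seq",
    region_pred=lambda r: r.score >= 0.5,
)
total_regions = sum(len(s.regions) for s in chip_strong)
```

The caller states what to keep in terms of metadata and region attributes, and the operation returns a new dataset without modifying the input, which is the property that lets such queries be handed to a remote or distributed engine.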


Identifiers

DOI: 10.1186/s12859-019-3159-9

PMID: 31703553

PMC: PMC6842186

PII: 10.1186/s12859-019-3159-9

Chemical substances

Transcription Factors 0

Publication types

Journal Article

Languages

eng

Pagination

560

Grants

Agency: H2020 European Research Council

ID: 693174

Authors

Luca Nanni

Pietro Pinoli

Arif Canakoglu

Stefano Ceri
