Processing genome-wide association studies within a repository of heterogeneous genomic datasets.
Keywords
Data integration
GWAS
Genomics
Multiomics studies
Processed datasets
Tertiary data analysis
Journal
BMC Genomic Data
ISSN: 2730-6844
Abbreviated title: BMC Genom Data
Country: England
NLM ID: 101775394
Publication information
Publication date: 2023-03-03
History:
received: 2022-02-02
accepted: 2023-02-02
entrez: 2023-03-03
pubmed: 2023-03-04
medline: 2023-03-08
Status: epublish
Abstract
BACKGROUND
Genome-wide association studies (GWAS) observe genome-wide sets of genetic variants, typically single-nucleotide polymorphisms (SNPs), in different individuals to identify variants associated with phenotypic traits. Research efforts have so far been directed at improving GWAS techniques rather than at making GWAS results interoperable with other genomic signals; such interoperability is currently hindered by heterogeneous formats and uncoordinated experiment descriptions.
RESULTS
To facilitate integrative use in practice, we propose including GWAS datasets in the META-BASE repository, exploiting an integration pipeline, previously studied for other genomic datasets, that brings several heterogeneous data types into the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort allows us to use these datasets within multi-sample processing queries that answer important biological questions. The datasets thereby become usable in multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, and epigenetic signals.
CONCLUSIONS
As a result of our work on GWAS datasets, we enable (1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository and (2) their big data processing by means of the GenoMetric Query Language and its associated system. Future large-scale tertiary data analysis may benefit extensively from the addition of GWAS results, which can inform several different downstream analysis workflows.
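To make the representation mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a Genomic Data Model sample can be approximated as a set of genomic regions (coordinates plus attributes) paired with attribute-value metadata. All class names, field names, the half-open coordinate convention, and the values shown are hypothetical placeholders; they are not the META-BASE schema and not real GWAS results.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    """A genomic region: coordinates plus free-form attributes (hypothetical layout)."""
    chrom: str
    start: int  # 0-based, inclusive (a convention assumed for this sketch)
    stop: int   # exclusive
    strand: str = "*"
    attributes: dict = field(default_factory=dict)


@dataclass
class Sample:
    """A sample: a list of regions plus attribute-value metadata."""
    sample_id: str
    regions: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)


def snp_to_region(chrom, position, attributes):
    """Turn a 1-based SNP position into a one-base half-open region."""
    return Region(chrom, position - 1, position, "*", dict(attributes))


def overlaps(a, b):
    """True if two regions share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.stop and b.start < a.stop


def regions_overlapping(sample, target):
    """Return the regions of `sample` that overlap `target`."""
    return [r for r in sample.regions if overlaps(r, target)]


# Placeholder values, for illustration only.
gwas_sample = Sample(
    sample_id="example_study",
    regions=[
        snp_to_region("chr1", 1000, {"trait": "example trait", "p_value": 1e-9}),
        snp_to_region("chr2", 5000, {"trait": "example trait", "p_value": 1e-8}),
    ],
    metadata={"source": "example source", "trait": "example trait"},
)

# A region from another (hypothetical) dataset, e.g. an annotation or a peak.
other_signal = Region("chr1", 900, 1100)
hits = regions_overlapping(gwas_sample, other_signal)
```

Once SNPs and other signals share one region-based format, combining them reduces to coordinate comparisons such as the overlap test above; this is the kind of operation that region-oriented query systems apply across many samples.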
Identifiers
pubmed: 36869294
doi: 10.1186/s12863-023-01111-y
pii: 10.1186/s12863-023-01111-y
pmc: PMC9985298
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
13
Grants
Agency: H2020 European Research Council
ID: 693174
Comments and corrections
Type: ErratumIn
Copyright information
© 2023. The Author(s).