HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.
Journal
Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944
Publication information
Publication date:
02 09 2023
History:
received: 23 12 2022
revised: 23 08 2023
accepted: 29 08 2023
medline: 12 9 2023
pubmed: 30 8 2023
entrez: 30 8 2023
Status:
ppublish
Abstract
Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses through the comparison of seven methods for polygenic risk scoring across multiple ancestry groups and different genetic architectures.

A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open-source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
Identifiers
pubmed: 37647640
pii: 7255913
doi: 10.1093/bioinformatics/btad535
pmc: PMC10493177
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Copyright information
© The Author(s) 2023. Published by Oxford University Press.