HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
02 09 2023
History:
received: 23 12 2022
revised: 23 08 2023
accepted: 29 08 2023
medline: 12 9 2023
pubmed: 30 8 2023
entrez: 30 8 2023
Status: ppublish

Abstract

Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking. We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. In comparison to alternative methods, HAPNEST shows faster computational speed and a lower degree of relatedness with reference panels, while generating datasets that preserve key statistical properties of real data. These desirable synthetic data properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven polygenic risk scoring methods across multiple ancestry groups and different genetic architectures. A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and as open-source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.

Identifiers

pubmed: 37647640
pii: 7255913
doi: 10.1093/bioinformatics/btad535
pmc: PMC10493177

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Copyright information

© The Author(s) 2023. Published by Oxford University Press.

Authors

Sophie Wharrie (S)

Department of Computer Science, Aalto University, Espoo 02150, Finland.

Zhiyu Yang (Z)

Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland.

Vishnu Raj (V)

Department of Computer Science, Aalto University, Espoo 02150, Finland.

Remo Monti (R)

Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany.

Rahul Gupta (R)

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States.

Ying Wang (Y)

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States.

Alicia Martin (A)

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States.

Luke J O'Connor (LJ)

Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States.

Samuel Kaski (S)

Department of Computer Science, Aalto University, Espoo 02150, Finland.
Department of Computer Science, University of Manchester, Manchester M13 9PL, United Kingdom.

Pekka Marttinen (P)

Department of Computer Science, Aalto University, Espoo 02150, Finland.

Pier Francesco Palamara (PF)

Department of Statistics, University of Oxford, Oxford OX1 2JD, United Kingdom.

Christoph Lippert (C)

Hasso Plattner Institute, University of Potsdam, Digital Engineering Faculty, Potsdam 14469, Germany.
Hasso Plattner Institute for Digital Health at Mount Sinai, Icahn School of Medicine at Mount Sinai, New York, New York 10065, United States.

Andrea Ganna (A)

Institute for Molecular Medicine Finland, FIMM, HiLIFE, University of Helsinki, Helsinki 00014, Finland.
Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, United States.
