Haplotype and population structure inference using neural networks in whole-genome sequencing data.


Journal

Genome research
ISSN: 1549-5469
Abbreviated title: Genome Res
Country: United States
NLM ID: 9518021

Publication information

Publication date:
06 Jul 2022
History:
received: 03 04 2022
accepted: 28 06 2022
pubmed: 7 7 2022
medline: 7 7 2022
entrez: 6 7 2022
Status: aheadofprint

Abstract

Accurate inference of population structure is important in many studies of population genetics. Here we present HaploNet, a method for performing dimensionality reduction and clustering of genetic data. The method is based on local clustering of phased haplotypes using neural networks from whole-genome sequencing or dense genotype data. By using Gaussian mixtures in a variational autoencoder framework, we are able to learn a low-dimensional latent space in which we cluster haplotypes along the genome in a highly scalable manner. We show that we can use haplotype clusters in the latent space to infer global population structure using haplotype information by exploiting the generative properties of our framework. Based on fitted neural networks and their latent haplotype clusters, we can perform principal component analysis and estimate ancestry proportions based on a maximum likelihood framework. Using sequencing data from simulations and closely related human populations, we show that our approach is better at distinguishing closely related populations than standard admixture and principal component analysis software. We further show that HaploNet is fast and highly scalable by applying it to genotype array data of the UK Biobank.
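The abstract mentions performing principal component analysis on latent haplotype clusters. As a rough illustration only, the sketch below assumes each haplotype has already been assigned a cluster label in each genomic window, and runs a plain PCA (via SVD) on per-individual cluster counts. The function name `cluster_pca`, the toy data, and the choice of simple mean-centering are assumptions made for this example; this is not HaploNet's actual code or API, and it omits the neural network entirely.

```python
import numpy as np

def cluster_pca(assignments, n_clusters, n_components=2):
    """PCA on haplotype cluster counts (illustrative sketch, not HaploNet).

    assignments: integer array of shape (n_individuals, 2, n_windows) holding
    a hypothetical cluster label for each of the two haplotypes per window.
    """
    n_ind, _, n_win = assignments.shape
    # One-hot encode the labels: (n_individuals, 2, n_windows, n_clusters)
    one_hot = np.eye(n_clusters)[assignments]
    # Sum over the two haplotypes so each entry is a count in {0, 1, 2}
    counts = one_hot.sum(axis=1).reshape(n_ind, n_win * n_clusters)
    # Center each feature, then take the leading principal components
    centered = counts - counts.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return counts, u[:, :n_components] * s[:n_components]

# Toy data (assumed): two populations that favor different clusters
rng = np.random.default_rng(0)
pop_a = rng.choice(4, size=(10, 2, 50), p=[0.7, 0.1, 0.1, 0.1])
pop_b = rng.choice(4, size=(10, 2, 50), p=[0.1, 0.1, 0.1, 0.7])
assignments = np.concatenate([pop_a, pop_b])

counts, pcs = cluster_pca(assignments, n_clusters=4)
```

In this toy setup the first ten rows of `pcs` belong to one simulated population and the last ten to the other, so plotting the first component against the second would be the natural next step.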

Identifiers

pubmed: 35794006
pii: gr.276813.122
doi: 10.1101/gr.276813.122
pmc: PMC9435741

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Grants

Agency: Medical Research Council
ID: MC_PC_17228
Country: United Kingdom
Agency: Medical Research Council
ID: MC_QA137853
Country: United Kingdom

Copyright information

© 2022 Meisner and Albrechtsen; Published by Cold Spring Harbor Laboratory Press.

Authors

Jonas Meisner (J)

Department of Biology, Bioinformatics Center, University of Copenhagen, DK-2200 Copenhagen, Denmark.

Anders Albrechtsen (A)

Department of Biology, Bioinformatics Center, University of Copenhagen, DK-2200 Copenhagen, Denmark.

MeSH classifications