Computationally efficient whole-genome regression for quantitative and binary traits.
Journal
Nature genetics
ISSN: 1546-1718
Abbreviated title: Nat Genet
Country: United States
NLM ID: 9216904
Publication information
Publication date: 07 2021
History:
received: 21 06 2020
accepted: 13 04 2021
pubmed: 22 5 2021
medline: 31 8 2021
entrez: 21 5 2021
Status: ppublish
Abstract
Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
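The memory claim in the abstract (only local segments of the genotype matrix are held at once, and several phenotypes are handled together) can be illustrated with a minimal sketch. This is not REGENIE's whole-genome regression fit: the per-variant least-squares slope used here is a stand-in chosen for brevity, and the names `iter_variant_blocks`, `blockwise_slopes`, and `block_size` are hypothetical. The sketch only shows the access pattern, in which each block of variants is read once and reused for every phenotype.

```python
from typing import Iterable, Iterator, List, Sequence

Vector = List[float]


def center(values: Sequence[float]) -> Vector:
    """Subtract the mean so that slopes do not depend on an intercept."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def iter_variant_blocks(
    variants: Iterable[Sequence[float]], block_size: int
) -> Iterator[List[Sequence[float]]]:
    """Yield consecutive blocks of at most `block_size` variants.

    Stand-in for a genotype file reader (an assumption of this sketch):
    only the current block is held in memory.
    """
    block: List[Sequence[float]] = []
    for variant in variants:
        block.append(variant)
        if len(block) == block_size:
            yield block
            block = []
    if block:
        yield block


def blockwise_slopes(
    variants: Iterable[Sequence[float]],
    phenotypes: Sequence[Sequence[float]],
    block_size: int,
) -> List[Vector]:
    """Return one row per variant holding its slope for each phenotype.

    Illustrative simplification: a simple least-squares slope per variant,
    not the whole-genome regression model described in the abstract.
    """
    traits = [center(y) for y in phenotypes]  # centered once, reused per block
    slopes: List[Vector] = []
    for block in iter_variant_blocks(variants, block_size):
        for genotype in block:
            g = center(genotype)
            sum_sq = sum(x * x for x in g)
            if sum_sq == 0.0:
                # Monomorphic variant: the slope is undefined.
                slopes.append([float("nan")] * len(traits))
                continue
            slopes.append(
                [sum(a * b for a, b in zip(g, y)) / sum_sq for y in traits]
            )
    return slopes
```

Because `variants` may be any iterable, a generator that reads from disk could be passed in place of an in-memory list, and all phenotypes are processed in the same single pass over the genotype data.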
Identifiers
pubmed: 34017140
doi: 10.1038/s41588-021-00870-7
pii: 10.1038/s41588-021-00870-7
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
1097-1103
Grants
Agency: Medical Research Council
ID: MC_PC_17228
Country: United Kingdom
Agency: Medical Research Council
ID: MC_QA137853
Country: United Kingdom
Copyright information
© 2021. The Author(s), under exclusive licence to Springer Nature America, Inc.
References
The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 447, 661–678 (2007).
doi: 10.1038/nature05911
Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575 (2007).
doi: 10.1086/519795
Kang, H. M. et al. Variance component model to account for sample structure in genome-wide association studies. Nat. Genet. 42, 348–354 (2010).
doi: 10.1038/ng.548
Price, A. L., Zaitlen, N. A., Reich, D. & Patterson, N. New approaches to population stratification in genome-wide association studies. Nat. Rev. Genet. 11, 459–463 (2010).
doi: 10.1038/nrg2813
Yu, J. et al. A unified mixed-model method for association mapping that accounts for multiple levels of relatedness. Nat. Genet. 38, 203–208 (2006).
doi: 10.1038/ng1702
Zhang, Z. et al. Mixed linear model approach adapted for genome-wide association studies. Nat. Genet. 42, 355–360 (2010).
doi: 10.1038/ng.546
Zhou, X. & Stephens, M. Genome-wide efficient mixed-model analysis for association studies. Nat. Genet. 44, 821–824 (2012).
doi: 10.1038/ng.2310
Listgarten, J. et al. Improved linear mixed models for genome-wide association studies. Nat. Methods 9, 525–526 (2012).
doi: 10.1038/nmeth.2037
Meuwissen, T. H., Hayes, B. J. & Goddard, M. E. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829 (2001).
doi: 10.1093/genetics/157.4.1819
de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D. & Calus, M. P. L. Whole genome regression and prediction methods applied to plant and animal breeding. Genetics 193, 327–345 (2012).
doi: 10.1534/genetics.112.143313
Logsdon, B. A., Hoffman, G. E. & Mezey, J. G. A variational Bayes algorithm for fast and accurate multiple locus genome-wide association analysis. BMC Bioinform. 11, 58 (2010).
doi: 10.1186/1471-2105-11-58
Carbonetto, P. & Stephens, M. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Anal. 7, 73–108 (2012).
doi: 10.1214/12-BA703
Loh, P.-R. et al. Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nat. Genet. 47, 284–290 (2015).
doi: 10.1038/ng.3190
Loh, P.-R., Kichaev, G., Gazal, S., Schoech, A. P. & Price, A. L. Mixed-model association for biobank-scale datasets. Nat. Genet. 50, 906–908 (2018).
doi: 10.1038/s41588-018-0144-6
Kerin, M. & Marchini, J. Inferring gene-by-environment interactions with a Bayesian whole-genome regression model. Am. J. Hum. Genet. 107, 698–713 (2020).
doi: 10.1016/j.ajhg.2020.08.009
Jiang, L. et al. A resource-efficient tool for mixed model association analysis of large-scale data. Nat. Genet. 51, 1749–1755 (2019).
doi: 10.1038/s41588-019-0530-8
Bycroft, C. et al. The UK Biobank resource with deep phenotyping and genomic data. Nature 562, 203–209 (2018).
doi: 10.1038/s41586-018-0579-z
Zhou, W. et al. Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341 (2018).
doi: 10.1038/s41588-018-0184-y
Yang, J., Zaitlen, N. A., Goddard, M. E., Visscher, P. M. & Price, A. L. Advantages and pitfalls in the application of mixed-model association methods. Nat. Genet. 46, 100–106 (2014).
doi: 10.1038/ng.2876
Kunert-Graf, J., Sakhanenko, N. & Galas, D. Allele frequency mismatches and apparent mismappings in UK Biobank SNP data. Preprint at bioRxiv https://doi.org/10.1101/2020.08.03.235150 (2020).
Svishcheva, G. R., Axenovich, T. I., Belonogova, N. M., van Duijn, C. M. & Aulchenko, Y. S. Rapid variance components-based method for whole-genome association analysis. Nat. Genet. 44, 1166–1170 (2012).
doi: 10.1038/ng.2410
Breiman, L. Stacked regressions. Mach. Learn. 24, 49–64 (1996).
Young, A. I., Wauthier, F. L. & Donnelly, P. Identifying loci affecting trait variability and detecting interactions in genome-wide association studies. Nat. Genet. 50, 1608–1614 (2018).
doi: 10.1038/s41588-018-0225-6
Wu, M. C. et al. Rare-variant association testing for sequencing data with the sequence kernel association test. Am. J. Hum. Genet. 89, 82–93 (2011).
doi: 10.1016/j.ajhg.2011.05.029
Lee, S., Wu, M. C. & Lin, X. Optimal tests for rare variant effects in sequencing association studies. Biostatistics 13, 762–775 (2012).
doi: 10.1093/biostatistics/kxs014
Zhou, W. et al. Scalable generalized linear mixed model for region-based association tests in large biobanks and cohorts. Nat. Genet. 52, 634–639 (2020).
doi: 10.1038/s41588-020-0621-6
Chib, S. & Greenberg, E. Analysis of multivariate probit models. Biometrika 85, 347–361 (1998).
doi: 10.1093/biomet/85.2.347
Korte, A. et al. A mixed-model approach for genome-wide association studies of correlated traits in structured populations. Nat. Genet. 44, 1066–1071 (2012).
doi: 10.1038/ng.2376
Dutta, D., Scott, L., Boehnke, M. & Lee, S. Multi-SKAT: general framework to test for rare-variant association with multiple phenotypes. Genet. Epidemiol. 43, 4–23 (2018).
doi: 10.1002/gepi.22156
Rizvi, A. A. et al. gwasurvivr: an R package for genome wide survival analysis. Bioinformatics 35, 1968–1970 (2018).
doi: 10.1093/bioinformatics/bty920
Morris, A. P. et al. A powerful approach to sub-phenotype analysis in population-based genetic association studies. Genet. Epidemiol. 34, 335–343 (2010).
doi: 10.1002/gepi.20486
Jostins, L. & McVean, G. Trinculo: Bayesian and frequentist multinomial logistic regression for genome-wide association studies of multi-category phenotypes. Bioinformatics 32, 1898–1900 (2016).
doi: 10.1093/bioinformatics/btw075
Dahl, A. et al. A multiple-phenotype imputation method for genetic studies. Nat. Genet. 48, 466–472 (2016).
doi: 10.1038/ng.3513
Kang, H. M., Ye, C. & Eskin, E. Accurate discovery of expression quantitative trait loci under confounding from spurious and genuine regulatory hotspots. Genetics 180, 1909–1925 (2008).
doi: 10.1534/genetics.108.094201
Shang, L. et al. Genetic architecture of gene expression in European and African Americans: an eQTL mapping study in GENOA. Am. J. Hum. Genet. 106, 496–512 (2020).
doi: 10.1016/j.ajhg.2020.03.002
Robinson, G. K. That BLUP is a good thing: the estimation of random effects. Stat. Sci. 6, 15–32 (1991).
Dey, R., Schmidt, E. M., Abecasis, G. R. & Lee, S. A fast and accurate algorithm to test for binary phenotypes and its application to PheWAS. Am. J. Hum. Genet. 101, 37–49 (2017).
doi: 10.1016/j.ajhg.2017.05.014
Horowitz, J. E. et al. Common genetic variants identify therapeutic targets for COVID-19 and individuals at high risk of severe disease. Preprint at medRxiv https://doi.org/10.1101/2020.12.14.20248176 (2020).
Chang, C. C. et al. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience 4, 7 (2015).
doi: 10.1186/s13742-015-0047-8
Manichaikul, A. et al. Robust relationship inference in genome-wide association studies. Bioinformatics 26, 2867–2873 (2010).
doi: 10.1093/bioinformatics/btq559
R Core Team. R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, 2013).