Computationally efficient whole-genome regression for quantitative and binary traits.

Case-Control Studies; Computational Biology / methods; Genome-Wide Association Study / methods; Genomics / methods; Genotype; Humans; Logistic Models; Machine Learning; Phenotype; Reproducibility of Results

Journal

Nature genetics

ISSN: 1546-1718

Abbreviated title: Nat Genet

Country: United States

NLM ID: 9216904

Publication information

Publication date: 07 2021

History:

received: 21 06 2020

accepted: 13 04 2021

pubmed: 22 5 2021

medline: 31 8 2021

entrez: 21 5 2021

Status: ppublish

Abstract

Genome-wide association analysis of cohorts with thousands of phenotypes is computationally expensive, particularly when accounting for sample relatedness or population structure. Here we present a novel machine-learning method called REGENIE for fitting a whole-genome regression model for quantitative and binary phenotypes that is substantially faster than alternatives in multi-trait analyses while maintaining statistical efficiency. The method naturally accommodates parallel analysis of multiple phenotypes and requires only local segments of the genotype matrix to be loaded in memory, in contrast to existing alternatives, which must load genome-wide matrices into memory. This results in substantial savings in compute time and memory usage. We introduce a fast, approximate Firth logistic regression test for unbalanced case-control phenotypes. The method is ideally suited to take advantage of distributed computing frameworks. We demonstrate the accuracy and computational benefits of this approach using the UK Biobank dataset with up to 407,746 individuals.
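
As an illustration of the block-wise idea in the abstract (only local segments of the genotype matrix are held in memory at once), the following is a minimal sketch, not REGENIE's actual implementation. It assumes a two-level ridge regression: one ridge fit per genotype block, then a second ridge fit that combines the per-block predictions. The function names, the `read_block` callable, the penalty values, and the simulated data are all hypothetical, and the sketch omits the cross-validation a real analysis would need to avoid overfitting.

```python
import numpy as np


def block_ridge_predictions(read_block, n_blocks, y, lam):
    """Fit one ridge regression per genotype block.

    read_block(b) is a hypothetical callable that returns the
    (n_samples, block_size) genotype slice for block b, so only one
    slice needs to be in memory at a time.
    Returns an (n_samples, n_blocks) matrix of per-block predictions.
    """
    preds = np.empty((y.shape[0], n_blocks))
    for b in range(n_blocks):
        G = read_block(b)
        G = G - G.mean(axis=0)  # center each variant
        # Ridge solution: beta = (G'G + lam * I)^{-1} G'y
        A = G.T @ G + lam * np.eye(G.shape[1])
        beta = np.linalg.solve(A, G.T @ y)
        preds[:, b] = G @ beta
    return preds


def combine_blocks(preds, y, lam):
    """Second-level ridge regression on the per-block predictions."""
    A = preds.T @ preds + lam * np.eye(preds.shape[1])
    w = np.linalg.solve(A, preds.T @ y)
    return preds @ w


# Simulated example (all values are made up for illustration).
rng = np.random.default_rng(0)
n_samples, n_variants, block_size = 200, 50, 10
G_full = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)
effects = rng.normal(size=n_variants)
y = G_full @ effects + rng.normal(size=n_samples)
y = y - y.mean()

# Stand-in for reading one block from disk.
def read_block(b):
    return G_full[:, b * block_size:(b + 1) * block_size]

n_blocks = n_variants // block_size
preds = block_ridge_predictions(read_block, n_blocks, y, lam=1.0)
yhat = combine_blocks(preds, y, lam=1.0)
```

In this sketch the full matrix `G_full` exists only to simulate data; in a file-backed setting `read_block` would load each slice from disk, which is what keeps the memory footprint tied to the block size rather than to the genome-wide matrix.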

Identifiers

DOI: 10.1038/s41588-021-00870-7 PMID: 34017140

pubmed: 34017140

doi: 10.1038/s41588-021-00870-7

pii: 10.1038/s41588-021-00870-7

Publication types

Journal Article

Languages

eng

Pagination

1097-1103

Grants

Agency: Medical Research Council

ID: MC_PC_17228

Country: United Kingdom

Agency: Medical Research Council

ID: MC_QA137853

Country: United Kingdom

Authors

Joelle Mbatchou (J)

Leland Barnard (L)

Joshua Backman (J)

Anthony Marcketta (A)

Jack A Kosmicki (JA)

Andrey Ziyatdinov (A)

Christian Benner (C)

Colm O'Dushlaine (C)

Mathew Barber (M)

Boris Boutkov (B)

Lukas Habegger (L)

Manuel Ferreira (M)

Aris Baras (A)

Jeffrey Reid (J)

Goncalo Abecasis (G)

Evan Maxwell (E)

Jonathan Marchini (J)
