Penalized linear mixed models for structured genetic data.

MeSH terms: Humans; Linear Models; Models, Genetic; Phenotype; Polymorphism, Single Nucleotide

Keywords: confounding; lasso; linear mixed model; penalized regression; population stratification

Journal

Genetic Epidemiology

ISSN: 1098-2272

Abbreviated title: Genet Epidemiol

Country: United States

NLM ID: 8411723

Publication information

Publication date:
07 2021

History:

received: 18 12 2020

revised: 19 03 2021

accepted: 29 03 2021

pubmed: 18 5 2021

medline: 26 10 2021

entrez: 17 5 2021

Status: ppublish

Abstract

Many genetic studies that aim to identify genetic variants associated with complex phenotypes are subject to unobserved confounding factors arising from environmental heterogeneity. This poses a challenge to detecting associations of interest and is known to induce spurious associations when left unaccounted for. Penalized linear mixed models (LMMs) are an attractive method to correct for unobserved confounding. These methods correct for varying levels of relatedness and population structure by modeling this structure as a random effect whose covariance is estimated from the observed genetic data. Despite an extensive literature on penalized regression and LMMs separately, the two are rarely discussed together. The aim of this review is to do so while examining the statistical properties of penalized LMMs in the genetic association setting. Specifically, the ability of penalized LMMs to accurately estimate genetic effects in the presence of environmental confounding has not been well studied. To clarify the important yet subtle distinction between population structure and environmental heterogeneity, we present a detailed review of relevant concepts and methods. In addition, we evaluate the performance of penalized LMMs and competing methods in terms of estimation and selection accuracy in the presence of a number of confounding structures.
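
The mechanism the abstract describes (a random effect whose covariance is estimated from the genotypes, combined with a lasso penalty on the SNP effects) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes a realized relationship matrix K = ZZ'/p built from standardized genotypes, a fixed and purely hypothetical variance ratio `eta` (in practice this would be estimated, for example from a null model), and a plain coordinate-descent lasso. All function names, the simulated data, and the values of `eta` and `lam` are illustrative assumptions.

```python
import numpy as np

def kinship(G):
    """Standardize genotype columns and return (K, Z) with K = Z Z^T / p."""
    sd = G.std(axis=0)
    sd[sd == 0] = 1.0                      # monomorphic SNPs become all-zero columns
    Z = (G - G.mean(axis=0)) / sd
    return Z @ Z.T / G.shape[1], Z

def rotate(Z, y, K, eta):
    """Whiten data by V = eta*K + (1 - eta)*I via the eigendecomposition of K.

    eta is the assumed share of variance due to the random effect (0 <= eta < 1).
    """
    s, U = np.linalg.eigh(K)
    s = np.clip(s, 0.0, None)              # guard against tiny negative eigenvalues
    w = 1.0 / np.sqrt(eta * s + (1.0 - eta))
    return (U.T @ Z) * w[:, None], (U.T @ y) * w

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    r = y.copy()                           # residual y - Xb, kept up to date
    col_ss = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0:
                continue
            rho = X[:, j] @ r / n + col_ss[j] * b[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_ss[j]
            r += X[:, j] * (b[j] - new)
            b[j] = new
    return b

# Simulated example (hypothetical data): two subpopulations with different
# allele frequencies and an environmental shift tied to subpopulation.
rng = np.random.default_rng(0)
n, p = 60, 100
pop = np.repeat([0, 1], n // 2)
f0 = rng.uniform(0.2, 0.5, p)
f1 = np.clip(f0 + rng.normal(0.0, 0.1, p), 0.05, 0.95)
G = rng.binomial(2, np.stack([f0, f1])[pop]).astype(float)

K, Z = kinship(G)
beta = np.zeros(p)
beta[:5] = 1.0                             # five causal SNPs
y = Z @ beta + 1.0 * pop + rng.normal(0.0, 1.0, n)

# Centering y handles the intercept because the columns of Z are centered.
Zr, yr = rotate(Z, y - y.mean(), K, eta=0.5)   # eta fixed for illustration only
b_hat = lasso_cd(Zr, yr, lam=0.1)
print("selected SNP indices:", np.flatnonzero(b_hat))
```

Rotating by the eigenvectors of K and rescaling by the eigenvalues turns the correlated-errors problem into an ordinary lasso problem on the transformed data, which is why a standard lasso solver can be reused after the rotation step.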

Identifiers

DOI: 10.1002/gepi.22384 PMID: 33998038

Publication types

Journal Article; Review

Languages

eng

Pagination

427-444

Authors

Anna C Reisetter (AC)

Patrick Breheny (P)
