Penalized linear mixed models for structured genetic data.
confounding
lasso
linear mixed model
penalized regression
population stratification
Journal
Genetic epidemiology
ISSN: 1098-2272
Abbreviated title: Genet Epidemiol
Country: United States
NLM ID: 8411723
Publication information
Publication date:
07 2021
History:
revised: 19 03 2021
received: 18 12 2020
accepted: 29 03 2021
pubmed: 18 5 2021
medline: 26 10 2021
entrez: 17 5 2021
Status:
ppublish
Abstract
Many genetic studies that aim to identify genetic variants associated with complex phenotypes are subject to unobserved confounding factors arising from environmental heterogeneity. Such confounding makes associations of interest harder to detect and is known to induce spurious associations when left unaccounted for. Penalized linear mixed models (LMMs) are an attractive method to correct for unobserved confounding. These methods correct for varying levels of relatedness and population structure by modeling them as a random effect with a covariance structure estimated from observed genetic data. Despite an extensive literature on penalized regression and LMMs separately, the two are rarely discussed together. The aim of this review is to do so while examining the statistical properties of penalized LMMs in the genetic association setting. Specifically, the ability of penalized LMMs to accurately estimate genetic effects in the presence of environmental confounding has not been well studied. To clarify the important yet subtle distinction between population structure and environmental heterogeneity, we present a detailed review of relevant concepts and methods. In addition, we evaluate the performance of penalized LMMs and competing methods in terms of estimation and selection accuracy in the presence of a number of confounding structures.
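For orientation, one common way to write a lasso-penalized LMM is sketched below. The abstract does not give a formula, so this is an illustrative formulation under stated assumptions rather than the authors' exact model: the symbols $y$ (phenotype), $X$ (genotype matrix with $n$ samples and $p$ variants), $\tilde{X}$ (column-standardized $X$), $K$ (relatedness matrix), $\eta$ (share of variance attributed to the random effect, treated here as already estimated), and $\lambda$ (penalty level) are all notation introduced for this sketch.

```latex
% Model: fixed genetic effects plus a structured random effect (assumed notation)
\begin{aligned}
y &= X\beta + u + \varepsilon, \qquad
u \sim N(0,\ \sigma_g^2 K), \qquad
\varepsilon \sim N(0,\ \sigma_e^2 I_n), \\
K &= \tfrac{1}{p}\,\tilde{X}\tilde{X}^{\top} = U S U^{\top}, \qquad
\eta = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_e^2}.
\end{aligned}

% Marginally, Var(y) = (\sigma_g^2 + \sigma_e^2)\,(\eta K + (1-\eta) I_n).
% Rotating by U^T and reweighting decorrelates the observations,
% after which an ordinary lasso can be applied:
\begin{aligned}
W &= \bigl(\eta S + (1-\eta) I_n\bigr)^{-1/2}, \\
\hat{\beta} &= \arg\min_{\beta}\ \frac{1}{2n}\,
\bigl\lVert W U^{\top} (y - X\beta) \bigr\rVert_2^2 + \lambda \lVert \beta \rVert_1 .
\end{aligned}
```

In this sketch the random effect $u$ absorbs relatedness and population structure through $K$, while the $\ell_1$ penalty performs variant selection on the rotated data.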
Publication types
Journal Article
Review
Languages
eng
Citation subsets
IM
Pagination
427-444
Copyright information
© 2021 Wiley Periodicals LLC.