Statistical learning for sparser fine-mapped polygenic models: The prediction of LDL-cholesterol.


Journal

Genetic epidemiology
ISSN: 1098-2272
Abbreviated title: Genet Epidemiol
Country: United States
NLM ID: 8411723

Publication information

Publication date:
12 2022
History:
received: 24 08 2021
revised: 03 06 2022
accepted: 07 06 2022
pubmed: 9 8 2022
medline: 19 11 2022
entrez: 8 8 2022
Status: ppublish

Abstract

Polygenic risk scores quantify an individual's genetic predisposition to a particular trait. We propose and illustrate the application of existing statistical learning methods to derive sparser models for genome-wide data with a polygenic signal. Our approach consists of three consecutive steps. First, potentially informative loci are identified by marginal screening. Then, fine-mapping is applied independently to blocks of variants in linkage disequilibrium, where informative variants are retrieved using variable selection methods, including boosting with probing and stochastic searches with the Adaptive Subspace method. Finally, joint prediction models with the selected variants are derived using statistical boosting. In contrast to alternative approaches that rely on univariate summary statistics from genome-wide association studies, our three-step approach makes it possible to select and fit multivariable regression models on large-scale genotype data. Based on UK Biobank data, we develop prediction models for LDL-cholesterol as a continuous trait. Additionally, we consider a recent scalable algorithm for the Lasso. Results show that statistical learning approaches based on fine-mapping of genetic signals achieve competitive prediction performance compared to classical polygenic risk approaches, while yielding sparser risk models.
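The three steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it runs on simulated, independent genotypes rather than UK Biobank data, uses fixed positional chunks as a stand-in for linkage disequilibrium blocks, covers only boosting with probing (not the Adaptive Subspace method), and all function names, sample sizes, and tuning values below are hypothetical choices made for illustration.

```python
import numpy as np

def standardize(X):
    """Center each column and scale it to unit variance."""
    Xc = X - X.mean(axis=0)
    sd = Xc.std(axis=0)
    sd[sd == 0] = 1.0  # guard against constant columns
    return Xc / sd

def marginal_screen(X, y, keep):
    """Step 1: keep the `keep` variants with the largest absolute marginal association."""
    score = np.abs(standardize(X).T @ (y - y.mean()))
    return np.sort(np.argsort(score)[::-1][:keep])

def l2_boost(X, y, steps=200, nu=0.1, probing=False, rng=None):
    """Component-wise L2 boosting on standardized columns.

    With probing=True, a permuted "shadow" copy of every column is appended
    and boosting stops as soon as a shadow column would be selected.
    """
    Z = standardize(X)
    n, p = Z.shape
    if probing:
        shadows = np.column_stack([rng.permutation(Z[:, j]) for j in range(p)])
        Z = np.hstack([Z, shadows])
    coef = np.zeros(Z.shape[1])
    resid = y - y.mean()
    for _ in range(steps):
        beta = Z.T @ resid / n            # least-squares slope of each column
        j = int(np.argmax(np.abs(beta)))  # best-fitting single column
        if probing and j >= p:
            break                         # a shadow variable won: stop
        coef[j] += nu * beta[j]
        resid = resid - nu * beta[j] * Z[:, j]
    return coef[:p]

def fine_map(X, y, screened, block_size, rng):
    """Step 2: run boosting with probing separately within each block."""
    selected = []
    block_id = screened // block_size     # assumption: blocks are positional chunks
    for b in np.unique(block_id):
        idx = screened[block_id == b]
        coef = l2_boost(X[:, idx], y, probing=True, rng=rng)
        selected.extend(idx[coef != 0].tolist())
    return np.array(sorted(selected), dtype=int)

# Simulated toy data (assumption: independent variants, three causal ones).
rng = np.random.default_rng(0)
n, p = 1000, 200
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)
causal = np.array([10, 75, 150])
effects = np.array([1.0, -1.0, 1.0])
y = X[:, causal] @ effects + rng.normal(size=n)

screened = marginal_screen(X, y, keep=40)                    # step 1
selected = fine_map(X, y, screened, block_size=20, rng=rng)  # step 2
coef = l2_boost(X[:, selected], y, steps=300)                # step 3: joint model
fitted = y.mean() + standardize(X[:, selected]) @ coef

print("selected variants:", selected.tolist())
print("in-sample correlation:", np.corrcoef(fitted, y)[0, 1])
```

In this toy setting the joint model in the last step is fitted only on variants that survived both screening and block-wise selection, which is what keeps the final score sparse. Coefficients are on the standardized genotype scale, and the reported correlation is in-sample, so it overstates out-of-sample performance.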

Identifiers

pubmed: 35938382
doi: 10.1002/gepi.22495

Chemical substances

Cholesterol, LDL 0

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

589-603

Copyright information

© 2022 The Authors. Genetic Epidemiology published by Wiley Periodicals LLC.

Authors

Carlo Maj

Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany.
Centre for Human Genetics, University of Marburg, Marburg, Germany.

Christian Staerk

Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany.

Oleg Borisov

Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany.

Hannah Klinkhammer

Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany.
Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany.

Ming Wai Yeung

Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany.
Department of Cardiology, University of Groningen, Groningen, The Netherlands.

Peter Krawitz

Institute for Genomic Statistics and Bioinformatics, Medical Faculty, University Bonn, Bonn, Germany.

Andreas Mayr

Institute for Medical Biometry, Informatics and Epidemiology, Medical Faculty, University Bonn, Bonn, Germany.
