Statistical learning for sparser fine-mapped polygenic models: The prediction of LDL-cholesterol.

MeSH terms: Humans; Genome-Wide Association Study / methods; Cholesterol, LDL / genetics; Polymorphism, Single Nucleotide; Models, Genetic; Multifactorial Inheritance / genetics

Keywords: UK Biobank; boosting; polygenic score; stochastic search; variable selection

Journal

Genetic epidemiology

ISSN: 1098-2272

Abbreviated title: Genet Epidemiol

Country: United States

NLM ID: 8411723

Publication information

Publication date:
Dec 2022

History:

received: 24 Aug 2021

revised: 3 Jun 2022

accepted: 7 Jun 2022

pubmed: 9 Aug 2022

medline: 19 Nov 2022

entrez: 8 Aug 2022

Status: ppublish

Abstract

Polygenic risk scores quantify an individual's genetic predisposition to a particular trait. We propose and illustrate the application of existing statistical learning methods to derive sparser models for genome-wide data with a polygenic signal. Our approach is based on three consecutive steps. First, potentially informative loci are identified by a marginal screening approach. Then, fine-mapping is applied independently to blocks of variants in linkage disequilibrium, where informative variants are retrieved by using variable selection methods including boosting with probing and stochastic searches with the Adaptive Subspace method. Finally, joint prediction models with the selected variants are derived using statistical boosting. In contrast to alternative approaches relying on univariate summary statistics from genome-wide association studies, our three-step approach makes it possible to select and fit multivariable regression models on large-scale genotype data. Based on UK Biobank data, we develop prediction models for LDL-cholesterol as a continuous trait. Additionally, we consider a recent scalable algorithm for the Lasso. Results show that statistical learning approaches based on fine-mapping of genetic signals achieve prediction performance competitive with classical polygenic risk approaches, while yielding sparser risk models.
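The three steps described in the abstract might be sketched as follows in Python with NumPy. This is an illustration under stated assumptions, not the authors' implementation: the genotypes are simulated, the "blocks" are fixed-width windows standing in for linkage disequilibrium blocks, only boosting with probing is shown for the fine-mapping step (the Adaptive Subspace method is omitted), and all function names and tuning values (`keep`, `block_size`, `nu`, `n_iter`) are hypothetical.

```python
import numpy as np

def standardize(X):
    """Center and scale columns; constant columns are mapped to zeros."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd = np.where(sd > 0, sd, 1.0)
    return (X - mu) / sd, mu, sd

def marginal_screen(X, y, keep):
    """Step 1: keep the `keep` variants with the largest absolute marginal correlation."""
    Z, _, _ = standardize(X)
    score = np.abs(Z.T @ (y - y.mean()))
    return np.sort(np.argsort(score)[::-1][:keep])

def l2_boost(X, y, n_iter, nu=0.1, n_real=None):
    """Componentwise L2 boosting; returns (intercept, coefficients) on the original scale.
    If n_real is given, columns with index >= n_real are probes and fitting
    stops as soon as one of them would be selected."""
    n, p = X.shape
    if p == 0:
        return float(y.mean()), np.zeros(0)
    Z, mu, sd = standardize(X)
    beta = np.zeros(p)
    resid = y - y.mean()
    for _ in range(n_iter):
        coef = Z.T @ resid / n              # least-squares slope per standardized column
        j = int(np.argmax(np.abs(coef)))    # column with the best fit to the residuals
        if n_real is not None and j >= n_real:
            break                           # a probe won: stop
        beta[j] += nu * coef[j]
        resid = resid - nu * coef[j] * Z[:, j]
    beta_orig = beta / sd
    return float(y.mean() - mu @ beta_orig), beta_orig

def select_in_block(X_block, y, rng, max_iter=500):
    """Step 2: boosting with probing inside one block.
    Probes are row-permuted copies of the block's variants."""
    probes = np.column_stack([rng.permutation(col) for col in X_block.T])
    k = X_block.shape[1]
    _, beta = l2_boost(np.hstack([X_block, probes]), y, max_iter, n_real=k)
    return np.flatnonzero(beta[:k])

def three_step_fit(X, y, keep=100, block_size=50, n_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    screened = marginal_screen(X, y, keep)                      # step 1
    selected = []
    for start in range(0, X.shape[1], block_size):              # step 2, block by block
        in_block = screened[(screened >= start) & (screened < start + block_size)]
        if in_block.size == 0:
            continue
        picked = select_in_block(X[:, in_block], y, rng)
        selected.extend(in_block[picked].tolist())
    selected = np.array(selected, dtype=int)
    intercept, beta = l2_boost(X[:, selected], y, n_iter)       # step 3: joint model
    return selected, intercept, beta

# Simulated data (assumption): 500 individuals, 300 independent variants, 5 causal.
data_rng = np.random.default_rng(1)
n, p = 500, 300
X = data_rng.binomial(2, 0.3, size=(n, p)).astype(float)
causal = [10, 70, 130, 190, 250]
effects = np.array([1.0, -1.0, 0.8, -0.8, 1.2])
y = X[:, causal] @ effects + data_rng.normal(size=n)

selected, intercept, beta = three_step_fit(X, y)
pred = intercept + X[:, selected] @ beta
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print("variants in final model:", len(selected), "in-sample R^2:", round(r2, 3))
```

The design choice worth noting is that step 2 runs on each block separately, so the per-block fits are small and can be parallelized, while step 3 refits all retained variants jointly so that the final effect estimates come from a multivariable model rather than from univariate summary statistics.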

Identifiers

DOI: 10.1002/gepi.22495 PMID: 35938382

Chemical substances

Cholesterol, LDL 0

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

589-603


Authors

Carlo Maj

Christian Staerk

Oleg Borisov

Hannah Klinkhammer

Ming Wai Yeung

Peter Krawitz

Andreas Mayr
