Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values.

Keywords: epigenetic aging clock; linear regression; missing values; predictive modeling; privacy

Journal

Biometrical journal. Biometrische Zeitschrift
ISSN: 1521-4036
Abbreviated title: Biom J
Country: Germany
NLM ID: 7708048

Publication information

Publication date:
Jun 2024
History:
received: 24 03 2023
revised: 25 03 2024
accepted: 01 04 2024
medline: 30 5 2024
pubmed: 30 5 2024
entrez: 30 5 2024
Status: ppublish

Abstract

Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR cannot handle missing data, which occur frequently in the health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both workarounds, however, are inadequate for prediction: complete-case analysis cannot predict on incomplete records, while imputation ignores the missingness-induced reduction in prediction accuracy and relies on (unrealistic) assumptions about the missingness mechanism. Here, we derive the adaptive predictor-set linear model (aps-lm), which makes predictions for incomplete data without the need for imputation. It is derived using a predictor-selection operation, the Moore-Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is a generalization of LR that inherently handles missing values. It is applied to a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these parameters are shared to predict the outcome on external data sets with missing predictor entries, without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values, even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios, including variation in sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof of principle, we apply aps-lm to epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data and have promising clinical applications.
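The two-stage idea in the abstract (summarize a complete reference set once, then predict on incomplete external records from the observed predictors only) can be illustrated with a minimal sketch. This is not the authors' aps-lm derivation: it swaps in a simpler moment-based submodel that stores the reference means and covariance matrix and solves for coefficients on the observed predictors with the Moore-Penrose pseudoinverse. The function names `fit_reference` and `predict_incomplete`, and the use of `NaN` to mark missing entries, are assumptions made for illustration.

```python
import numpy as np

def fit_reference(X, y):
    """Stage 1: summarize a complete reference set by means and covariances.

    Only these aggregate statistics (not individual records) need to be shared.
    """
    Z = np.column_stack([X, y])      # predictors first, outcome in last column
    mu = Z.mean(axis=0)
    S = np.cov(Z, rowvar=False)      # joint covariance of predictors and outcome
    return mu, S

def predict_incomplete(x_new, mu, S):
    """Stage 2: predict y for one record; NaN marks a missing predictor.

    Returns the prediction and the residual variance of the submodel that
    uses only the observed predictors (hypothetical helper, not aps-lm itself).
    """
    p = len(mu) - 1                              # index of the outcome
    obs = np.flatnonzero(~np.isnan(x_new))       # predictor-selection step
    if obs.size == 0:
        return mu[p], S[p, p]                    # nothing observed: fall back to the mean
    S_oo = S[np.ix_(obs, obs)]                   # covariance among observed predictors
    s_oy = S[obs, p]                             # covariance of observed predictors with y
    beta = np.linalg.pinv(S_oo) @ s_oy           # submodel slopes via pseudoinverse
    pred = mu[p] + (x_new[obs] - mu[obs]) @ beta
    resid_var = S[p, p] - s_oy @ beta            # grows as informative predictors go missing
    return pred, resid_var

# Simulated, noise-free reference data: y = 1 + 2*x1 - 3*x2
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(200, 2))
y_ref = 1.0 + 2.0 * X_ref[:, 0] - 3.0 * X_ref[:, 1]
mu, S = fit_reference(X_ref, y_ref)

pred_full, var_full = predict_incomplete(np.array([1.0, 1.0]), mu, S)
pred_miss, var_miss = predict_incomplete(np.array([1.0, np.nan]), mu, S)
```

With both predictors observed, the sketch reproduces the ordinary least-squares prediction; with `x2` missing, it still returns a prediction, and the residual variance increases to reflect the lost information, which mirrors the abstract's point that prediction error should depend on the missingness pattern.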

Identifiers

pubmed: 38813859
doi: 10.1002/bimj.202300090

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

e2300090

Grants

Agency: Erasmus Universitair Medisch Centrum Rotterdam
Agency: Deutsche Forschungsgemeinschaft
ID: FOR2488

Copyright information

© 2024 The Author(s). Biometrical Journal published by Wiley‐VCH GmbH.

Authors

Benjamin Planterose Jiménez (B)

Department of Genetic Identification, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.

Manfred Kayser (M)

Department of Genetic Identification, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.

Athina Vidaki (A)

Department of Genetic Identification, Erasmus MC, University Medical Center Rotterdam, Rotterdam, the Netherlands.

Amke Caliebe (A)

Institute of Medical Informatics and Statistics, Kiel University, Kiel, Germany.
University Medical Centre Schleswig-Holstein, Kiel, Germany.
