Adaptive predictor-set linear model: An imputation-free method for linear regression prediction on data sets with missing values.

MeSH terms: Linear Models; Biometry / methods; Humans

Keywords: epigenetic aging clock; linear regression; missing values; predictive modeling; privacy

Journal

Biometrical journal. Biometrische Zeitschrift

ISSN: 1521-4036

Abbreviated title: Biom J

Country: Germany

NLM ID: 7708048

Publication information

Publication date:
Jun 2024

History:

received: 24 03 2023

revised: 25 03 2024

accepted: 01 04 2024

medline: 30 5 2024

pubmed: 30 5 2024

entrez: 30 5 2024

Status: ppublish

Abstract

Linear regression (LR) is widely used in data analysis for continuous outcomes in biomedicine and epidemiology. Despite its popularity, LR is incompatible with missing data, which frequently occur in the health sciences. For parameter estimation, this shortcoming is usually resolved by complete-case analysis or imputation. Both workarounds, however, are inadequate for prediction: complete-case analysis cannot predict on incomplete records, while imputation ignores the missingness-induced reduction in prediction accuracy and relies on (unrealistic) assumptions about the missingness mechanism. Here, we derive the adaptive predictor-set linear model (aps-lm), which makes predictions for incomplete data without the need for imputation. It is derived by using a predictor-selection operation, the Moore-Penrose pseudoinverse, and the reduced QR decomposition. aps-lm is a generalization of LR that inherently handles missing values. It is applied to a reference data set, where complete predictors and outcome are available, and yields a set of privacy-preserving parameters. In a second stage, these parameters are shared to predict the outcome on external data sets with missing predictor entries, without imputation. Moreover, aps-lm computes prediction errors that account for the pattern of missing values, even under extreme missingness. We benchmark aps-lm in a simulation study. aps-lm showed greater prediction accuracy and reduced bias compared to popular imputation strategies under a wide range of scenarios, including variation of sample size, goodness of fit, missing value type, and covariance structure. Finally, as a proof of principle, we apply aps-lm in the context of epigenetic aging clocks, linear models that predict a person's biological age from epigenetic data, with promising clinical applications.
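
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' aps-lm derivation (which the abstract describes in terms of a predictor-selection operation and the reduced QR decomposition); it only assumes that the shared parameters are cross-product summaries of the complete reference data, from which a least-squares submodel on the observed predictors is solved for each incomplete record. The names `fit_reference` and `predict_incomplete`, the simulated data, and the residual standard deviation reported per record are all assumptions made for illustration.

```python
import numpy as np

def fit_reference(X, y):
    """Stage 1: summarise complete reference data by cross-products only.
    No individual-level rows are kept (hypothetical stand-in for the shared parameters)."""
    Z = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    return {"ZtZ": Z.T @ Z, "Zty": Z.T @ y, "yty": float(y @ y), "n": len(y)}

def predict_incomplete(stats, X_new):
    """Stage 2: predict each row from the predictors it has (NaN marks missing)."""
    preds = np.empty(len(X_new))
    resid_sd = np.empty(len(X_new))
    for i, row in enumerate(X_new):
        obs = ~np.isnan(row)
        # Select the intercept plus the observed predictors.
        idx = np.concatenate([[0], 1 + np.flatnonzero(obs)])
        A = stats["ZtZ"][np.ix_(idx, idx)]
        b = stats["Zty"][idx]
        beta = np.linalg.pinv(A) @ b  # least-squares coefficients of the submodel
        preds[i] = np.concatenate([[1.0], row[obs]]) @ beta
        # Residual sum of squares of the submodel on the reference data.
        rss = stats["yty"] - b @ beta
        resid_sd[i] = np.sqrt(max(rss, 0.0) / (stats["n"] - len(idx)))
    return preds, resid_sd

# Simulated reference data (assumed values, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
stats = fit_reference(X, y)

# External records: complete, first predictor missing, all predictors missing.
X_new = np.array([[0.3, -0.2, 1.0],
                  [np.nan, -0.2, 1.0],
                  [np.nan, np.nan, np.nan]])
preds, sd = predict_incomplete(stats, X_new)
```

In this sketch a record with every predictor missing falls back to the reference mean of the outcome, and the residual standard deviation grows as informative predictors drop out, which mirrors the abstract's point that prediction error should reflect the missingness pattern.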

Identifiers

DOI: 10.1002/bimj.202300090 PMID: 38813859

Publication types

Journal Article

Languages

eng

Pagination

e2300090

Grants

Agency: Erasmus Universitair Medisch Centrum Rotterdam

Agency: Deutsche Forschungsgemeinschaft

ID: FOR2488

Authors

Benjamin Planterose Jiménez

Manfred Kayser

Athina Vidaki

Amke Caliebe