Calibrating machine learning approaches for probability estimation: A comprehensive comparison.

Humans; Logistic Models; Machine Learning; Models, Statistical; Software; Probability

calibration; logistic regression; machine learning; probability estimation; probability machine; updating

Journal

Statistics in medicine

ISSN: 1097-0258

Abbreviated title: Stat Med

Country: England

NLM ID: 8215016

Publication information

Publication date:
20 Dec 2023

History:

received: 20 09 2021

revised: 30 08 2023

accepted: 18 09 2023

medline: 20 11 2023

pubmed: 18 10 2023

entrez: 18 10 2023

Status: ppublish

Abstract

Statistical prediction models have gained popularity in applied research. One challenge is the transfer of a prediction model to a different population, which may be structurally different from the population for which the model was developed. An adaptation to the new population can be achieved by calibrating the model to the characteristics of the target population, for which numerous calibration techniques exist. In view of this diversity, we performed a systematic evaluation of various popular calibration approaches used by the statistical and the machine learning communities for estimating two-class probabilities. In this work, we first provide a review of the literature and, second, present the results of a comprehensive simulation study. The calibration approaches are compared with respect to their empirical properties and relationships, their ability to generalize precise probability estimates to external populations, and their availability in terms of easy-to-use software implementations. Third, we provide code from a real data analysis so that researchers can apply the approaches themselves. Logistic calibration and beta calibration, which estimate an intercept plus one and two slope parameters, respectively, consistently showed the best results in the simulation studies. Calibration on logit-transformed probability estimates generally outperformed calibration methods on nontransformed estimates. In the case of structural differences between training and validation data, re-estimation of the entire prediction model should be weighed against the sample size of the validation data. We recommend regression-based calibration approaches using transformed probability estimates, where at least one slope is estimated in addition to an intercept, for updating probability estimates in validation studies.
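The two recommended approaches can be illustrated with a minimal sketch. This is not the authors' code: it assumes NumPy, a hand-written Newton-Raphson fit, simulated data, and hypothetical function names. Logistic calibration is taken here as logit(p_cal) = a + b * logit(p), and beta calibration as logit(p_cal) = c + a * ln(p) - b * ln(1 - p), following the parametrization of Kull et al.; the non-negativity constraint on the two beta slopes is omitted for brevity.

```python
import numpy as np

EPS = 1e-6  # assumed clipping bound that keeps log() and logit() finite


def clip_prob(p):
    """Clip probabilities away from 0 and 1."""
    return np.clip(np.asarray(p, dtype=float), EPS, 1.0 - EPS)


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + np.exp(-x))


def logistic_design(p):
    """Intercept plus one slope on logit(p): logistic calibration."""
    p = clip_prob(p)
    return np.column_stack([np.ones_like(p), np.log(p / (1.0 - p))])


def beta_design(p):
    """Intercept plus slopes on ln(p) and -ln(1 - p): beta calibration."""
    p = clip_prob(p)
    return np.column_stack([np.ones_like(p), np.log(p), -np.log(1.0 - p)])


def fit_calibration(p_raw, y, design, n_iter=50, tol=1e-8):
    """Unpenalized logistic regression of y on design(p_raw) via Newton-Raphson."""
    X = design(p_raw)
    y = np.asarray(y, dtype=float)
    coef = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = expit(X @ coef)
        grad = X.T @ (y - mu)                            # score vector
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])    # observed information
        step = np.linalg.solve(hess, grad)
        coef = coef + step
        if np.max(np.abs(step)) < tol:
            break
    return coef


def apply_calibration(coef, p_raw, design):
    """Map raw probability estimates to calibrated ones."""
    return expit(design(p_raw) @ coef)


def brier(p, y):
    """Brier score: mean squared difference between estimate and outcome."""
    return float(np.mean((np.asarray(p, dtype=float) - np.asarray(y)) ** 2))


def simulate_miscalibrated(n=5000, seed=0):
    """Hypothetical data: true logit is z, the raw model reports expit(2z + 1)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    y = rng.binomial(1, expit(z))
    return expit(2.0 * z + 1.0), y


p_raw, y = simulate_miscalibrated()
for name, design in [("logistic", logistic_design), ("beta", beta_design)]:
    coef = fit_calibration(p_raw, y, design)
    p_cal = apply_calibration(coef, p_raw, design)
    print(name, np.round(coef, 2), "Brier raw/calibrated:",
          round(brier(p_raw, y), 4), round(brier(p_cal, y), 4))
```

In this simulation the true recalibration map on the logit scale has intercept -0.5 and slope 0.5, so the fitted logistic calibration coefficients should land near those values. In a validation study the coefficients would be estimated on the validation data, and sample size then matters more for the three-parameter beta fit than for the two-parameter logistic fit.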

Identifiers

DOI: 10.1002/sim.9921 PMID: 37849356

Publication types

Review; Journal Article

Languages

eng

Citation subsets

Pagination

5451-5478

Copyright information

References

Diamond GA, Forrester JS. Analysis of probability as an aid in the clinical diagnosis of coronary-artery disease. N Engl J Med. 1979;300:1350-1358. doi:10.1056/NEJM197906143002402

Xie G, Wang R, Shang L, et al. Calculating the overall survival probability in patients with cervical cancer: a nomogram and decision curve analysis-based study. BMC Cancer. 2020;20:833. doi:10.1186/s12885-020-07349-4

Boyer B, Cazorla C. Methods and probability of success after early revision of prosthetic joint infections with debridement, antibiotics and implant retention. Orthop Traumatol Surg Res. 2021;107:102774. doi:10.1016/j.otsr.2020.102774

Uttley AM. Temporal and spatial patterns in a conditional probability machine. In: Shannon CE, McCarthy J, eds. Automata Studies. Princeton: Princeton University Press; 1956:277-285.

Altman DG, Royston P. What do we mean by validating a prognostic model? Stat Med. 2000;19:453-473. doi:10.1002/(SICI)1097-0258(20000229)19:4<453::AID-SIM350>3.0.CO;2-5

Justice AC, Covinsky KE, Berlin JA. Assessing the generalizability of prognostic information. Ann Intern Med. 1999;130:515-524. doi:10.7326/0003-4819-130-6-199903160-00016

Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol. 2016;74:167-176. doi:10.1016/j.jclinepi.2015.12.005

Kull M, Silva Filho TM, Flach P. Beyond sigmoids: how to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electron J Statist. 2017;11:5052-5080. doi:10.1214/17-EJS1338SI

Böken B. On the appropriateness of Platt scaling in classifier calibration. Inf Syst. 2021;95:101641. doi:10.1016/j.is.2020.101641

Platt J. Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. In: Smola AJ, Bartlett PJ, Schölkopf B, Schuurmans D, eds. Advances in Large Margin Classifiers. Cambridge: MIT Press; 2000:61-74.

Fawcett T, Niculescu-Mizil A. PAV and the ROC convex hull. Mach Learn. 2007;68:97-106. doi:10.1007/s10994-007-5011-0

Zadrozny B, Elkan C. Transforming classifier scores into accurate multiclass probability estimates. In: Hand D, Keim DA, Ng R, eds. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery; 2002:694-699. doi:10.1145/775047.775151

Elkan C. The foundations of cost-sensitive learning. In: Nebel B, ed. Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence. Vol 2. San Francisco: Morgan Kaufmann; 2001:973-978.

Dankowski T, Ziegler A. Calibrating random forests for probability estimation. Stat Med. 2016;35:3949-3960. doi:10.1002/sim.6959

Dua D, Graff C. UCI Machine Learning Repository. Irvine, CA: School of Information and Computer Sciences, University of California; 2019. https://archive-beta.ics.uci.edu. Accessed June 1, 2023

R Core Team. R: a language and environment for statistical computing. 2022.

Miller ME, Hui SL, Tierney WM. Validation techniques for logistic regression models. Stat Med. 1991;10:1213-1226. doi:10.1002/sim.4780100805

Cox DR. Two further applications of a model for binary regression. Biometrika. 1958;45:562-565. doi:10.2307/2333203

Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. 2nd ed. Cham: Springer; 2019.

Lucena B. Spline-based probability calibration. arXiv 2018: 1809.07751. https://arxiv.org/abs/1809.07751. Accessed June 1, 2023.

Harrell FE Jr. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Cham: Springer; 2015.

Zhang J, Yang Y. Probabilistic score estimation with piecewise logistic regression. In: Greiner R, Schuurmans D, eds. Proceedings of the 21st International Conference on Machine Learning. New York: ACM Press; 2004:115-123.

Dormann CF. Calibration of probability predictions from machine-learning and statistical models. Glob Ecol Biogeogr. 2020;29:760-765. doi:10.1111/geb.13070

Landwehr N, Hall M, Frank E. Logistic model trees. Mach Learn. 2005;59:161-205. doi:10.1007/s10994-005-0466-3

Leathart T, Frank E, Holmes G, Pfahringer B. Probability calibration trees. In: Min-Ling Z, Yung-Kyun N, eds. Proceedings of the 9th Asian Conference on Machine Learning. Cambridge, MA: ML Research Press; 2017:145-160. http://proceedings.mlr.press/v77/leathart17a.html. Accessed June 1, 2023.

de Leeuw J, Hornik K, Mair P. Isotone optimization in R: pool-adjacent-violators algorithm (PAVA) and active set methods. J Stat Softw. 2009;32:1-24. doi:10.18637/jss.v032.i05

Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ. Calibration of machine learning models. In: Soria Olivas E, Martín Guerrero JD, Martinez Sober M, Magdalena Benedito JR, Serrano López AJ, eds. Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques. Hershey: IGI Global; 2010:128-146. doi:10.4018/978-1-60960-818-7.ch104

Huang Y, Li W, Macheret F, Gabriel RA, Ohno-Machado L. A tutorial on calibration measurements and calibration models for clinical prediction models. J Am Med Inform Assoc. 2020;27:621-633. doi:10.1093/jamia/ocz228

Dimitriadis T, Gneiting T, Jordan AI. Stable reliability diagrams for probabilistic classifiers. Proc Natl Acad Sci. 2021;118:e2016191118. doi:10.1073/pnas.2016191118

Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. Proc Conf AAAI Artif Intell. 2015;2015:2901-2907.

Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In: Brodley CE, Danyluk AP, eds. Proceedings of the 18th International Conference on Machine Learning (ICML 2001). Burlington: Morgan Kaufmann; 2001:609-616.

Chen W, Sahiner B, Samuelson F, Pezeshk A, Petrick N. Calibration of medical diagnostic classifier scores to the probability of disease. Stat Methods Med Res. 2018;27:1394-1409. doi:10.1177/0962280216661371

Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ. Similarity-binning averaging: a generalisation of binning calibration. In: Corchado E, Yin H, eds. Intelligent Data Engineering and Automated Learning - IDEAL 2009. Berlin: Springer; 2009:341-349. doi:10.1007/978-3-642-04394-9_42

Biau G, Cérou F, Guyader A. Rates of convergence of the functional k-nearest neighbor estimate. IEEE Transact Inform Theor. 2010;56:2034-2040. doi:10.1109/TIT.2010.2040857

Bella A, Ferri C, Hernández-Orallo J, Ramírez-Quintana MJ. On the effect of calibration in classifier combination. Appl Intell. 2013;38:566-585. doi:10.1007/s10489-012-0388-2

Jiang X, Osl M, Kim J, Ohno-Machado L. Calibrating predictive model estimates to support personalized medicine. J Am Med Inform Assoc. 2012;19:263-274. doi:10.1136/amiajnl-2011-000291

Brier GW. Verification of forecasts expressed in terms of probability. Mon Wea Rev. 1950;78:1-3. doi:10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Kruppa J, Liu Y, Diener HC, et al. Probability estimation with machine learning methods for dichotomous and multi-category outcome: applications. Biom J. 2014;56:564-583. doi:10.1002/bimj.201300077

Vovk V. The fundamental nature of the log loss function. In: Beklemishev LD, Blass A, Dershowitz N, Finkbeiner B, Schulte W, eds. Fields of Logic and Computation II: Essays Dedicated to Yuri Gurevich on the Occasion of his 75th Birthday. Cham: Springer; 2015:307-318. doi:10.1007/978-3-319-23534-9_20

Gneiting T, Raftery AE. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2007;102:359-378. doi:10.1198/016214506000001437

Malley JD, Kruppa J, Dasgupta A, Malley KG, Ziegler A. Probability machines: consistent probability estimation using nonparametric learning machines. Methods Inf Med. 2012;51:74-81. doi:10.3414/ME00-01-0052

Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1-30.

Mease D, Wyner AJ, Buja A. Boosted classification trees and class probability/quantile estimation. J Mach Learn Res. 2007;8:409-439.

Weimar C, Ziegler A, König IR, Diener HC. On behalf of the German stroke study collaborators. Predicting functional outcome and survival after acute ischemic stroke. J Neurol. 2002;249:888-895. doi:10.1007/s00415-002-0755-8

Weimar C, König IR, Kraywinkel K, Ziegler A, Diener HC, German Stroke Study Collaboration. Age and National Institutes of Health stroke scale score within 6 hours after onset are accurate predictors of outcome after cerebral ischemia: development and external validation of prognostic models. Stroke. 2004;35:158-162. doi:10.1161/01.STR.0000106761.94985.8B

Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594. doi:10.1136/bmj.g7594

König IR, Weimar C, Diener HC, Ziegler A. Vorhersage des Funktionsstatus 100 Tage nach einem ischämischen Schlaganfall: design einer prospektiven Studie zur externen Validierung eines prognostischen Modells. Z Arztl Fortbild Qualitatssich. 2003;97:717-722.

Mahoney FI, Barthel DW. Functional evaluation: the Barthel index. Md Med J. 1965;14:56-61.

König IR, Malley JD, Weimar C, Diener HC, Ziegler A. On behalf of the German stroke study collaboration. Practical experiences on the necessity of external validation. Stat Med. 2007;26:5499-5511. doi:10.1002/sim.3069

Watson DS, Wright MN. Testing conditional independence in supervised learning algorithms. Mach Learn. 2021;110:2107-2129. doi:10.1007/s10994-021-06030-6

Seiffert M, Ojeda F, Müllerleile K, et al. Reducing radiation exposure during invasive coronary angiography and percutaneous coronary interventions implementing a simple four-step protocol. Clin Res Cardiol. 2015;104:500-506. doi:10.1007/s00392-015-0814-7

Detrano R, Janosi A, Steinbrunn W, et al. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol. 1989;64:304-310. doi:10.1016/0002-9149(89)90524-9

Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc B. 2005;67:301-320. doi:10.1111/j.1467-9868.2005.00503.x

Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1-22. doi:10.18637/jss.v033.i01

Bühlmann P, Hothorn T. Boosting algorithms: regularization, prediction and model fitting. Stat Sci. 2007;22:477-505. doi:10.1214/07-STS242

Hofner B, Mayr A, Robinzonov N, Schmid M. Model-based boosting in R: a hands-on tutorial using the R package mboost. Comput Stat. 2014;29:3-35. doi:10.1007/s00180-012-0382-5

Mayr A, Binder H, Gefeller O, Schmid M. The evolution of boosting algorithms - from machine learning to statistical modelling. Methods Inf Med. 2014;53:419-427. doi:10.3414/ME13-01-0122

Ziegler A, König IR. Mining data with random forests: current options for real-world applications. WIRE Data Mining Knowl Discov. 2014;4:55-63. doi:10.1002/widm.1114

Wright MN, Ziegler A. Ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw. 2017;77:1.

Kruppa J, Liu Y, Biau G, et al. Probability estimation with machine learning methods for dichotomous and multicategory outcome: theory. Biom J. 2014;56:534-563. doi:10.1002/bimj.201300068

Christmann A, Steinwart I. Support Vector Machines. New York: Springer; 2008.

Karatzoglou A, Smola A, Hornik K, Zeileis A. Kernlab - an S4 package for kernel methods in R. J Stat Softw. 2004;11:1-20.

Torgo L. An infra-structure for performance estimation and experimental comparison of predictive models in R. arXiv 2015: 1412.0436v4. 2015 https://arxiv.org/abs/1412.0436v4. Accessed June 1, 2023.

Xu P, Davoine F, Zha H, Denœux T. Evidential calibration of binary SVM classifiers. Int J Approx Reason. 2016;72:55-70. doi:10.1016/j.ijar.2015.05.002

Austin PC, Steyerberg EW. Events per variable (EPV) and the relative performance of different strategies for estimating the out-of-sample validity of logistic regression models. Stat Methods Med Res. 2017;26:796-808. doi:10.1177/0962280214558972

Authors

Francisco M Ojeda (FM)

Max L Jansen (ML)

Alexandre Thiéry (A)

Stefan Blankenberg (S)

Christian Weimar (C)

Matthias Schmid (M)

Andreas Ziegler (A)
