Calibrating machine learning approaches for probability estimation: A comprehensive comparison.
calibration
logistic regression
machine learning
probability estimation
probability machine
updating
Journal
Statistics in Medicine
ISSN: 1097-0258
Abbreviated title: Stat Med
Country: England
NLM ID: 8215016
Publication information
Publication date: 20 Dec 2023
History:
revised: 30 Aug 2023
received: 20 Sep 2021
accepted: 18 Sep 2023
medline: 20 Nov 2023
pubmed: 18 Oct 2023
entrez: 18 Oct 2023
Status: ppublish
Abstract
Statistical prediction models have gained popularity in applied research. One challenge is the transfer of a prediction model to a new population that may be structurally different from the population for which the model was developed. The model can be adapted by calibrating it to the characteristics of the target population, for which numerous calibration techniques exist. In view of this diversity, we performed a systematic evaluation of popular calibration approaches used by the statistical and the machine learning communities for estimating two-class probabilities. In this work, we first provide a review of the literature and, second, present the results of a comprehensive simulation study. The calibration approaches are compared with respect to their empirical properties and relationships, their ability to generalize precise probability estimates to external populations, and their availability in easy-to-use software implementations. Third, we provide code from a real data analysis so that researchers can apply the approaches themselves. Logistic calibration and beta calibration, which estimate an intercept plus one and two slope parameters, respectively, consistently showed the best results in the simulation studies. Calibration on logit-transformed probability estimates generally outperformed calibration methods on nontransformed estimates. In case of structural differences between training and validation data, re-estimation of the entire prediction model should be weighed against the sample size of the validation data. For updating probability estimates in validation studies, we recommend regression-based calibration approaches that use transformed probability estimates and estimate at least one slope in addition to an intercept.
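The two recommended approaches can be illustrated with a minimal sketch. This is not the authors' code; it assumes the commonly used parameterizations logit(p_cal) = a + b·logit(p) for logistic calibration and logit(p_cal) = c + a·ln(p) − b·ln(1 − p) for beta calibration, fits both by unconstrained maximum likelihood (any sign constraints on the beta calibration slopes are omitted), and uses only the Python standard library. All function names and the simulated data are hypothetical.

```python
import math
import random

EPS = 1e-12  # keeps log() finite when an estimate is exactly 0 or 1


def sigmoid(t):
    """Numerically stable inverse logit."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logit_features(p):
    """Design row for logistic calibration: intercept + logit(p)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return [1.0, math.log(p / (1.0 - p))]


def beta_features(p):
    """Design row for beta calibration: intercept + ln(p) and -ln(1 - p)."""
    p = min(max(p, EPS), 1.0 - EPS)
    return [1.0, math.log(p), -math.log(1.0 - p)]


def solve(A, b):
    """Solve the small linear system A x = b by Gaussian elimination."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_calibrator(p_raw, y, features, n_iter=100, tol=1e-10):
    """Fit the calibration model by maximum likelihood (Newton-Raphson).

    Returns (coefficients, function mapping a raw estimate to a calibrated one).
    """
    X = [features(p) for p in p_raw]
    k = len(X[0])
    coef = [0.0] * k
    for _ in range(n_iter):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            mu = sigmoid(sum(c * v for c, v in zip(coef, xi)))
            for a in range(k):
                grad[a] += (yi - mu) * xi[a]
                for b in range(k):
                    hess[a][b] += mu * (1.0 - mu) * xi[a] * xi[b]
        step = solve(hess, grad)
        coef = [c + s for c, s in zip(coef, step)]
        if max(abs(s) for s in step) < tol:
            break
    final = list(coef)
    return final, lambda p: sigmoid(sum(c * v for c, v in zip(final, features(p))))


def log_loss(p, y):
    """Mean negative Bernoulli log-likelihood."""
    total = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, EPS), 1.0 - EPS)
        total -= yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return total / len(y)


def simulate(n, seed=0):
    """Toy validation set: raw estimates are compressed and shifted downward."""
    rng = random.Random(seed)
    p_raw, y = [], []
    for _ in range(n):
        z = rng.gauss(0.0, 1.5)               # true log-odds
        y.append(1 if rng.random() < sigmoid(z) else 0)
        p_raw.append(sigmoid(0.5 * z - 1.0))  # miscalibrated estimate
    return p_raw, y


if __name__ == "__main__":
    p_raw, y = simulate(4000, seed=1)
    for name, feats in [("logistic", logit_features), ("beta", beta_features)]:
        coef, calibrate = fit_calibrator(p_raw, y, feats)
        print(name, [round(c, 2) for c in coef],
              round(log_loss([calibrate(p) for p in p_raw], y), 4))
```

In this toy setup the true log-odds equal 2 + 2·logit(p_raw), so the logistic calibration fit should land near an intercept of 2 and a slope of 2. Beta calibration reduces to logistic calibration when its two slopes are equal, which is why its in-sample log loss cannot be worse than that of the logistic fit.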
Publication types
Review
Journal Article
Languages
eng
Citation subsets
IM
Pagination
5451-5478
Copyright information
© 2023 The Authors. Statistics in Medicine published by John Wiley & Sons Ltd.