On the role of benchmarking data sets and simulations in method comparison studies.
Keywords: benchmarking; machine learning; neutral comparison studies; simulation studies
Journal
Biometrical journal. Biometrische Zeitschrift
ISSN: 1521-4036
Abbreviated title: Biom J
Country: Germany
NLM ID: 7708048
Publication information
Publication date: 21 Feb 2023
History:
received: 2 Aug 2022
revised: 26 Jan 2023
accepted: 1 Feb 2023
entrez: 22 Feb 2023
pubmed: 23 Feb 2023
medline: 23 Feb 2023
Status: Ahead of print
Abstract
Method comparisons are essential to provide recommendations and guidance for applied researchers, who often have to choose from a plethora of available approaches. While many comparisons exist in the literature, these are often not neutral but favor a novel method. Apart from the choice of design and proper reporting of the findings, there are different approaches concerning the underlying data for such method comparison studies. Most manuscripts on statistical methodology rely on simulation studies and provide a single real-world data set as an example to motivate and illustrate the methodology investigated. In the context of supervised learning, in contrast, methods are often evaluated using so-called benchmarking data sets, that is, real-world data that serve as a gold standard in the community. Simulation studies, on the other hand, are much less common in this context. The aim of this paper is to investigate differences and similarities between these approaches, to discuss their advantages and disadvantages, and ultimately to develop new approaches to the evaluation of methods that pick the best of both worlds. To this end, we borrow ideas from different contexts, such as mixed methods research and Clinical Scenario Evaluation.
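As an illustration of the kind of simulation study the abstract contrasts with benchmarking on real data, the following minimal sketch compares two hypothetical methods (sample mean vs. sample median as location estimators) under a contaminated data-generating process. All settings here (sample size, contamination rate, number of repetitions) are assumptions chosen for illustration and are not taken from the paper.

```python
import random
import statistics

# A minimal simulation-study sketch: data are Gaussian with occasional
# heavy-tailed outliers; we compare two location estimators by their
# mean squared error (MSE) for the true location mu = 0.
random.seed(42)

def simulate_once(n=50, contamination=0.1):
    # Each observation comes from N(0, 10) with probability `contamination`,
    # otherwise from N(0, 1).
    return [random.gauss(0, 10) if random.random() < contamination
            else random.gauss(0, 1) for _ in range(n)]

def simulation_study(n_reps=2000):
    # Repeat the data generation and accumulate each method's squared error.
    sse = {"mean": 0.0, "median": 0.0}
    for _ in range(n_reps):
        data = simulate_once()
        sse["mean"] += statistics.fmean(data) ** 2
        sse["median"] += statistics.median(data) ** 2
    return {method: total / n_reps for method, total in sse.items()}

results = simulation_study()
print(results)  # under contamination, the median is typically more robust
```

Unlike evaluation on a fixed benchmarking data set, the simulated setup makes the "truth" (here, mu = 0) known by construction, so estimation error can be measured directly; the trade-off, as the abstract notes, is that simulated data may not reflect real-world complexity.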
Identifiers
pubmed: 36810737
doi: 10.1002/bimj.202200212
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
e2200212
Grants
Agency: Deutsche Forschungsgemeinschaft
ID: FR 3070/3-1
Agency: Deutsche Forschungsgemeinschaft
ID: FR 3070/4-1
Agency: Deutsche Forschungsgemeinschaft
ID: FR 4121/2-1
Copyright information
© 2023 The Authors. Biometrical Journal published by Wiley-VCH GmbH.