SurvdigitizeR: an algorithm for automated survival curve digitization.

Keywords: Automated digitization; Kaplan–Meier curve; Meta-analysis; R package; Shiny application; Survival analysis

Journal

BMC medical research methodology
ISSN: 1471-2288
Abbreviated title: BMC Med Res Methodol
Country: England
NLM ID: 100968545

Publication information

Publication date:
13 Jul 2024
History:
received: 21 03 2024
accepted: 02 07 2024
medline: 14 7 2024
pubmed: 14 7 2024
entrez: 13 7 2024
Status: epublish

Abstract

Decision analytic models and meta-analyses often rely on survival probabilities that are digitized from published Kaplan-Meier (KM) curves. However, manually extracting these probabilities from KM curves is time-consuming, expensive, and error-prone. We developed an efficient and accurate algorithm that automates extraction of survival probabilities from KM curves. The automated digitization algorithm processes images in JPG or PNG format, converts them to the hue, saturation, and lightness (HSL) color space, and uses optical character recognition to detect axis location and labels. It also uses a k-medoids clustering algorithm to separate multiple overlapping curves on the same figure. To validate performance, we generated survival plots from random time-to-event data with sample sizes of 25, 50, 150, 250, and 1000 individuals split into 1, 2, or 3 treatment arms. We assumed an exponential distribution and applied random censoring. We compared automated digitization with manual digitization performed by well-trained researchers. We calculated the root mean squared error (RMSE) at 100 time points for both methods. The algorithm's performance was also evaluated by Bland-Altman analysis of the agreement between automated and manual digitization on a real-world set of published KM curves. The automated digitizer accurately identified survival probabilities over time in the simulated KM curves. The average RMSE for automated digitization was 0.012, while manual digitization had an average RMSE of 0.014. The algorithm's performance was negatively correlated with the number of curves in a figure and the presence of censoring markers. In real-world scenarios, automated digitization and manual digitization showed very close agreement. The algorithm streamlines the digitization process and requires minimal user input. It effectively digitized KM curves in simulated and real-world scenarios, demonstrating accuracy comparable to conventional manual digitization.
The algorithm has been developed as an open-source R package and as a Shiny application and is available on GitHub: https://github.com/Pechli-Lab/SurvdigitizeR and https://pechlilab.shinyapps.io/SurvdigitizeR/ .
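
The validation described in the abstract (exponential event times with random censoring, RMSE at 100 time points, Bland-Altman agreement) can be illustrated with a minimal Python sketch. This is not the SurvdigitizeR code or its API: the function names, the event and censoring rates, and the Gaussian noise that stands in for a digitized curve are all assumptions made for illustration only.

```python
import bisect
import math
import random
import statistics


def simulate_arm(n, rate, censor_rate, rng):
    """Exponential event times with independent exponential censoring (assumed setup)."""
    data = []
    for _ in range(n):
        event_time = rng.expovariate(rate)
        censor_time = rng.expovariate(censor_rate)
        # Store the observed time and whether the event (not censoring) was observed.
        data.append((min(event_time, censor_time), event_time <= censor_time))
    return data


def kaplan_meier(data):
    """Return (times, probs): KM survival just after each distinct event time."""
    data = sorted(data)
    at_risk = len(data)
    surv = 1.0
    times, probs = [], []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all observations that share the same observed time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1.0 - deaths / at_risk
            times.append(t)
            probs.append(surv)
        at_risk -= removed
    return times, probs


def step_value(times, probs, t):
    """Evaluate a right-continuous survival step function at t, with S(0) = 1."""
    k = bisect.bisect_right(times, t)
    return 1.0 if k == 0 else probs[k - 1]


def rmse_on_grid(reference, digitized, t_max, n_points=100):
    """RMSE between two step curves on an evenly spaced grid of n_points times."""
    grid = [t_max * (j + 1) / n_points for j in range(n_points)]
    squared = [
        (step_value(*reference, t) - step_value(*digitized, t)) ** 2 for t in grid
    ]
    return math.sqrt(sum(squared) / n_points)


def bland_altman(a, b):
    """Mean difference (bias) and 95% limits of agreement between paired readings."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


rng = random.Random(1)
arm = simulate_arm(150, rate=0.1, censor_rate=0.05, rng=rng)  # hypothetical rates
reference = kaplan_meier(arm)
# Hypothetical "digitized" curve: the reference curve plus small reading noise.
noisy = [min(1.0, max(0.0, p + rng.gauss(0.0, 0.01))) for p in reference[1]]
digitized = (reference[0], noisy)

t_max = reference[0][-1]
print("RMSE at 100 time points:", rmse_on_grid(reference, digitized, t_max))
print("Bland-Altman (bias, lower, upper):", bland_altman(noisy, reference[1]))
```

In this sketch the reference curve is the KM estimate computed from the simulated data, so the RMSE measures only the injected reading noise; a real comparison would use survival probabilities read from a rendered plot image.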


Identifiers

pubmed: 39003440
doi: 10.1186/s12874-024-02273-8
pii: 10.1186/s12874-024-02273-8

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

147

Copyright information

© 2024. The Author(s).


Authors

Jasper Zhongyuan Zhang (JZ)

Child Health Evaluative Sciences, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, ON, Canada.
Biostatistics Division, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada.

Juan David Rios (JD)

Child Health Evaluative Sciences, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, ON, Canada.

Tilemanchos Pechlivanoglou (T)

Lassonde School of Engineering, York University, Toronto, ON, Canada.

Alan Yang (A)

Child Health Evaluative Sciences, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, ON, Canada.

Qiyue Zhang (Q)

Child Health Evaluative Sciences, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, ON, Canada.

Dimitrios Deris (D)

Michael G. DeGroote School of Medicine, McMaster University, Hamilton, ON, Canada.

Ian Cromwell (I)

Canada's Drug Agency, Ottawa, ON, Canada.

Petros Pechlivanoglou (P)

Child Health Evaluative Sciences, Peter Gilgan Centre for Research and Learning, The Hospital for Sick Children, Toronto, ON, Canada. petros.pechlivanoglou@sickkids.ca.
Institute of Health Policy, Management and Evaluation, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, Canada. petros.pechlivanoglou@sickkids.ca.
