Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians.
Journal
Nature medicine
ISSN: 1546-170X
Abbreviated title: Nat Med
Country: United States
NLM ID: 9502015
Publication information
Publication date:
07 2023
History:
received: 02 11 2022
accepted: 05 06 2023
medline: 21 7 2023
pubmed: 18 7 2023
entrez: 17 7 2023
Status:
ppublish
Abstract
Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5-15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel futuristic clinical settings, we present results showing that CoDoC's performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
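The deferral idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' CoDoC implementation: the binning rule, the function names (`fit_deferral_bins`, `predict`), the 0.5 AI threshold and the toy data are all assumptions made for illustration. On a tuning set, the sketch groups cases by AI confidence score and defers to the clinician in any score bin where the clinician was correct more often than the AI.

```python
from typing import List, Sequence


def fit_deferral_bins(
    ai_scores: Sequence[float],
    clinician_preds: Sequence[int],
    labels: Sequence[int],
    n_bins: int = 5,
    ai_threshold: float = 0.5,
) -> List[bool]:
    """Learn, per AI-score bin, whether to defer to the clinician.

    Hypothetical rule (an assumption, not the published method): defer in a
    bin when the clinician was correct strictly more often than the AI there.
    """
    ai_correct = [0] * n_bins
    clin_correct = [0] * n_bins
    for score, clin, y in zip(ai_scores, clinician_preds, labels):
        b = min(int(score * n_bins), n_bins - 1)  # bin index for scores in [0, 1]
        ai_pred = int(score >= ai_threshold)
        ai_correct[b] += int(ai_pred == y)
        clin_correct[b] += int(clin == y)
    return [clin_correct[b] > ai_correct[b] for b in range(n_bins)]


def predict(
    score: float,
    clinician_pred: int,
    defer_bins: Sequence[bool],
    ai_threshold: float = 0.5,
) -> int:
    """Return the clinician's opinion in deferral bins, else the AI's."""
    n_bins = len(defer_bins)
    b = min(int(score * n_bins), n_bins - 1)
    if defer_bins[b]:
        return clinician_pred
    return int(score >= ai_threshold)


# Toy tuning data (fabricated for illustration only, not from the study).
scores = [0.05, 0.1, 0.45, 0.55, 0.5, 0.9, 0.95]
labels = [0, 0, 1, 0, 1, 1, 1]
clinician = [0, 0, 1, 0, 1, 1, 0]

defer = fit_deferral_bins(scores, clinician, labels)
print(defer)  # [False, False, True, False, False]: defer only for mid-range scores
print(predict(0.48, 1, defer))  # 1, taken from the clinician
print(predict(0.92, 0, defer))  # 1, taken from the AI
```

In this toy example the AI is trusted at confident scores and the clinician is consulted only for mid-range scores, which is how a deferral rule can cut clinician workload while avoiding errors in the region where the AI is least reliable.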
Identifiers
pubmed: 37460754
doi: 10.1038/s41591-023-02437-x
pii: 10.1038/s41591-023-02437-x
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
1814-1820
Comments and corrections
Type: CommentIn
Copyright information
© 2023. The Author(s), under exclusive licence to Springer Nature America, Inc.
References
Ruamviboonsuk, P. et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit. Med. 2, 25 (2019).
doi: 10.1038/s41746-019-0099-8
pmcid: 6550283
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
doi: 10.1038/s41586-019-1799-6
pubmed: 31894144
Lee, C. S. & Lee, A. Y. Clinical applications of continual learning machine learning. Lancet Digit. Health 2, e279–e281 (2020).
doi: 10.1016/S2589-7500(20)30102-3
pubmed: 33328120
pmcid: 8259323
Shen, Y. et al. An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization. Med. Image Anal. 68, 101908 (2021).
doi: 10.1016/j.media.2020.101908
pubmed: 33383334
Vokinger, K. N., Feuerriegel, S. & Kesselheim, A. S. Continual learning in medical devices: FDA’s action plan and beyond. Lancet Digit. Health 3, e337–e338 (2021).
doi: 10.1016/S2589-7500(21)00076-5
pubmed: 33933404
Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of International Conference on Learning Representations (ICLR) (OpenReview.net, 2017).
Leibig, C. et al. Combining the strengths of radiologists and AI for breast cancer screening: a retrospective analysis. Lancet Digit. Health 4, e507–e519 (2022).
doi: 10.1016/S2589-7500(22)00070-X
pubmed: 35750400
pmcid: 9839981
D'Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2022).
Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 (2019).
doi: 10.1038/s41591-019-0447-x
pubmed: 31110349
Mustafa, B. et al. Supervised transfer learning at scale for medical imaging. Preprint at arXiv https://doi.org/10.48550/arXiv.2101.05913 (2021).
Azizi, S. et al. Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision 3478–3488 (IEEE Computer Society, 2021).
Stadnick, B. et al. Meta-repository of screening mammography classifiers. Preprint at arXiv https://doi.org/10.48550/arXiv.2108.04800 (2021).
Habib, S. S. et al. Evaluation of computer aided detection of tuberculosis on chest radiography among people with diabetes in Karachi Pakistan. Sci. Rep. 10, 6276 (2020).
doi: 10.1038/s41598-020-63084-7
pubmed: 32286389
pmcid: 7156514
Subbaswamy, A. & Saria, S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 21, 345–352 (2020).
pubmed: 31742354
Freeman, K. et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 374, n1872 (2021).
doi: 10.1136/bmj.n1872
pubmed: 34470740
pmcid: 8409323
Guidance on Screening and Symptomatic Breast Imaging 4th edn https://www.rcr.ac.uk/system/files/publication/field_publication_files/bfcr199-guidance-on-screening-and-symptomatic-breast-imaging_0.pdf (Royal College of Radiologists, 2019).
European Commission. Use of double reading in mammography screening. https://healthcare-quality.jrc.ec.europa.eu/european-breast-cancer-guidelines/organisation-of-screening-programme/how-mammography-should-be-read (2019).
UK.GOV. Breast screening: quality assurance standards in radiology. https://www.gov.uk/government/publications/breast-screening-quality-assurance-standards-in-radiology (2011).
Sharma, N. et al. Large-scale evaluation of an AI system as an independent reader for double reading in breast cancer screening. Preprint at medRxiv https://doi.org/10.1101/2021.02.26.21252537 (2022).
Janssen, N., Rodriguez-Ruiz, A., Mieskes, C., Karssemeijer, N. & Heywang-Köbrunner, S. H. The potential of AI to replace a first reader in a double reading breast cancer screening program: a feasibility study. ScreenPoint Medical https://screenpoint-medical.com/evidence/the-potential-of-ai-to-replace-a-first-reader-in-a-double-reading-breast-cancer-screening-program-a-feasibility-study/ (2021).
Larsen, M. et al. Artificial intelligence evaluation of 122 969 mammography examinations from a population-based screening program. Radiology https://doi.org/10.1148/radiol.212381 (2022).
Qin, Z. Z. et al. Early user experience and lessons learned using ultra-portable digital X-ray with computer-aided detection (DXR-CAD) products: a qualitative study from the perspective of healthcare providers. PLoS ONE 18, e0277843 (2023).
Gaube, S. et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ Digit. Med. 4, 31 (2021).
doi: 10.1038/s41746-021-00385-9
pubmed: 33608629
pmcid: 7896064
Oakden-Rayner, L. & Palmer, L. Docs are ROCs: a simple off-the-shelf approach for estimating average human performance in diagnostic studies. Preprint at arXiv https://doi.org/10.48550/arXiv.2009.11060 (2020).
Silverman, B. W. Algorithm AS 176: kernel density estimation using the fast Fourier transform. Appl. Stat. 31, 93 (1982).
doi: 10.2307/2347084
Hall, P. & Wand, M. P. On the accuracy of binned kernel density estimators. J. Multivar. Anal. 56, 165–184 (1996).
doi: 10.1006/jmva.1996.0009
Silverman, B. W. Density Estimation for Statistics and Data Analysis (Chapman & Hall, 1986).
Fan, J. & Marron, J. S. Fast implementations of nonparametric curve estimators. J. Comput. Graph. Stat. 3, 35–56 (1994).
Liu, J.-P., Hsueh, H.-M., Hsieh, E. & Chen, J. J. Tests for equivalence or non-inferiority for paired binary data. Stat. Med. 21, 231–245 (2002).
McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153–157 (1947).
doi: 10.1007/BF02295996
pubmed: 20254758
Fagerland, M. W., Lydersen, S. & Laake, P. Recommended tests and confidence intervals for paired binomial proportions. Stat. Med. 33, 2850–2875 (2014).
doi: 10.1002/sim.6148
pubmed: 24648355
Aickin, M. & Gensler, H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am. J. Public Health 86, 726–728 (1996).
doi: 10.2105/AJPH.86.5.726
pubmed: 8629727
pmcid: 1380484
Mozannar, H. & Sontag, D. Consistent estimators for learning to defer to an expert. Preprint at arXiv https://doi.org/10.48550/arXiv.2006.01862 (2021).
Wilder, B., Horvitz, E. & Kamar, E. Learning to complement humans. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (Ed. Bessiere, C.) 1526–1533 (International Joint Conferences on Artificial Intelligence Organization, 2020); https://doi.org/10.24963/ijcai.2020/212
Raghu, M. et al. The algorithmic automation problem: prediction, triage, and human effort. Preprint at arXiv https://doi.org/10.48550/arXiv.1903.12220 (2019).
Charusaie, M.-A., Mozannar, H., Sontag, D. & Samadi, S. Sample efficient learning of predictors that complement humans. In Proceedings of the 39th International Conference on Machine Learning (Eds. Chaudhuri, K. et al.) 2972–3005 (PMLR, 2022).
Narasimhan, H., Jitkrittum, W., Menon, A. K., Rawat, A. S. & Kumar, S. Post-hoc estimators for learning to defer to an expert. In Advances in Neural Information Processing Systems (Eds. Koyejo, S. et al.) 29292–29304 (Curran Associates, 2022).
Kerrigan, G., Smyth, P. & Steyvers, M. Combining human predictions with model probabilities via confusion matrices and calibration. In Advances in Neural Information Processing Systems Vol. 34 (Eds. Ranzato, M. A. et al.) 4421–4434 (Curran Associates, Inc., 2021).
Qin, Z. Z. et al. Tuberculosis detection from chest x-rays for triaging in a high tuberculosis-burden setting: an evaluation of five artificial intelligence algorithms. Lancet Digit. Health 3, e543–e554 (2021).
doi: 10.1016/S2589-7500(21)00116-3
pubmed: 34446265