Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians.


Journal

Nature medicine
ISSN: 1546-170X
Abbreviated title: Nat Med
Country: United States
NLM ID: 9502015

Publication information

Publication date:
07 2023
History:
received: 02 11 2022
accepted: 05 06 2023
medline: 21 7 2023
pubmed: 18 7 2023
entrez: 17 7 2023
Status: ppublish

Abstract

Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5-15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel futuristic clinical settings, we present results showing that CoDoC's performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.

Identifiers

pubmed: 37460754
doi: 10.1038/s41591-023-02437-x
pii: 10.1038/s41591-023-02437-x

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

1814-1820

Comments and corrections

Type: CommentIn

Copyright information

© 2023. The Author(s), under exclusive licence to Springer Nature America, Inc.

References

Ruamviboonsuk, P. et al. Deep learning versus human graders for classifying diabetic retinopathy severity in a nationwide screening program. NPJ Digit. Med. 2, 25 (2019).
doi: 10.1038/s41746-019-0099-8 pmcid: 6550283
McKinney, S. M. et al. International evaluation of an AI system for breast cancer screening. Nature 577, 89–94 (2020).
doi: 10.1038/s41586-019-1799-6 pubmed: 31894144
Lee, C. S. & Lee, A. Y. Clinical applications of continual learning machine learning. Lancet Digit. Health 2, e279–e281 (2020).
doi: 10.1016/S2589-7500(20)30102-3 pubmed: 33328120 pmcid: 8259323
Shen, Y. et al. An interpretable classifier for high-resolution breast cancer screening images utilizing weakly supervised localization. Med. Image Anal. 68, 101908 (2021).
doi: 10.1016/j.media.2020.101908 pubmed: 33383334
Vokinger, K. N., Feuerriegel, S. & Kesselheim, A. S. Continual learning in medical devices: FDA’s action plan and beyond. Lancet Digit. Health 3, e337–e338 (2021).
doi: 10.1016/S2589-7500(21)00076-5 pubmed: 33933404
Hendrycks, D. & Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of International Conference on Learning Representations (ICLR) (OpenReview.net, 2017).
Leibig, C. et al. Combining the strengths of radiologists and AI for breast cancer screening: a retrospective analysis. Lancet Digit. Health 4, e507–e519 (2022).
doi: 10.1016/S2589-7500(22)00070-X pubmed: 35750400 pmcid: 9839981
D'Amour, A. et al. Underspecification presents challenges for credibility in modern machine learning. J. Mach. Learn. Res. 23, 1–61 (2022).
Ardila, D. et al. End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat. Med. 25, 954–961 (2019).
doi: 10.1038/s41591-019-0447-x pubmed: 31110349
Mustafa, B. et al. Supervised transfer learning at scale for medical imaging. Preprint at arXiv https://doi.org/10.48550/arXiv.2101.05913 (2021).
Azizi, S. et al. Big self-supervised models advance medical image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision 3478–3488 (IEEE Computer Society, 2021).
Stadnick, B. et al. Meta-repository of screening mammography classifiers. Preprint at arXiv https://doi.org/10.48550/arXiv.2108.04800 (2021).
Habib, S. S. et al. Evaluation of computer aided detection of tuberculosis on chest radiography among people with diabetes in Karachi Pakistan. Sci. Rep. 10, 6276 (2020).
doi: 10.1038/s41598-020-63084-7 pubmed: 32286389 pmcid: 7156514
Subbaswamy, A. & Saria, S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 21, 345–352 (2020).
pubmed: 31742354
Freeman, K. et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 374, n1872 (2021).
doi: 10.1136/bmj.n1872 pubmed: 34470740 pmcid: 8409323
Guidance on Screening and Symptomatic Breast Imaging 4th edn https://www.rcr.ac.uk/system/files/publication/field_publication_files/bfcr199-guidance-on-screening-and-symptomatic-breast-imaging_0.pdf (Royal College of Radiology, 2019).
European Commission. Use of double reading in mammography screening. https://healthcare-quality.jrc.ec.europa.eu/european-breast-cancer-guidelines/organisation-of-screening-programme/how-mammography-should-be-read (2019).
UK.GOV. Breast screening: quality assurance standards in radiology. https://www.gov.uk/government/publications/breast-screening-quality-assurance-standards-in-radiology (2011).
Sharma, N. et al. Large-scale evaluation of an AI system as an independent reader for double reading in breast cancer screening. Preprint at medRxiv https://doi.org/10.1101/2021.02.26.21252537 (2022).
Janssen, N., Rodriguez-Ruiz, A., Mieskes, C., Karssemeijer, N. & Heywang-Köbrunner, S. H. The potential of AI to replace a first reader in a double reading breast cancer screening program: a feasibility study. ScreenPoint Medical https://screenpoint-medical.com/evidence/the-potential-of-ai-to-replace-a-first-reader-in-a-double-reading-breast-cancer-screening-program-a-feasibility-study/ (2021).
Larsen, M. et al. Artificial intelligence evaluation of 122 969 mammography examinations from a population-based screening program. Radiology https://doi.org/10.1148/radiol.212381 (2022).
Qin, Z. Z. et al. Early user experience and lessons learned using ultra-portable digital X-ray with computer-aided detection (DXR-CAD) products: a qualitative study from the perspective of healthcare providers. PLoS ONE 18, e0277843 (2023).
Gaube, S. et al. Do as AI say: susceptibility in deployment of clinical decision-aids. NPJ Digit. Med. 4, 31 (2021).
doi: 10.1038/s41746-021-00385-9 pubmed: 33608629 pmcid: 7896064
Oakden-Rayner, L. & Palmer, L. Docs are ROCs: a simple off-the-shelf approach for estimating average human performance in diagnostic studies. Preprint at arXiv https://doi.org/10.48550/arXiv.2009.11060 (2020).
Silverman, B. W. Algorithm AS 176: kernel density estimation using the fast Fourier transform. Appl. Stat. 31, 93 (1982).
doi: 10.2307/2347084
Hall, P. & Wand, M. P. On the accuracy of binned kernel density estimators. J. Multivar. Anal. 56, 165–184 (1996).
doi: 10.1006/jmva.1996.0009
Silverman, B. W. Density Estimation for Statistics and Data Analysis (Chapman & Hall, 1986).
Fan, J. & Marron, J. S. Fast implementations of nonparametric curve estimators. J. Comput. Graph. Stat. 3, 35–56 (1994).
Liu, J.-P., Hsueh, H.-M., Hsieh, E. & Chen, J. J. Tests for equivalence or non-inferiority for paired binary data. Stat. Med. 21, 231–245 (2002).
McNemar, Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153–157 (1947).
doi: 10.1007/BF02295996 pubmed: 20254758
Fagerland, M. W., Lydersen, S. & Laake, P. Recommended tests and confidence intervals for paired binomial proportions. Stat. Med. 33, 2850–2875 (2014).
doi: 10.1002/sim.6148 pubmed: 24648355
Aickin, M. & Gensler, H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am. J. Public Health 86, 726–728 (1996).
doi: 10.2105/AJPH.86.5.726 pubmed: 8629727 pmcid: 1380484
Mozannar, H. & Sontag, D. Consistent estimators for learning to defer to an expert. Preprint at arXiv https://doi.org/10.48550/arXiv.2006.01862 (2021).
Wilder, B., Horvitz, E. & Kamar, E. Learning to complement humans. In Proceedings of the 29th International Joint Conference on Artificial Intelligence (Ed. Bessiere, C.) 1526–1533 (International Joint Conferences on Artificial Intelligence Organization, 2020); https://doi.org/10.24963/ijcai.2020/212
Raghu, M. et al. The algorithmic automation problem: prediction, triage, and human effort. Preprint at arXiv https://doi.org/10.48550/arXiv.1903.12220 (2019).
Charusaie, M.-A., Mozannar, H., Sontag, D. & Samadi, S. Sample efficient learning of predictors that complement humans. In Proceedings of the 39th International Conference on Machine Learning (Eds. Chaudhuri, K. et al.) 2972–3005 (PMLR, 2022).
Narasimhan, H., Jitkrittum, W., Menon, A. K., Rawat, A. S. & Kumar, S. Post-hoc estimators for learning to defer to an expert. In Advances in Neural Information Processing Systems (Eds. Koyejo, S. et al.) 29292–29304 (Curran Associates, 2022).
Kerrigan, G., Smyth, P. & Steyvers, M. Combining human predictions with model probabilities via confusion matrices and calibration. In Advances in Neural Information Processing Systems Vol. 34 (Eds. Ranzato, M. A. et al.) 4421–4434 (Curran Associates, Inc., 2021).
Qin, Z. Z. et al. Tuberculosis detection from chest x-rays for triaging in a high tuberculosis-burden setting: an evaluation of five artificial intelligence algorithms. Lancet Digit. Health 3, e543–e554 (2021).
doi: 10.1016/S2589-7500(21)00116-3 pubmed: 34446265

Authors

Krishnamurthy Dj Dvijotham (KD)

Google DeepMind, Mountain View, CA, USA. dvij@cs.washington.edu.

Jim Winkens (J)

Google Research, New York, NY, USA. jimwinkens@google.com.

Melih Barsbey (M)

Bogazici University, Istanbul, Turkey.

Sumedh Ghaisas (S)

Google DeepMind, London, UK.

Robert Stanforth (R)

Google DeepMind, London, UK.

Nick Pawlowski (N)

Microsoft Research, Cambridge, UK.

Patricia Strachan (P)

Google Research, London, UK.

Zahra Ahmed (Z)

Google DeepMind, London, UK.

Shekoofeh Azizi (S)

Google DeepMind, Toronto, Ontario, Canada.

Yoram Bachrach (Y)

Google DeepMind, London, UK.

Laura Culp (L)

Google DeepMind, Toronto, Ontario, Canada.

Mayank Daswani (M)

Google Research, London, UK.

Jan Freyberg (J)

Google Research, London, UK.

Atilla Kiraly (A)

Google Research, Palo Alto, CA, USA.

Timo Kohlberger (T)

Google Research, Palo Alto, CA, USA.

Scott McKinney (S)

OpenAI, San Francisco, CA, USA.

Basil Mustafa (B)

Google DeepMind, Zurich, Switzerland.

Vivek Natarajan (V)

Google Research, Palo Alto, CA, USA.

Krzysztof Geras (K)

NYU Grossman School of Medicine, New York, NY, USA.

Jan Witowski (J)

NYU Grossman School of Medicine, New York, NY, USA.

Zhi Zhen Qin (ZZ)

Stop TB Partnership, Geneva, Switzerland.

Jacob Creswell (J)

Stop TB Partnership, Geneva, Switzerland.

Shravya Shetty (S)

Google Research, Palo Alto, CA, USA.

Marcin Sieniek (M)

Google Research, Palo Alto, CA, USA.

Terry Spitz (T)

Google Research, London, UK.

Greg Corrado (G)

Google Research, Palo Alto, CA, USA.

Pushmeet Kohli (P)

Google DeepMind, London, UK.

Taylan Cemgil (T)

Google DeepMind, London, UK.

Alan Karthikesalingam (A)

Google Research, London, UK.
