Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians.

Artificial Intelligence; Reproducibility of Results; Triage; Workflow; Humans

Journal

Nature Medicine

ISSN: 1546-170X

Abbreviated title: Nat Med

Country: United States

NLM ID: 9502015

Publication information

Publication date:
07 2023

History:

received: 02 11 2022

accepted: 05 06 2023

medline: 21 7 2023

pubmed: 18 7 2023

entrez: 17 7 2023

Status: ppublish

Abstract

Predictive artificial intelligence (AI) systems based on deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings, but can make errors in cases accurately diagnosed by clinicians and vice versa. We developed Complementarity-Driven Deferral to Clinical Workflow (CoDoC), a system that can learn to decide between the opinion of a predictive AI model and a clinical workflow. CoDoC enhances accuracy relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis (TB). For breast cancer screening, compared to double reading with arbitration in a screening program in the UK, CoDoC reduced false positives by 25% at the same false-negative rate, while achieving a 66% reduction in clinician workload. For TB triaging, compared to standalone AI and clinical workflows, CoDoC achieved a 5-15% reduction in false positives at the same false-negative rate for three of five commercially available predictive AI systems. To facilitate the deployment of CoDoC in novel futuristic clinical settings, we present results showing that CoDoC's performance gains are sustained across several axes of variation (imaging modality, clinical setting and predictive AI system) and discuss the limitations of our evaluation and where further validation would be needed. We provide an open-source implementation to encourage further research and application.
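
The idea of deciding, case by case, between the AI model's opinion and the clinical workflow can be illustrated with a minimal sketch. This is not the authors' implementation or the open-source CoDoC code: the band-based rule, the `Case` fields, the thresholds and the toy data below are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass
class Case:
    ai_score: float  # hypothetical AI confidence that disease is present, in [0, 1]
    clinician: int   # opinion of the clinical workflow (1 = positive, 0 = negative)
    truth: int       # ground-truth label (1 = disease, 0 = no disease)


def decide(case: Case, lower: float, upper: float,
           ai_threshold: float = 0.5) -> Tuple[int, bool]:
    """Return (prediction, deferred).

    Assumed rule: defer to the clinician when the AI score lies in
    [lower, upper]; otherwise threshold the AI score.
    """
    if lower <= case.ai_score <= upper:
        return case.clinician, True
    return int(case.ai_score >= ai_threshold), False


def evaluate(cases: Sequence[Case], lower: float, upper: float) -> Dict[str, int]:
    """Count false positives, false negatives and deferred cases."""
    fp = fn = deferred = 0
    for c in cases:
        pred, was_deferred = decide(c, lower, upper)
        deferred += int(was_deferred)
        fp += int(pred == 1 and c.truth == 0)
        fn += int(pred == 0 and c.truth == 1)
    return {"false_positives": fp, "false_negatives": fn, "deferred": deferred}


# Toy data (invented): the AI and the clinician err on different cases.
CASES = [
    Case(0.05, 0, 0),
    Case(0.10, 1, 0),  # clinician false positive, AI confidently negative
    Case(0.45, 0, 0),
    Case(0.55, 0, 0),  # AI false positive, clinician correct
    Case(0.60, 1, 1),
    Case(0.40, 1, 1),  # AI false negative, clinician correct
    Case(0.95, 1, 1),
    Case(0.90, 0, 1),  # clinician false negative, AI confidently positive
]

ai_only = evaluate(CASES, lower=1.0, upper=0.0)         # empty band: never defer
clinician_only = evaluate(CASES, lower=0.0, upper=1.0)  # full band: always defer
combined = evaluate(CASES, lower=0.3, upper=0.7)        # defer only mid-range scores

print(ai_only, clinician_only, combined)
```

On this invented data, the AI-only and clinician-only baselines each make one false positive and one false negative, while the band rule makes neither and defers four of the eight cases. The numbers only show how complementary errors can be exploited and say nothing about the results reported in the abstract.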

Identifiers

DOI: 10.1038/s41591-023-02437-x PMID: 37460754

pii: 10.1038/s41591-023-02437-x

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

1814-1820
