Differential Biases and Variabilities of Deep Learning-Based Artificial Intelligence and Human Experts in Clinical Diagnosis: Retrospective Cohort and Survey Study.

artificial intelligence; computer-aided diagnosis; convolutional neural network; deep learning; class imbalance problem; eardrum; human-machine cooperation; otology; otoscopy

Journal

JMIR Medical Informatics

ISSN: 2291-9694

Abbreviated title: JMIR Med Inform

Country: Canada

NLM ID: 101645109

Publication information

Publication date:
08 Dec 2021

History:

received: 22 Aug 2021

revised: 29 Sep 2021

accepted: 12 Oct 2021

entrez: 10 Dec 2021

pubmed: 11 Dec 2021

medline: 11 Dec 2021

Status: epublish

Abstract

BACKGROUND

Deep learning (DL)-based artificial intelligence may have diagnostic characteristics that differ from those of human experts in medical diagnosis. Because DL is a data-driven knowledge system, heterogeneous population incidence in the clinical world is thought to bias DL more than it biases clinicians. Conversely, because human experts each experience a limited number of cases, they may exhibit large interindividual variability. Understanding how the 2 groups classify the same data differently is therefore an essential step toward the cooperative use of DL in clinical applications.

OBJECTIVE

This study aimed to evaluate and compare the differential effects of clinical experience on otoendoscopic image diagnosis in both computers and physicians, as exemplified by the class imbalance problem, and to guide clinicians in using decision support systems.

METHODS

We used digital otoendoscopic images of patients who visited the outpatient clinic of the Department of Otorhinolaryngology at Severance Hospital, Seoul, South Korea, from January 2013 to June 2019, for a total of 22,707 otoendoscopic images. After excluding similar images, we selected 7500 otoendoscopic images for labeling. We built a DL-based image classification model to classify a given image into 6 disease categories. Two test sets of 300 images were constructed: one balanced and one imbalanced. We included 14 clinicians (otolaryngologists and nonotolaryngology specialists, including general practitioners) and 13 DL-based models. We used accuracy (overall and per-class) and kappa statistics to compare the results of individual physicians and the ML models.

RESULTS

Our ML models had consistently high accuracies (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%), equivalent to those of otolaryngologists (balanced: mean 71.17%, SD 3.37%; imbalanced: mean 72.84%, SD 6.41%) and far better than those of nonotolaryngologists (balanced: mean 45.63%, SD 7.89%; imbalanced: mean 44.08%, SD 15.83%). However, ML models suffered from class imbalance problems (balanced test set: mean 77.14%, SD 1.83%; imbalanced test set: mean 82.03%, SD 3.06%). Data augmentation mitigated this, particularly for low-incidence classes, but rare disease classes still had low per-class accuracies. Human physicians, although less affected by prevalence, showed high interphysician variability (ML models: kappa=0.83, SD 0.02; otolaryngologists: kappa=0.60, SD 0.07).

CONCLUSIONS

Although ML models deliver excellent performance in classifying ear disease, physicians and ML models each have their own strengths. ML models have consistent and high accuracy but consider only the given image and show bias toward prevalence, whereas human physicians have varying performance but do not show bias toward prevalence and may also consider information beyond the images. To deliver the best patient care despite the shortage of otolaryngologists, our ML model can serve a cooperative role for clinicians with diverse expertise, as long as it is kept in mind that models consider only images and could be biased toward prevalent diseases even after data augmentation.
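The abstract compares readers by overall accuracy, per-class accuracy, and kappa. The following is a minimal sketch of how such metrics can be computed, not the authors' code. It assumes that per-class accuracy means the fraction of images of each true class that were labeled correctly, and that the kappa reported with a mean and SD is Cohen's kappa averaged over pairs of readers; the abstract confirms neither. The class labels `A`, `B`, `C` and the two readers are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations


def overall_accuracy(y_true, y_pred):
    """Fraction of all images labeled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_accuracy(y_true, y_pred):
    """For each true class, the fraction of its images labeled correctly.

    Assumption: this is the per-class recall; the abstract does not
    define "per-class accuracy" precisely.
    """
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / n for c, n in total.items()}


def cohen_kappa(a, b):
    """Cohen's kappa: agreement between two readers, corrected for chance."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    p_expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if p_expected == 1:  # both readers always use the same single label
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


def mean_pairwise_kappa(readers):
    """Average Cohen's kappa over every pair of readers (an assumption)."""
    pairs = list(combinations(readers, 2))
    return sum(cohen_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical toy data: class A is prevalent, class C is rare.
truth = ["A", "A", "A", "A", "B", "B", "C", "C"]
reader_1 = ["A", "A", "A", "A", "B", "B", "A", "C"]
reader_2 = ["A", "A", "A", "B", "B", "B", "A", "A"]

acc_1 = overall_accuracy(truth, reader_1)
per_class_1 = per_class_accuracy(truth, reader_1)
kappa_12 = cohen_kappa(reader_1, reader_2)
```

In this toy data, `reader_1` has a high overall accuracy while getting only half of the rare class `C` right, which is the kind of gap between overall and per-class accuracy that a prevalence-biased classifier can hide on an imbalanced test set.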


Identifiers

DOI: 10.2196/33049

PMID: 34889764

PMC: PMC8701703

PII: v9i12e33049

Publication types

Journal Article

Languages

English

Pagination

e33049

Copyright information

©Dongchul Cha, Chongwon Pae, Se A Lee, Gina Na, Young Kyun Hur, Ho Young Lee, A Ra Cho, Young Joon Cho, Sang Gil Han, Sung Huhn Kim, Jae Young Choi, Hae-Jeong Park. Originally published in JMIR Medical Informatics (https://medinform.jmir.org), 08.12.2021.

Authors

Dongchul Cha (D)

Chongwon Pae (C)

Se A Lee (SA)

Gina Na (G)

Young Kyun Hur (YK)

Ho Young Lee (HY)

A Ra Cho (AR)

Young Joon Cho (YJ)

Sang Gil Han (SG)

Sung Huhn Kim (SH)

Jae Young Choi (JY)

Hae-Jeong Park (HJ)

MeSH classifications