Is Deep Learning On Par with Human Observers for Detection of Radiographically Visible and Occult Fractures of the Scaphoid?

Adult Deep Learning Female Fractures, Closed / diagnostic imaging Humans Male Middle Aged Neural Networks, Computer Observation Orthopedic Surgeons Radiography Reproducibility of Results Retrospective Studies Scaphoid Bone / diagnostic imaging Young Adult

Journal

Clinical orthopaedics and related research

ISSN: 1528-1132

Titre abrégé: Clin Orthop Relat Res

Pays: United States

ID NLM: 0075674

Informations de publication

Date de publication:
11 2020

Historique:

pubmed: 27 5 2020

medline: 25 5 2021

entrez: 27 5 2020

Statut: ppublish

Résumé

Preliminary experience suggests that deep learning algorithms are nearly as good as humans in detecting common, displaced, and relatively obvious fractures (such as, distal radius or hip fractures). However, it is not known whether this also is true for subtle or relatively nondisplaced fractures that are often difficult to see on radiographs, such as scaphoid fractures. (1) What is the diagnostic accuracy, sensitivity, and specificity of a deep learning algorithm in detecting radiographically visible and occult scaphoid fractures using four radiographic imaging views? (2) Does adding patient demographic (age and sex) information improve the diagnostic performance of the deep learning algorithm? (3) Are orthopaedic surgeons better at diagnostic accuracy, sensitivity, and specificity compared with deep learning? (4) What is the interobserver reliability among five human observers and between human consensus and deep learning algorithm? We retrospectively searched the picture archiving and communication system (PACS) to identify 300 patients with a radiographic scaphoid series, until we had 150 fractures (127 visible on radiographs and 23 only visible on MRI) and 150 non-fractures with a corresponding CT or MRI as the reference standard for fracture diagnosis. At our institution, MRIs are usually ordered for patients with scaphoid tenderness and normal radiographs, and a CT with radiographically visible scaphoid fracture. We used a deep learning algorithm (a convolutional neural network [CNN]) for automated fracture detection on radiographs. Deep learning, an advanced subset of artificial intelligence, combines artificial neuronal layers to resemble a neuron cell. CNNs-essentially deep learning algorithms resembling interconnected neurons in the human brain-are most commonly used for image analysis. Area under the receiver operating characteristic curve (AUC) was used to evaluate the algorithm's diagnostic performance. An AUC of 1.0 would indicate perfect prediction, whereas 0.5 would indicate that a prediction is no better than a flip of a coin. The probability of a scaphoid fracture generated by the CNN, sex, and age were included in a multivariable logistic regression to determine whether this would improve the algorithm's diagnostic performance. Diagnostic performance characteristics (accuracy, sensitivity, and specificity) and reliability (kappa statistic) were calculated for the CNN and for the five orthopaedic surgeon observers in our study. The algorithm had an AUC of 0.77 (95% CI 0.66 to 0.85), 72% accuracy (95% CI 60% to 84%), 84% sensitivity (95% CI 0.74 to 0.94), and 60% specificity (95% CI 0.46 to 0.74). Adding age and sex did not improve diagnostic performance (AUC 0.81 [95% CI 0.73 to 0.89]). Orthopaedic surgeons had better specificity (0.93 [95% CI 0.93 to 0.99]; p < 0.01), while accuracy (84% [95% CI 81% to 88%]) and sensitivity (0.76 [95% CI 0.70 to 0.82]; p = 0.29) did not differ between the algorithm and human observers. Although the CNN was less specific in diagnosing relatively obvious fractures, it detected five of six occult scaphoid fractures that were missed by all human observers. The interobserver reliability among the five surgeons was substantial (Fleiss' kappa = 0.74 [95% CI 0.66 to 0.83]), but the reliability between the algorithm and human observers was only fair (Cohen's kappa = 0.34 [95% CI 0.17 to 0.50]). Initial experience with our deep learning algorithm suggests that it has trouble identifying scaphoid fractures that are obvious to human observers. Thirteen false positive suggestions were made by the CNN, which were correctly detected by the five surgeons. Research with larger datasets-preferably also including information from physical examination-or further algorithm refinement is merited. Level III, diagnostic study.

Sections du résumé

BACKGROUND

QUESTIONS/PURPOSES

(1) What is the diagnostic accuracy, sensitivity, and specificity of a deep learning algorithm in detecting radiographically visible and occult scaphoid fractures using four radiographic imaging views? (2) Does adding patient demographic (age and sex) information improve the diagnostic performance of the deep learning algorithm? (3) Are orthopaedic surgeons better at diagnostic accuracy, sensitivity, and specificity compared with deep learning? (4) What is the interobserver reliability among five human observers and between human consensus and deep learning algorithm?

METHODS

We retrospectively searched the picture archiving and communication system (PACS) to identify 300 patients with a radiographic scaphoid series, until we had 150 fractures (127 visible on radiographs and 23 only visible on MRI) and 150 non-fractures with a corresponding CT or MRI as the reference standard for fracture diagnosis. At our institution, MRIs are usually ordered for patients with scaphoid tenderness and normal radiographs, and a CT with radiographically visible scaphoid fracture. We used a deep learning algorithm (a convolutional neural network [CNN]) for automated fracture detection on radiographs. Deep learning, an advanced subset of artificial intelligence, combines artificial neuronal layers to resemble a neuron cell. CNNs-essentially deep learning algorithms resembling interconnected neurons in the human brain-are most commonly used for image analysis. Area under the receiver operating characteristic curve (AUC) was used to evaluate the algorithm's diagnostic performance. An AUC of 1.0 would indicate perfect prediction, whereas 0.5 would indicate that a prediction is no better than a flip of a coin. The probability of a scaphoid fracture generated by the CNN, sex, and age were included in a multivariable logistic regression to determine whether this would improve the algorithm's diagnostic performance. Diagnostic performance characteristics (accuracy, sensitivity, and specificity) and reliability (kappa statistic) were calculated for the CNN and for the five orthopaedic surgeon observers in our study.

RESULTS

The algorithm had an AUC of 0.77 (95% CI 0.66 to 0.85), 72% accuracy (95% CI 60% to 84%), 84% sensitivity (95% CI 0.74 to 0.94), and 60% specificity (95% CI 0.46 to 0.74). Adding age and sex did not improve diagnostic performance (AUC 0.81 [95% CI 0.73 to 0.89]). Orthopaedic surgeons had better specificity (0.93 [95% CI 0.93 to 0.99]; p < 0.01), while accuracy (84% [95% CI 81% to 88%]) and sensitivity (0.76 [95% CI 0.70 to 0.82]; p = 0.29) did not differ between the algorithm and human observers. Although the CNN was less specific in diagnosing relatively obvious fractures, it detected five of six occult scaphoid fractures that were missed by all human observers. The interobserver reliability among the five surgeons was substantial (Fleiss' kappa = 0.74 [95% CI 0.66 to 0.83]), but the reliability between the algorithm and human observers was only fair (Cohen's kappa = 0.34 [95% CI 0.17 to 0.50]).

CONCLUSIONS

Initial experience with our deep learning algorithm suggests that it has trouble identifying scaphoid fractures that are obvious to human observers. Thirteen false positive suggestions were made by the CNN, which were correctly detected by the five surgeons. Research with larger datasets-preferably also including information from physical examination-or further algorithm refinement is merited.

LEVEL OF EVIDENCE

Level III, diagnostic study.

Identifiants

DOI: 10.1097/CORR.0000000000001318 PMID: 32452927 PMC: PMC7571968

pubmed: 32452927

doi: 10.1097/CORR.0000000000001318

pmc: PMC7571968

pii: 00003086-202011000-00039

doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

Pagination

2653-2659

Commentaires et corrections

Type : CommentIn

Références

Injury. 2010 Oct;41(10):1026-30

pubmed: 20510414

Lancet. 2018 Dec 1;392(10162):2388-2396

pubmed: 30318264

Nature. 2017 Feb 2;542(7639):115-118

pubmed: 28117445

Acta Orthop. 2019 Aug;90(4):394-400

pubmed: 30942136

Skeletal Radiol. 2019 Feb;48(2):239-244

pubmed: 29955910

Acta Orthop. 2018 Aug;89(4):468-473

pubmed: 29577791

Clin Orthop Relat Res. 2019 Nov;477(11):2482-2491

pubmed: 31283727

Acta Orthop. 2017 Dec;88(6):581-586

pubmed: 28681679

Proc Natl Acad Sci U S A. 2018 Nov 6;115(45):11591-11596

pubmed: 30348771

Clin Radiol. 2018 May;73(5):439-445

pubmed: 29269036

Biometrics. 1977 Mar;33(1):159-74

pubmed: 843571

J Bone Joint Surg Br. 2012 Jul;94(7):961-8

pubmed: 22733954

Is Deep Learning On Par with Human Observers for Detection of Radiographically Visible and Occult Fractures of the Scaphoid?

Journal

Informations de publication

Résumé

Sections du résumé

Identifiants

Types de publication

Langues

Sous-ensembles de citation

Pagination

Commentaires et corrections

Références

Auteurs

David W G Langerhuizen (DWG)

Anne Eva J Bulstra (AEJ)

Stein J Janssen (SJ)

David Ring (D)

Gino M M J Kerkhoffs (GMMJ)

Ruurd L Jaarsma (RL)

Job N Doornberg (JN)

Articles similaires

[Redispensing of expensive oral anticancer medicines: a practical application].

Smoking Cessation and Incident Cardiovascular Disease.

Evaluation of Low-Value Services Across Major Medicare Advantage Insurers and Traditional Medicare.

Effectiveness of Virtual Yoga for Chronic Low Back Pain: A Randomized Clinical Trial.

Classifications MeSH