2021
DOI: 10.3174/ajnr.a7179

Diagnostic Accuracy and Failure Mode Analysis of a Deep Learning Algorithm for the Detection of Cervical Spine Fractures

Abstract: BACKGROUND AND PURPOSE: Artificial intelligence decision support systems are a rapidly growing class of tools to help manage ever-increasing imaging volumes. The aim of this study was to evaluate the performance of an artificial intelligence decision support system, Aidoc, for the detection of cervical spinal fractures on noncontrast cervical spine CT scans and to conduct a failure mode analysis to identify areas of poor performance.

Citations: cited by 58 publications (19 citation statements)
References: 23 publications
“… 21 , 22 In addition to developing medical training programs, improvement of assessment of scans might be achieved by investing in artificial intelligence (AI) that, when proven to have a high sensitivity, can further support assessment of cervical spine CT by emergency physicians. 23 When emergency physicians reach sufficient diagnostic accuracy (with or without AI), it would yield opportunities and flexibility to advance clinical decision‐making before the final radiologist report becomes available.…”
Section: Discussion (mentioning)
confidence: 99%
“…For AI-based medical devices, conducting sanity tests can prevent needless harm to the patient and save considerable resources. However, without sufficiently large, well-annotated datasets, performing analytical validation to determine the root causes that drive AI systems to fail before deployment remains a challenge ( 5 , 35 ). Moreover, after independent testing data is gathered, regulatory organizations advise that the data be used a limited number of times to prevent over-fitting ( 36 ).…”
Section: Methods (mentioning)
confidence: 99%
“…Artificially intelligent (AI) computer-aided diagnostic (CAD) systems have the potential to help radiologists on a multitude of tasks, ranging from tumor classification to improved image reconstruction (1-4). To deploy medical AI systems, it is essential to validate their performance correctly and to understand their weaknesses before being used on patients (5-8). For AI-based software as a medical device, the gold standard for analytical validation is to assess performance on previously unseen independent datasets (9-12), followed by a clinical validation study.…”
Section: Introduction (mentioning)
confidence: 99%
“…All seventeen studies used a CNN to detect and/or classify fractures on CT scans [12-28]. Eight studies addressed detection of rib fractures [13,17,19,20,22,25-27], three studies the performance for detection [12,21] and classification [18] of pelvic fractures, four for detection of spine fractures [14,16,23,28], one for detection and classification of femur fractures [24] and one of calcaneal fractures [15]. Fourteen studies used two output classes (fracture yes/no).…”
Section: Description of Studies (mentioning)
confidence: 99%
“…Eight studies used the F1-score to assess performance instead: in two the F1-score was assessed for the classification of healing status [25,26], in one for displacement [21], and in five [13,18-20,22] for the detection of fractures. Additionally, we calculated the F1-scores in three studies [12,23,28] to facilitate comparison. F1-scores ranged from 0.35 in Yacoub et al [23] to 0.94 in Meng et al [20].…”
Section: Primary Outcome: the Performance of CNN (mentioning)
confidence: 99%
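
The citation statement above mentions calculating F1-scores for studies that did not report them. As a rough illustration only, the sketch below shows how an F1-score can be derived from confusion-matrix counts or from precision and recall. The function names and the example counts are hypothetical and are not taken from any of the cited studies.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 as the harmonic mean of precision (PPV) and recall (sensitivity)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
tp, fp, fn = 80, 20, 20
precision = tp / (tp + fp)  # positive predictive value
recall = tp / (tp + fn)     # sensitivity

f1_a = f1_from_counts(tp, fp, fn)
f1_b = f1_from_precision_recall(precision, recall)
print(round(f1_a, 3), round(f1_b, 3))
```

Both routes give the same value for the same underlying counts, which is why an F1-score can be back-calculated when a study reports only sensitivity and positive predictive value.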