2021
DOI: 10.1038/s41746-020-00380-6
Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models

Abstract: Artificial intelligence models match or exceed dermatologists in melanoma image classification. Less is known about their robustness against real-world variations, and clinicians may incorrectly assume that a model with an acceptable area under the receiver operating characteristic curve or related performance metric is ready for clinical use. Here, we systematically assessed the performance of dermatologist-level convolutional neural networks (CNNs) on real-world non-curated images by applying computational “…
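The abstract is truncated, but the general idea of a computational stress test — perturb an input the way real-world imaging conditions would, then check how much the model's output moves — can be sketched as follows. This is an illustrative sketch only, not the paper's method: `perturb`, `stress_test`, and the toy mean-intensity "model" are all hypothetical stand-ins.

```python
# Hypothetical sketch of an image-perturbation stress test: apply simple
# photographic perturbations and measure the largest shift in the model's
# predicted probability. All names here are illustrative, not the paper's code.
import numpy as np

def perturb(image, kind):
    """Apply a simple real-world-style perturbation to an HxWxC float image."""
    if kind == "brightness":
        return np.clip(image * 1.3, 0.0, 1.0)        # overexposure
    if kind == "rotate":
        return np.rot90(image)                        # camera orientation
    if kind == "noise":
        rng = np.random.default_rng(0)
        return np.clip(image + rng.normal(0, 0.05, image.shape), 0.0, 1.0)
    raise ValueError(kind)

def stress_test(predict, image, kinds=("brightness", "rotate", "noise")):
    """Largest change in predicted probability across the perturbations."""
    base = predict(image)
    return max(abs(predict(perturb(image, k)) - base) for k in kinds)

# Toy "model": mean pixel intensity as a stand-in melanoma score.
toy_predict = lambda img: float(img.mean())
img = np.full((8, 8, 3), 0.5)
delta = stress_test(toy_predict, img)  # large delta = fragile prediction
```

A model whose predictions swing widely under such innocuous perturbations may look fine on a curated test set yet fail in clinic, which is the gap the paper's title refers to.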

Cited by 32 publications (17 citation statements) | References 30 publications
“…5 Despite the rapid development of AI technology, AI has been implemented in few real-world settings because of the practical challenges of implementation 6 and the absence of validation using metrics other than accuracy; metrics such as calibration and robustness are rarely calculated. 7 To aid AI implementation, randomised trials for interventions involving AI should follow the recently updated Consolidated Standards of Reporting Trials-AI 8 (updated in 2019) and Standard Protocol Items: Recommendations for Interventional Trials-AI 9 (updated in 2020) guidelines to promote transparency and completeness. Moreover, there is a need to better understand the perspective of patients, who have the most at stake.…”
Section: Introduction
confidence: 99%
“…For example, one AI algorithm had excellent performance distinguishing between nevus and melanoma as long as good-quality pictures were provided. 28 In this case, providing doctors explanations that the algorithm only functions within these parameters allows them to evaluate its functioning for the general clinical scenario and specific clinical circumstances. These explanations will facilitate doctors’ understanding and justify disregarding AI results, for example, when a picture has lower quality.…”
Section: Considering Explainability In Healthcare
confidence: 99%
“…By using multiple, well-designed stress tests, modelers can ensure that the produced model is broadly generalizable. Stress testing has already been shown to rule out spurious models in other domains such as dermatology and natural language processing (68,69). D'Amour et al (41) considered three types of stress tests: shifted performance evaluation, contrastive evaluation, and stratified performance evaluation (Fig 2).…”
Section: Overcoming Underspecification With Stress Tests
confidence: 99%
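Of the three stress-test types named in the excerpt above, stratified performance evaluation is the simplest to illustrate: score the model separately on meaningful subgroups of the test set, so that a failure hidden inside the aggregate metric becomes visible. The sketch below is a minimal illustration with toy data; the group labels and numbers are invented, not drawn from any of the cited papers.

```python
# Minimal sketch of stratified performance evaluation: report accuracy
# overall and per subgroup (e.g., curated benchmark images vs. non-curated
# clinic images). All data below are toy values for illustration.
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def stratified_accuracy(y_true, y_pred, groups):
    """Accuracy overall and within each stratum of `groups`."""
    out = {"overall": accuracy(y_true, y_pred)}
    for g in sorted(set(groups)):
        mask = np.asarray(groups) == g
        out[g] = accuracy(np.asarray(y_true)[mask], np.asarray(y_pred)[mask])
    return out

# Toy example: aggregate accuracy looks tolerable, but the "clinic"
# stratum is far worse than the "curated" one.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
groups = ["curated"] * 4 + ["clinic"] * 4
report = stratified_accuracy(y_true, y_pred, groups)
```

Here the overall accuracy (0.625) masks a perfect score on curated images and a near-chance score on clinic images — exactly the kind of underspecification a single aggregate metric cannot reveal.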