Purpose: Understanding an artificial intelligence (AI) model’s ability to generalize to its target population is critical to ensuring the safe and effective use of AI in medical devices. A traditional generalizability assessment relies on the availability of large, diverse datasets, which are difficult to obtain in many medical imaging applications. We present an approach for enhanced generalizability assessment by examining the decision space beyond the available testing data distribution.

Approach: Vicinal distributions of virtual samples are generated by interpolating between triplets of test images. The generated virtual samples leverage the characteristics already in the test set, increasing the sample diversity while remaining close to the AI model’s data manifold. We demonstrate the generalizability assessment approach on the non-clinical tasks of classifying patient sex, race, COVID status, and age group from chest x-rays.

Results: Decision region composition analysis for generalizability indicated that a disproportionately large portion of the decision space belonged to a single “preferred” class for each task, despite comparable performance on the evaluation dataset. Evaluation using cross-reactivity and population shift strategies indicated a tendency to overpredict samples as belonging to the preferred class (e.g., COVID negative) for patients whose subgroup was not represented in the model development data.

Conclusions: An analysis of an AI model’s decision space has the potential to provide insight into model generalizability. Our approach uses an analysis of the composition of the decision space to obtain an improved assessment of model generalizability in the case of limited test data.
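To make the interpolation step in the Approach concrete, the following is a minimal sketch of generating virtual samples from a triplet of test images via convex (barycentric) combinations in pixel space. The function name, the Dirichlet sampling of the interpolation weights, and pixel-space mixing are illustrative assumptions; the abstract does not specify the exact sampling scheme or whether interpolation is performed in image or latent space.

```python
import numpy as np

def generate_virtual_samples(img_a, img_b, img_c, n_samples=10, rng=None):
    """Generate virtual samples by convex interpolation of a triplet of test images.

    Illustrative sketch: barycentric weights are drawn from a flat Dirichlet
    distribution (an assumption; the authors' exact scheme may differ), so every
    virtual sample lies inside the triangle spanned by the three test images.
    """
    rng = np.random.default_rng() if rng is None else rng
    imgs = np.stack([img_a, img_b, img_c]).astype(np.float32)   # shape (3, H, W)
    weights = rng.dirichlet(alpha=np.ones(3), size=n_samples)   # shape (n_samples, 3), rows sum to 1
    # Each row of `weights` yields one virtual image as a weighted sum of the triplet.
    virtual = np.tensordot(weights, imgs, axes=([1], [0]))      # shape (n_samples, H, W)
    return virtual, weights

# Example: three 224x224 arrays stand in for real chest x-ray test images.
a, b, c = (np.random.rand(224, 224) for _ in range(3))
samples, w = generate_virtual_samples(a, b, c, n_samples=5)
print(samples.shape, w.sum(axis=1))  # (5, 224, 224); each weight vector sums to 1
```

Classifying these virtual samples with the model under study and tallying the predicted labels is one way to estimate the decision region composition described in the Results.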