Meta-repository of screening mammography classifiers

Stadnick, Benjamin; Witowski, Jan; Rajiv, Vishwaesh; Chłędowski, Jakub; Shamout, Farah E.; Cho, Kyunghyun; Geras, Krzysztof J.

doi:10.48550/arxiv.2108.04800

Cited by 5 publications

(11 citation statements)

References 31 publications

“…For the GMIC model, we note a discrepancy on the CMMD result on Table 1 (AUC=81.03) and the published result in [19] (AUC=82.50). This is explained by the different training set and input image setup used by GMIC in [19], so to enable a fair comparison, we present the result by GMIC with the same experimental conditions as all other methods in the Table. Fig. 2 (a) displays the learned non-cancer and cancer prototypes and their source training images.…”

Section: Methods (contrasting)

confidence: 72%

“…It is observed that using DenseNet-121 as backbone exhibits better generalisation results on CMMD than using EfficientNet-B0, which means that DenseNet-121 is more robust against domain shift [6]. For the GMIC model, we note a discrepancy on the CMMD result on Table 1 (AUC=81.03) and the published result in [19] (AUC=82.50). This is explained by the different training set and input image setup used by GMIC in [19], so to enable a fair comparison, we present the result by GMIC with the same experimental conditions as all other methods in the Table. Fig.…”

Section: Methods (mentioning)

confidence: 76%

Knowledge Distillation to Ensemble Global and Interpretable Prototype-Based Mammogram Classification Models

Wang, Chen, Liu, et al. 2022

Lecture Notes in Computer Science

State-of-the-art (SOTA) deep learning mammogram classifiers, trained with weakly-labelled images, often rely on global models that produce predictions with limited interpretability, which is a key barrier to their successful translation into clinical practice. On the other hand, prototype-based models improve interpretability by associating predictions with training image prototypes, but they are less accurate than global models and their prototypes tend to have poor diversity. We address these two issues with the proposal of BRAIxProtoPNet++, which adds interpretability to a global model by ensembling it with a prototype-based model. BRAIxProtoPNet++ distills the knowledge of the global model when training the prototype-based model with the goal of increasing the classification accuracy of the ensemble. Moreover, we propose an approach to increase prototype diversity by guaranteeing that all prototypes are associated with different training images. Experiments on weakly-labelled private and public datasets show that BRAIxProtoPNet++ has higher classification accuracy than SOTA global and prototype-based models. Using lesion localisation to assess model interpretability, we show BRAIxProtoPNet++ is more effective than other prototype-based models and post-hoc explanation of global models. Finally, we show that the diversity of the prototypes learned by BRAIxProtoPNet++ is superior to SOTA prototype-based approaches.

“…When evaluating predictions on the test set, we assessed the breast-wise predictions, similar to [42]. For many of the patients in our datasets, there were two images of each breast, one from each of the Craniocaudal (CC) and Medio-lateral Oblique (MLO) views.…”

Section: Discussion (mentioning)

confidence: 99%

“…Finally, the popular public INBreast [17] dataset was used solely for testing performance across all experiments, allowing a comparison to other studies in the area [42]. This dataset consists of 410 FFDM images taken with a Siemens mammography system.…”

Section: Data and Labels (mentioning)

confidence: 99%

“…Stadnick et al [42] tested several state-of-the-art models [19,23,24,58] on multiple public datasets, including INBreast and CMMD. The authors used only 26% of the INBreast images in their test set, and test on the full CMMD dataset, so we re-evaluated our Artifacted model trained on D HMI train , yielding an AUC-ROC of 0.850 on this INBreast subset and 0.718 on the full labelled CMMD dataset of 3728 images.…”

Section: Comparison To Previous Work (mentioning)

confidence: 99%

A Comparison of Techniques for Class Imbalance in Deep Learning Classification of Breast Cancer

Walsh, Tardy 2022

Diagnostics

Tools based on deep learning models have been created in recent years to aid radiologists in the diagnosis of breast cancer from mammograms. However, the datasets used to train these models may suffer from class imbalance, i.e., there are often fewer malignant samples than benign or healthy cases, which can bias the model towards the healthy class. In this study, we systematically evaluate several popular techniques to deal with this class imbalance, namely, class weighting, over-sampling, and under-sampling, as well as a synthetic lesion generation approach to increase the number of malignant samples. These techniques are applied when training on three diverse Full-Field Digital Mammography datasets, and tested on in-distribution and out-of-distribution samples. The experiments show that a greater imbalance is associated with a greater bias towards the majority class, which can be counteracted by any of the standard class imbalance techniques. On the other hand, these methods provide no benefit to model performance with respect to Area Under the Curve of the Recall Operating Characteristic (AUC-ROC), and indeed under-sampling leads to a reduction of 0.066 in AUC in the case of a 19:1 benign to malignant imbalance. Our synthetic lesion methodology leads to better performance in most cases, with increases of up to 0.07 in AUC on out-of-distribution test sets over the next best experiment.

Enhancing the reliability and accuracy of AI-enabled diagnosis via complementarity-driven deferral to clinicians

Dvijotham, Winkens, Barsbey, et al. 2023

Nat Med

Self Cite

Diagnostic AI systems trained using deep learning have been shown to achieve expert-level identification of diseases in multiple medical imaging settings [1,2]. However, such systems are not always reliable and can fail in cases diagnosed accurately by clinicians and vice versa [3]. Mechanisms for leveraging this complementarity by learning to select optimally between discordant decisions of AIs and clinicians have remained largely unexplored in healthcare [4], yet have the potential to achieve levels of performance that exceed that possible from either AI or clinician alone [4]. We develop a Complementarity-driven Deferral-to-Clinical Workflow (CoDoC) system that can learn to decide when to rely on a diagnostic AI model and when to defer to a clinician or their workflow. We show that our system is compatible with diagnostic AI models from multiple manufacturers, obtaining enhanced accuracy (sensitivity and/or specificity) relative to clinician-only or AI-only baselines in clinical workflows that screen for breast cancer or tuberculosis. For breast cancer, we demonstrate the first system that exceeds the accuracy of double-reading with arbitration (the "gold standard" of care) in a large representative UK screening program, with 25% reduction in false positives despite equivalent true-positive detection, while achieving a 66% reduction in clinical workload. In two separate US datasets, CoDoC exceeds the accuracy of single-reading by board certified radiologists and two different standalone state-of-the-art AI systems, with generalisation of this finding in different diagnostic AI manufacturers. For TB screening with chest X-rays, CoDoC improved specificity (while maintaining sensitivity) compared to standalone AI or clinicians for 3 of 5 commercially available diagnostic AI systems (5-15% reduction in false positives).
Further, we show the limits of confidence score based deferral systems for medical AI, by demonstrating that no deferral strategy could have achieved significant improvement on the remaining two diagnostic AI systems. Our comprehensive assessment demonstrates that the superiority of CoDoC is sustained in multiple realistic stress tests for generalisation of medical AI tools along four axes: variation in the medical imaging modality; variation in clinical settings and human experts; different clinical deferral pathways within a given modality; and different AI softwares. Further, given the simplicity of CoDoC we believe that practitioners can easily adapt it and we provide an open-source implementation to encourage widespread further research and application.
