2023
DOI: 10.1186/s13000-023-01355-3

Biased data, biased AI: deep networks predict the acquisition site of TCGA images

Abstract: Background: Deep learning models applied to healthcare applications, including digital pathology, have been increasing their scope and importance in recent years. Many of these models have been trained on The Cancer Genome Atlas (TCGA) atlas of digital images, or use it as a validation source. One crucial factor that seems to have been widely ignored is the internal bias that originates from the institutions that contributed WSIs to the TCGA dataset, and its effects on models trained on this datas…

Cited by 19 publications (6 citation statements)
References 35 publications
“…Regardless of model/algorithmic goals, AI/ML must be evaluated using a diverse range of externally sourced data because deep models have a tendency to learn medically irrelevant shortcuts to achieve their medically relevant goals. For instance, it has been shown that models trained on The Cancer Genome Atlas (TCGA) WSIs for cancer subtype classification learned to distinguish hospitals and medical centers that provided WSIs [59]. Additionally, researchers usually evaluate their models on their own data from their own institution.…”
Section: Discussion (mentioning)
confidence: 99%
“…It is also worth noting that implementing a system that enables rapid and consistent imaging, correction, and virtual staining of tissue samples would significantly enhance stain uniformity/repeatability. This is particularly crucial considering the lab-based biases present in extensive and reputable databases, such as the digital image collection of The Cancer Genome Atlas (TCGA) (Dehkharghanian et al., 2023).…”
Section: Discussion (mentioning)
confidence: 99%
“…The widespread use of The Cancer Genome Atlas (TCGA) dataset, as seen in 42% of the studies included in our review, further underscores the importance of addressing dataset biases. Some models trained on TCGA have shown a tendency to recognize specific institutional patterns, which, although not medically relevant, could unintentionally affect model performance [88, 89]. Moreover, the lack of cross-validation among different cohorts, potential lab-induced tissue artifacts, and the biases from institutional patterns limit model generalizability and clinical application.…”
Section: Discussion (mentioning)
confidence: 99%
“…The widespread use of The Cancer Genome Atlas (TCGA) dataset, as seen in 42% of the studies included in our review, further underscores the importance of addressing dataset biases. Some models trained on TCGA have shown a tendency to recognize specific institutional patterns, which, although not medically relevant, could unintentionally affect model performance [88, 89].…”
Section: Navigating the Future: Challenges and Improvements in BC CPath (mentioning)
confidence: 99%