2021
DOI: 10.48550/arxiv.2103.06205
Preprint

Are we using appropriate segmentation metrics? Identifying correlates of human expert perception for CNN training beyond rolling the DICE coefficient

Abstract: In this study, we explore quantitative correlates of qualitative human expert perception. We discover that current quality metrics and loss functions, considered for biomedical image segmentation tasks, correlate moderately with segmentation quality assessment by experts, especially for small yet clinically relevant structures, such as the enhancing tumor in brain glioma. We propose a method employing classical statistics and experimental psychology to create complementary compound loss functions for modern de…

Cited by 17 publications (25 citation statements)
References 28 publications
“…The global Dice scores for the different classes were strongly influenced by the prevalence of each class in the data. The classes that included many voxels and were present in most or all slices had high global Dice values, whereas the classes that were only present in a few slices and were completely absent for some scans had low global Dice values [ 36 ]. This is a well-known limitation of the Dice metric [ 37 ] and is particularly evident for our application where there is also a high degree of uncertainty in the precise voxel-wise labelling of the classes, especially in the hold out test dataset.…”
Section: Discussion (mentioning)
confidence: 99%
“…Therefore, a hypothetical perfect segmentation made by an algorithm (extraction of the whole tumor and no non-tumoral pixels) would not obtain a maximum score since the metric measures the difference between the contours drawn by humans and those drawn by the algorithm. Furthermore, a weak correlation between expert assessment and traditional metrics has been noted 27 . We used the DICE coefficient, the most used metric in biomedical image analysis competition 28 , even though there is no consensus on the method to evaluate the models.…”
Section: Discussion (mentioning)
confidence: 99%
“…The Dice score and the Hausdorff distance are the two most standard metrics used for measuring the quality of automatic segmentations [4]. However, those two metrics do not directly measure the trustworthiness of segmentation algorithms [33]. Therefore, we have also conducted an evaluation of the trustworthiness of the automatic segmentations as perceived by radiologists.…”
Section: Stratified Evaluation Across Brain Conditions and Acquisitio... (mentioning)
confidence: 99%
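
Several of the statements above note that Dice values drop for small or rarely present classes. The following is a minimal sketch of that effect, not code from the cited papers. It assumes masks are given as collections of voxel indices, and the helper name `dice`, the structure sizes, and the convention of returning 1.0 when both masks are empty are all illustrative assumptions.

```python
def dice(pred, truth):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) for two sets of voxel indices."""
    pred, truth = set(pred), set(truth)
    denom = len(pred) + len(truth)
    if denom == 0:
        # Both masks empty: Dice is undefined. Returning 1.0 is an assumed
        # convention for this sketch; other conventions exist.
        return 1.0
    return 2.0 * len(pred & truth) / denom


# Hypothetical structures: a large one (1000 voxels) and a small one (10 voxels).
truth_large = set(range(1000))
truth_small = set(range(10))

# The same absolute error in both cases: 5 voxels are missed.
pred_large = truth_large - set(range(5))
pred_small = truth_small - set(range(5))

dice_large = dice(pred_large, truth_large)  # 2*995 / (995 + 1000)
dice_small = dice(pred_small, truth_small)  # 2*5 / (5 + 10)

# A class that is present in the reference but entirely missed scores 0.
dice_missed = dice(set(), truth_small)

print(f"large structure: {dice_large:.4f}")
print(f"small structure: {dice_small:.4f}")
print(f"missed structure: {dice_missed:.4f}")
```

With an identical five-voxel error, the large structure keeps a Dice close to 1 while the small structure falls to about two thirds, which is one way to see why the metric penalizes small but clinically relevant structures more heavily.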