2022
DOI: 10.1101/2022.08.30.22279318
Preprint

Evaluating Progress in Automatic Chest X-Ray Radiology Report Generation

Abstract: The application of artificial intelligence (AI) to medical image interpretation tasks has largely been limited to the identification of a handful of individual pathologies. In contrast, the generation of complete narrative radiology reports more closely matches how radiologists communicate diagnostic information in clinical workflows. Recent progress in AI on vision-language tasks has enabled the possibility of generating high-quality radiology reports from medical images. Automated metrics to evaluate the quality…

Cited by 12 publications (21 citation statements)
References 33 publications
“…Delbrouck et al 82 proposed a RadGraph-based metric that calculates the overlap of the clinical entities and relations between generated reports and references annotated under the RadGraph schema. Recently, Yu et al 85 examined the correlation between existing automatic evaluation metrics, including BLEU 86, BERTScore, F1 CheXpert, and RadGraph F1, and the scores given by radiologists evaluating the factuality of generated reports. They found that F1 CheXpert and BLEU did not align with the radiologists' judgments, whereas BERTScore and RadGraph F1 were more reliable.…”
Section: Radiology Report Generation (mentioning)
Confidence: 99%
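To make the RadGraph-based overlap concrete, here is a minimal Python sketch. It assumes a hypothetical upstream annotator has already extracted clinical entities as (text, label) pairs and relations as (head, label, tail) triples under the RadGraph schema; the matching rules of the published RadGraph F1 may differ in detail, so treat this as an illustration rather than the authors' implementation.

def overlap_f1(pred: set, ref: set) -> float:
    # F1 of the exact-match overlap between two annotation sets.
    if not pred and not ref:
        return 1.0  # both empty: trivially perfect agreement
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Hypothetical annotations for one generated/reference report pair.
pred_entities = {("pleural effusion", "OBS-DP"), ("lungs", "ANAT-DP")}
ref_entities = {("pleural effusion", "OBS-DP"), ("cardiomegaly", "OBS-DP")}
pred_relations = {("pleural effusion", "located_at", "lungs")}
ref_relations = {("pleural effusion", "located_at", "lungs"),
                 ("cardiomegaly", "suggestive_of", "heart")}

# Report the mean of entity-level and relation-level overlap F1.
entity_f1 = overlap_f1(pred_entities, ref_entities)
relation_f1 = overlap_f1(pred_relations, ref_relations)
print((entity_f1 + relation_f1) / 2)

Exact string matching is the simplest choice here; span-level or partial matching would change the scores but not the overall shape of the computation.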
“…Moreover, because the radiology report generation task requires combining information from both radiology images and the associated text reports, we believe cross-modality vision-language foundation models 88 should be explored to improve the faithfulness of radiology report generation methods in the future. As for evaluation metrics, there is, as described above, only one paper 85 analyzing the correlation between automatic factuality evaluation metrics and expert scores based on human annotation. More effort is needed to develop automatic factuality evaluation metrics and to create public benchmark datasets that support the meta-evaluation of these metrics.…”
Section: Radiology Report Generation (mentioning)
Confidence: 99%
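As a concrete illustration of such a meta-evaluation, the sketch below correlates an automatic metric's scores with radiologist factuality scores using Kendall's tau. All numbers are invented for illustration and are not data from Yu et al 85.

from scipy.stats import kendalltau

# One entry per generated report: an automatic metric score and a
# radiologist-assigned factuality score (both hypothetical).
metric_scores = [0.91, 0.42, 0.77, 0.30, 0.65]
radiologist_scores = [5, 2, 4, 2, 3]

tau, p_value = kendalltau(metric_scores, radiologist_scores)
print(f"Kendall tau = {tau:.2f} (p = {p_value:.3f})")

A metric whose ranking of reports tracks the radiologists' ranking (tau near 1) is reliable in the sense used above; a public benchmark of paired metric and expert scores is what would make this kind of meta-evaluation routine.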
“…Prior works have approached this challenge by proposing automated metrics for evaluating the clinical quality of generated reports (Jain et al, 2021; Khanna et al, 2023; Liu et al, 2019; Yu et al, 2023), but significant limitations remain. First, there has been a paucity of comprehensive evaluation of automated reports against reports produced by human experts (certified radiologists), which are themselves known to vary in style and quality.…”
Section: Introduction (mentioning)
Confidence: 99%
“…In addition to the above evaluation challenges, there remains considerable headroom for improving the clinical accuracy of existing AI report generation models (Yu et al, 2023). Recent breakthroughs in multi-modal foundation models (Li et al, 2023a) have demonstrated that AI systems trained on vast quantities of unlabelled data can be adapted to achieve state-of-the-art accuracy on a wide range of specialised downstream tasks, including biomedical problems (Li et al, 2023b).…”
Section: Introduction (mentioning)
Confidence: 99%