Preprint (2024)
DOI: 10.21203/rs.3.rs-3940387/v1

Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation

Ryutaro Tanno,
David Barrett,
Andrew Sellergren
et al.

Abstract: Radiology reports are an instrumental part of modern medicine, informing key clinical decisions such as diagnosis and treatment. The worldwide shortage of radiologists, however, restricts access to expert care and imposes heavy workloads, contributing to avoidable errors in report delivery. While recent progress in automated report generation with vision-language models offers clear potential to ameliorate this situation, the path toward real-world adoption has been stymied by the challenge of evaluating the c…

Cited by 1 publication (1 citation statement)
References: 31 publications
“…Though promising, this framework has not been thoroughly studied or implemented. In another study evaluating a radiological vision-language model’s output, the authors employed both automated evaluation via popular NLP metrics such as BLEU and ROUGE-L and human expert evaluation, noting that the former could not properly assess factual correctness and consistency – properties that are vital for clinical utility (Tanno et al., 2024). Another study on LLM outputs for medical evidence summarization tasks also employed both automatic and human evaluation.…”
Section: Related Work
Confidence: 99%
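
The gap between n-gram overlap and factual correctness noted in the citation statement is easy to demonstrate. Below is a minimal, hypothetical Python sketch (not code from the preprint or the citing study; the example sentences are invented for illustration) that scores a radiology-style sentence whose clinical meaning is inverted by a single dropped negation. It assumes the nltk and rouge-score packages are installed.

```python
# Minimal sketch: BLEU and ROUGE-L reward surface overlap, so a candidate
# sentence with inverted clinical meaning can still score highly.
# Assumed dependencies: pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

# Hypothetical reference finding and a model output that drops the negation.
reference = "no evidence of pneumothorax or pleural effusion"
candidate = "evidence of pneumothorax or pleural effusion"

# BLEU operates on token lists; smoothing guards against zero n-gram counts
# on short sentences.
bleu = sentence_bleu(
    [reference.split()],
    candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures longest-common-subsequence overlap between the strings.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU:    {bleu:.3f}")    # high overlap despite inverted meaning
print(f"ROUGE-L: {rouge_l:.3f}") # likewise rewards the shared word span
```

Because both metrics reward shared word sequences, the negated and non-negated sentences score above 0.8 despite having opposite clinical meanings; this is exactly the blind spot that the complementary human expert evaluation described in Tanno et al. (2024) is meant to address.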