Background
The impression section integrates the key findings of a radiology report but can be subjective and variable. We sought to clinically validate a fine-tuned open-source large language model (LLM) that automatically generates impressions summarizing radiology reports, and to evaluate it across different imaging modalities and hospitals.
Methods
In this institutional review board-approved retrospective study, we fine-tuned an open-source LLM to generate the impression from the remainder of the radiology report. CT, US, and MRI radiology reports from Hospital 1 (n = 372,716) and Hospital 2 (n = 60,049), both under a single institution, were included. The ROUGE score was used for automatic natural language evaluation, and a reader study with five thoracic radiologists was performed for clinical evaluation of CT chest impressions against a subspecialist baseline. We also stratified the reader study results by diagnosis category and by original impression length to gauge case complexity.
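As an illustration of the automatic metric, ROUGE-L can be sketched as an F-measure over the longest common subsequence (LCS) of tokens shared by a reference impression and a generated one. The following is a minimal sketch, not the study's evaluation code: the abstract does not state which ROUGE-L variant, tokenizer, or library was used, so the lowercased whitespace tokenization, the balanced F1 weighting, and the function names `lcs_length` and `rouge_l_f1` are all assumptions made for illustration. Scores are returned on a 0 to 1 scale; the values reported in this abstract appear to be on a 0 to 100 scale.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumed: whitespace tokens, lowercased)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens found in the LCS
    recall = lcs / len(ref)      # share of reference tokens found in the LCS
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the study data.
    reference = "no acute pulmonary embolism"
    generated = "no pulmonary embolism"
    print(round(100 * rouge_l_f1(reference, generated), 2))
```

Because the LCS respects token order but allows gaps, a generated impression that omits a word still earns credit for the remaining in-order matches, which is why ROUGE-L is commonly used for summarization-style outputs.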
Results
The LLM achieved ROUGE-L scores of 46.51, 44.2, and 50.96 on the Hospital 1 dataset for the CT, US, and MRI modalities, respectively. Upon external validation on the independent Hospital 2 test dataset, the model achieved ROUGE-L scores of 40.74, 37.89, and 24.61 for the same modalities. In the reader performance study, the model achieved overall mean scores of 3.56/4 for clinical accuracy, 3.92/4 for grammatical accuracy, and 3.37/4 for stylistic quality, with a mean edit time of 18.29 seconds and a mean edit distance of 12.32 words. By diagnosis category, the LLM achieved its highest clinical accuracy ratings for acute/emergent findings. By impression length, the LLM performed best in clinical accuracy on shorter impressions.
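The edit distance above is reported in words, which suggests a word-level count of the changes readers made to the generated impression. The abstract does not define the measure, so the sketch below assumes a standard Levenshtein distance over whitespace-separated tokens (insertions, deletions, and substitutions each cost 1); the function name `word_edit_distance` is hypothetical.

```python
def word_edit_distance(original, edited):
    """Word-level Levenshtein distance between two strings.

    Assumption: tokens are whitespace-separated and compared exactly;
    the study's actual definition may differ.
    """
    a = original.split()
    b = edited.split()
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical example strings, not taken from the study data.
    generated = "no acute findings"
    edited = "no acute cardiopulmonary findings"
    print(word_edit_distance(generated, edited))
```

Under this definition, a lower value means a reader changed fewer words before accepting the impression.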
Conclusions
We demonstrated that a fine-tuned open-source LLM can generate radiological impressions of high clinical accuracy, grammatical accuracy, and stylistic quality across multiple imaging modalities and hospitals.