Templated Text Synthesis for Expert-Guided Multi-Label Extraction from Radiology Reports

Schrempf, Patrick; Watson, Hannah; Park, Eunsoo; Pajak, Maciej; MacKinnon, Hamish; Muir, Keith W.; Harris-Birtill, David; O’Neil, Alison Q.

doi:10.3390/make3020015

Cited by 8 publications

(6 citation statements)

References 30 publications


“…There is potential for exploration with other types of machine learning models that may perform better across different data sources. Schrempf et al ( 18 ) compared the EdIE-R and ALARM+ approaches on their dataset and found similar findings. However, the reason for the differences could relate to other variances in the training data, such as underlying population characteristics.…”

Section: Discussion (mentioning)

confidence: 71%

“…Chapman et al ( 23 ), who developed ConTEXT, which looks for contextual features (negation, temporality, or who has experienced the condition, e.g., patient, family, and member), have shown how this may assist annotators when labelling by identifying these uncertain conditions to support classification. Other work, such as that by the ALARM+ authors, considered how a template method could improve understanding of uncertain terminology ( 18 ). They defined terminology that should be used to map to uncertain, positive, and negative entities, and this vocabulary was gathered throughout the annotation.…”

Section: Discussion (mentioning)

confidence: 99%

“…6 This part of the project aimed to detect indications and contra-indications for giving thrombolysis treatment to patients with acute stroke. ALARM+ was trained on anonymised radiology reports corresponding to non-contrast head CT scans from the stroke event and the 18 months preceding and following that event and on synthetic text data (18). This model provides sentence-level predictions regarding whether a phenotype is present, negative, uncertain, or not mentioned.…”

Section: NLP Tools (mentioning)

confidence: 99%


Understanding the performance and reliability of NLP tools: a comparison of four NLP tools predicting stroke phenotypes in radiology reports

Casey, Davidson, Grover et al. 2023

Front. Digit. Health

Self Cite


Background: Natural language processing (NLP) has the potential to automate the reading of radiology reports, but there is a need to demonstrate that NLP methods are adaptable and reliable for use in real-world clinical applications.

Methods: We tested the F1 score, precision, and recall to compare NLP tools on a cohort from a study on delirium using images and radiology reports from NHS Fife and a population-based cohort (Generation Scotland) that spans multiple National Health Service health boards. We compared four off-the-shelf rule-based and neural NLP tools (namely, EdIE-R, ALARM+, ESPRESSO, and Sem-EHR) and reported on their performance for three cerebrovascular phenotypes, namely, ischaemic stroke, small vessel disease (SVD), and atrophy. Clinical experts from the EdIE-R team defined phenotypes using labelling techniques developed in the development of EdIE-R, in conjunction with an expert researcher who read underlying images.

Results: EdIE-R obtained the highest F1 score in both cohorts for ischaemic stroke, ≥93%, followed by ALARM+, ≥87%. The F1 score of ESPRESSO was ≥74%, whilst that of Sem-EHR is ≥66%, although ESPRESSO had the highest precision in both cohorts, 90% and 98%. For F1 scores for SVD, EdIE-R scored ≥98% and ALARM+ ≥90%. ESPRESSO scored lowest with ≥77% and Sem-EHR ≥81%. In NHS Fife, F1 scores for atrophy by EdIE-R and ALARM+ were 99%, dropping in Generation Scotland to 96% for EdIE-R and 91% for ALARM+. Sem-EHR performed lowest for atrophy at 89% in NHS Fife and 73% in Generation Scotland. When comparing NLP tool output with brain image reads using F1 scores, ALARM+ scored 80%, outperforming EdIE-R at 66% in ischaemic stroke. For SVD, EdIE-R performed best, scoring 84%, with Sem-EHR 82%. For atrophy, EdIE-R and both ALARM+ versions were comparable at 80%.

Conclusions: The four NLP tools show varying F1 (and precision/recall) scores across all three phenotypes, although more apparent for ischaemic stroke. If NLP tools are to be used in clinical settings, this cannot be performed “out of the box.” It is essential to understand the context of their development to assess whether they are suitable for the task at hand or whether further training, re-training, or modification is required to adapt tools to the target task.


“…However, performance of these models is often lower than what would be required clinically without additional feature engineering 13,15 or fine-tuning on thousands of manually-derived labels 14,16 specific to the task. This likely reflects the fact that medical report text has specific structure and meaning while comprising only a small proportion of the general language used to train these models.…”

Section: Many Published Methods For Extraction Of Multiple Values Use... (mentioning)

confidence: 99%

“…Machine learning has been used on clinical reports but not specifically for extraction of concepts from echocardiogram reports. Many have used various implementations of BERT (Bidirectional Encoder Representations from Transformers), an early large language model (LLM), to extract radiographic clinical findings 13 , mentions of devices 14 , study characteristics 15 , and result keywords 16 from radiology or pathology reports.…”

Section: Introduction (mentioning)

confidence: 99%

Mapping echocardiogram reports to a structured ontology: a task for statistical machine learning or large language models?

Subramaniam, Rizvi, Ramesh et al. 2024

Preprint


Background: Big data has the potential to revolutionize echocardiography by enabling novel research and rigorous, scalable quality improvement. Text reports are a critical part of such analyses, and ontology is a key strategy for promoting interoperability of heterogeneous data through consistent tagging. Currently, echocardiogram reports include both structured and free text and vary across institutions, hampering attempts to mine text for useful insights. Natural language processing (NLP) can help and includes both non-deep learning and deep-learning (e.g., large language model, or LLM) based techniques. Challenges to date in using echo text with LLMs include small corpus size, domain-specific language, and high need for accuracy and clinical meaning in model results.

Methods: We tested whether we could map echocardiography text to a structured, three-level hierarchical ontology using NLP. We used two methods: statistical machine learning (EchoMap) and one-shot inference using the Generative Pre-trained Transformer (GPT) large language model. We tested against eight datasets from 24 different institutions and compared both methods against clinician-scored ground truth.

Results: Despite all adhering to clinical guidelines, there were notable differences by institution in what information was included in data dictionaries for structured reporting. EchoMap performed best in mapping test set sentences to the ontology, with validation accuracy of 98% for the first level of the ontology, 93% for the first and second level, and 79% for the first, second, and third levels. EchoMap retained good performance across external test datasets and displayed the ability to extrapolate to examples not initially included in training. EchoMap’s accuracy was comparable to one-shot GPT at the first level of the ontology and outperformed GPT at second and third levels.

Conclusions: We show that statistical machine learning can achieve good performance on text mapping tasks and may be especially useful for small, specialized text datasets. Furthermore, this work highlights the utility of a high-resolution, standardized cardiac ontology to harmonize reports across institutions.


Clinically Focussed Evaluation of Anomaly Detection and Localisation Methods Using Inpatient CT Head Data

Kascenas, Wang, Schrempf et al. 2024

Lecture Notes in Computer Science

