Large language models are few-shot clinical information extractors

Agrawal, Monica; Hegselmann, Stefan; Lang, Hunter; Kim, Yoon; Sontag, David

doi:10.18653/v1/2022.emnlp-main.130

Cited by 111 publications

(25 citation statements)

References 0 publications

“…One of the key aspects of prompt engineering is the number of examples or shots that are provided to the model along with the prompt. Few-shot prompting is a technique that provides the model with a few examples of input-output pairs, while zero-shot prompting does not provide any examples [3, 18]. By contrasting these strategies, we aim to shed light on the most efficient and effective ways to leverage prompt engineering in clinical NLP.…”

Section: Introduction

mentioning

confidence: 99%

An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study

Sivarajkumar, Kelley, Samolyk-Mazzanti et al. 2024

JMIR Med Inform

Background: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches.

Objective: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types (heuristic and ensemble prompts), for zero-shot and few-shot clinical information extraction using pretrained language models.

Methods: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches.

Results: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types.

Conclusions: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area.

“…However, these large-scale data are often unstructured, requiring extensive processing and labeling, which poses the most significant bottleneck (12). In a precise field such as medicine, errors in the labeling and preprocessing process can lead to poor outcomes in terms of the reliability of AI models and the impact of model results (11, 13-17). Therefore, domain experts are often employed for labeling tasks in the present day, a process that is both time-consuming and costly (6, 18).…”

Section: Introduction

mentioning

confidence: 99%

“…Unstructured EHRs are characterized by a wide array of data formats, including free-text clinical notes, laboratory findings, and imaging narratives. Each of these formats exhibits unique terminological and syntactical features, ambiguous jargon, and nonstandard phrasal structures (17, 19-23). To mitigate such complexity, the encoding of patients' diseases in EHRs using universally accepted disease classification coding systems such as the International Classification of Disease (ICD) facilitates the clustering of patients, providing convenience.…”

Section: Introduction

mentioning

confidence: 99%

Human-Like Named Entity Recognition with Large Language Models in Unstructured Text-based Electronic Healthcare Records: An Evaluation Study

Akbasli, Birbilen, Teksam 2024

Preprint

Background: The integration of big data and artificial intelligence (AI) in healthcare, particularly through the analysis of electronic health records (EHR), presents significant opportunities for improving diagnostic accuracy and patient outcomes. However, the challenge of processing and accurately labeling vast amounts of unstructured data remains a critical bottleneck, necessitating efficient and reliable solutions. This study investigates the ability of domain specific, fine-tuned large language models (LLMs) to classify unstructured EHR texts with typographical errors through named entity recognition tasks, aiming to improve the efficiency and reliability of supervised learning AI models in healthcare.

Methods: Clinical notes from pediatric emergency room admissions at Hacettepe University İhsan Doğramacı Children's Hospital from 2018 to 2023 were analyzed. The data were preprocessed with open source Python libraries and categorized using a pretrained GPT-3 model, "text-davinci-003," before and after fine-tuning with domain-specific data on respiratory tract infections (RTI). The model's predictions were compared against ground truth labels established by pediatric specialists.

Results: Out of 24,229 patient records classified as "Others ()", 18,879 were identified without typographical errors and confirmed for RTI through filtering methods. The fine-tuned model achieved a 99.96% accuracy, significantly outperforming the pretrained model's 78.54% accuracy in identifying RTI cases among the remaining records. The fine-tuned model demonstrated superior performance metrics across all evaluated aspects compared to the pretrained model.

Conclusions: Fine-tuned LLMs can categorize unstructured EHR data with high accuracy, closely approximating the performance of domain experts. This approach significantly reduces the time and costs associated with manual data labeling, demonstrating the potential to streamline the processing of large-scale healthcare data for AI applications.

“…Instruction-tuned large language models (LLMs) have been successful at knowledge retrieval [1-4], text extraction [5-9], summarization [10-12], and reasoning [13-17] tasks without requiring domain-specific fine-tuning. Prompting LLMs with instruction and data contexts described in natural language has emerged as a means for task and domain specification as well as controllability of model behaviors.…”

mentioning

confidence: 99%

“…Because there is no single postoperative outcome measure of risk, LLM capabilities were surveyed on 8 different tasks: (1) assignment of the American Society of Anesthesiologists Physical Status (ASA-PS) classification [25-27], (2) prediction of postanesthesia care unit (PACU) phase 1 duration, (3) hospital admission, (4) hospital duration, (5) intensive care unit (ICU) admission, (6) ICU duration, (7) whether the patient will have an unanticipated hospital admission, and (8) whether the patient will die in the hospital. The LLM-generated responses were compared against ground-truth labels extracted from patients' EHR, and performance metrics were reported based on this comparison (Figure 1).…”

mentioning

confidence: 99%

Large Language Model Capabilities in Perioperative Risk Prediction and Prognostication

Chung, Fong, Walters et al. 2024

JAMA Surg

Importance: General-domain large language models may be able to perform risk stratification and predict postoperative outcome measures using a description of the procedure and a patient’s electronic health record notes.

Objective: To examine predictive performance on 8 different tasks: prediction of American Society of Anesthesiologists Physical Status (ASA-PS), hospital admission, intensive care unit (ICU) admission, unplanned admission, hospital mortality, postanesthesia care unit (PACU) phase 1 duration, hospital duration, and ICU duration.

Design, Setting, and Participants: This prognostic study included task-specific datasets constructed from 2 years of retrospective electronic health records data collected during routine clinical care. Case and note data were formatted into prompts and given to the large language model GPT-4 Turbo (OpenAI) to generate a prediction and explanation. The setting included a quaternary care center comprising 3 academic hospitals and affiliated clinics in a single metropolitan area. Patients who had a surgery or procedure with anesthesia and at least 1 clinician-written note filed in the electronic health record before surgery were included in the study. Data were analyzed from November to December 2023.

Exposures: Compared original notes, note summaries, few-shot prompting, and chain-of-thought prompting strategies.

Main Outcomes and Measures: F1 score for binary and categorical outcomes. Mean absolute error for numerical duration outcomes.

Results: Study results were measured on task-specific datasets, each with 1000 cases with the exception of unplanned admission, which had 949 cases, and hospital mortality, which had 576 cases. The best results for each task included an F1 score of 0.50 (95% CI, 0.47-0.53) for ASA-PS, 0.64 (95% CI, 0.61-0.67) for hospital admission, 0.81 (95% CI, 0.78-0.83) for ICU admission, 0.61 (95% CI, 0.58-0.64) for unplanned admission, and 0.86 (95% CI, 0.83-0.89) for hospital mortality prediction. Performance on duration prediction tasks was universally poor across all prompt strategies for which the large language model achieved a mean absolute error of 49 minutes (95% CI, 46-51 minutes) for PACU phase 1 duration, 4.5 days (95% CI, 4.2-5.0 days) for hospital duration, and 1.1 days (95% CI, 0.9-1.3 days) for ICU duration prediction.

Conclusions and Relevance: Current general-domain large language models may assist clinicians in perioperative risk stratification on classification tasks but are inadequate for numerical duration predictions. Their ability to produce high-quality natural language explanations for the predictions may make them useful tools in clinical workflows and may be complementary to traditional risk prediction models.
