Six ways large language models are changing healthcare

Webster, Paul

doi:10.1038/s41591-023-02700-1

Cited by 22 publications

(3 citation statements)

References 4 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…It should be noted that the interviews were conducted prior to the release of ChatGPT and other chatbots powered by large language models (LLMs). 30 Interviews were audio recorded and transcribed and were analysed in their original language using conventional content analysis with the assistance of the qualitative software MAXQDA (VERBI Software). 31 …”

Section: Methodsmentioning

confidence: 99%

Building a house without foundations? A 24-country qualitative interview study on artificial intelligence in intensive care medicine

McLennan,

Fiske,

Celi

2024

BMJ Health Care Inform

View full text Add to dashboard Cite

ObjectivesTo explore the views of intensive care professionals in high-income countries (HICs) and lower-to-middle-income countries (LMICs) regarding the use and implementation of artificial intelligence (AI) technologies in intensive care units (ICUs).MethodsIndividual semi-structured qualitative interviews were conducted between December 2021 and August 2022 with 59 intensive care professionals from 24 countries. Transcripts were analysed using conventional content analysis.ResultsParticipants had generally positive views about the potential use of AI in ICUs but also reported some well-known concerns about the use of AI in clinical practice and important technical and non-technical barriers to the implementation of AI. Important differences existed between ICUs regarding their current readiness to implement AI. However, these differences were not primarily between HICs and LMICs, but between a small number of ICUs in large tertiary hospitals in HICs, which were reported to have the necessary digital infrastructure for AI, and nearly all other ICUs in both HICs and LMICs, which were reported to neither have the technical capability to capture the necessary data or use AI, nor the staff with the right knowledge and skills to use the technology.ConclusionPouring massive amounts of resources into developing AI without first building the necessary digital infrastructure foundation needed for AI is unethical. Real-world implementation and routine use of AI in the vast majority of ICUs in both HICs and LMICs included in our study is unlikely to occur any time soon. ICUs should not be using AI until certain preconditions are met.

show abstract

Section: Methodsmentioning

confidence: 99%

Building a house without foundations? A 24-country qualitative interview study on artificial intelligence in intensive care medicine

McLennan,

Fiske,

Celi

2024

BMJ Health Care Inform

View full text Add to dashboard Cite

show abstract

“…Within the realm of generative AI, LLMs produce structured, coherent prose in response to text inputs, with broad application in health system operations 6 . Prominent applications such as facilitating clinical note-taking have already been implemented by several health systems in the U.S., and there is excitement in the medical community for improving healthcare efficiency, quality, and patient outcomes 7 8 . A recent report estimates that LLMs could unlock a substantial portion of the $1 trillion in untapped healthcare efficiency improvements, including an estimated savings ranging from 5 to 10 percent of US healthcare spending or approximately $200 billion to $360 billion annually based on 2019 figures 9 10 .…”

Section: Introductionmentioning

confidence: 99%

A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

Bedi,

Liu,

Orr-Ewing

et al. 2024

Preprint

View full text Add to dashboard Cite

Importance: Large Language Models (LLMs) can assist in a wide range of healthcare-related activities. Current approaches to evaluating LLMs make it difficult to identify the most impactful LLM application areas. Objective: To summarize the current evaluation of LLMs in healthcare in terms of 5 components: evaluation data type, healthcare task, Natural Language Processing (NLP)/Natural Language Understanding (NLU) task, dimension of evaluation, and medical specialty. Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between 01-01-2022 and 02-19-2024. Study Selection: Studies evaluating one or more LLMs in healthcare. Data Extraction and Synthesis: Three independent reviewers categorized 519 studies in terms of data used in the evaluation, the healthcare tasks (the what) and the NLP/NLU tasks (the how) examined, the dimension(s) of evaluation, and the medical specialty studied. Results: Only 5% of reviewed studies utilized real patient care data for LLM evaluation. The most popular healthcare tasks were assessing medical knowledge (e.g. answering medical licensing exam questions, 44.5%), followed by making diagnoses (19.5%), and educating patients (17.7%). Administrative tasks such as assigning provider billing codes (0.2%), writing prescriptions (0.2%), generating clinical referrals (0.6%) and clinical notetaking (0.8%) were less studied. For NLP/NLU tasks, the vast majority of studies examined question answering (84.2%). Other tasks such as summarization (8.9%), conversational dialogue (3.3%), and translation (3.1%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias and toxicity (15.8%), robustness (14.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in internal medicine (42%), surgery (11.4%) and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%) and medical genetics (0.2%) being the least represented. Conclusions and Relevance: Existing evaluations of LLMs mostly focused on accuracy of question answering for medical exams, without consideration of real patient care data. Dimensions like fairness, bias and toxicity, robustness, and deployment considerations received limited attention. To draw meaningful conclusions and improve LLM adoption, future studies need to establish a standardized set of LLM applications and evaluation dimensions, perform evaluations using data from routine care, and broaden testing to include administrative tasks as well as multiple medical specialties. Keywords: Large Language Models, Generative Artificial Intelligence, Healthcare, Dimensions of Evaluation, Evaluation Metrics.

show abstract

“…Large language models (LLMs) have the potential to improve healthcare due to their capability to parse complex concepts and generate appropriate responses. LLMs have demonstrated proficiency in tasks across the spectrum of clinical activity, such as medical inquiry responses, dialogue systems, and the synthesis and completion of clinical reports 1 – 5 . One potential high-value area for LLMs is the ability to promote evidence-based practice through providing clinical decision support systems (CDSSs) according to current medical guidelines, which are distillations of both expert opinion and current evidence from clinical trials and are used to drive improvements in patient outcomes through best practices 6 , 7 .…”

Section: Introductionmentioning

confidence: 99%

Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Kresevic,

Giuffrè,

Ajcevic

et al. 2024

npj Digit. Med.

View full text Add to dashboard Cite

Large language models (LLMs) can potentially transform healthcare, particularly in providing the right information to the right provider at the right time in the hospital workflow. This study investigates the integration of LLMs into healthcare, specifically focusing on improving clinical decision support systems (CDSSs) through accurate interpretation of medical guidelines for chronic Hepatitis C Virus infection management. Utilizing OpenAI’s GPT-4 Turbo model, we developed a customized LLM framework that incorporates retrieval augmented generation (RAG) and prompt engineering. Our framework involved guideline conversion into the best-structured format that can be efficiently processed by LLMs to provide the most accurate output. An ablation study was conducted to evaluate the impact of different formatting and learning strategies on the LLM’s answer generation accuracy. The baseline GPT-4 Turbo model’s performance was compared against five experimental setups with increasing levels of complexity: inclusion of in-context guidelines, guideline reformatting, and implementation of few-shot learning. Our primary outcome was the qualitative assessment of accuracy based on expert review, while secondary outcomes included the quantitative measurement of similarity of LLM-generated responses to expert-provided answers using text-similarity scores. The results showed a significant improvement in accuracy from 43 to 99% (p < 0.001), when guidelines were provided as context in a coherent corpus of text and non-text sources were converted into text. In addition, few-shot learning did not seem to improve overall accuracy. The study highlights that structured guideline reformatting and advanced prompt engineering (data quality vs. data quantity) can enhance the efficacy of LLM integrations to CDSSs for guideline delivery.

show abstract

Six ways large language models are changing healthcare

Cited by 22 publications

References 4 publications

Building a house without foundations? A 24-country qualitative interview study on artificial intelligence in intensive care medicine

Building a house without foundations? A 24-country qualitative interview study on artificial intelligence in intensive care medicine

A Systematic Review of Testing and Evaluation of Healthcare Applications of Large Language Models (LLMs)

Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework

Contact Info

Product

Resources

About