Evaluation and mitigation of the limitations of large language models in clinical decision-making

Hager, Paul; Jungmann, Friederike; Holland, Robbie; Bhagat, Kunal; Hubrecht, Inga; Knauer, Manuel; Vielhauer, Jakob; Makowski, Marcus; Braren, Rickmer; Kaissis, Georgios; Rueckert, Daniel

doi:10.1038/s41591-024-03097-1

Cited by 32 publications

(1 citation statement)

References 37 publications

“…We did not undertake fine-tuning or prompt-tuning in this analysis; these procedures may increase performance on specific clinical decision-making tasks. 21 Therefore, it may be possible to increase overall performance, and it is possible that performance may improve with future versions of GPT or with specialized LLMs. However, the approach we present here is similar to that of the 23 previous studies summarized in Figure 2A and supplemental Table 1.…”

Section: Limitations (mentioning)

confidence: 99%

Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools

Reese, Chimirri, Bridges et al. 2024. Preprint.

Large language models (LLMs) have shown great promise in supporting differential diagnosis, but the 23 available published studies on diagnostic accuracy evaluated small cohorts (number of cases, 30-422, mean 104) and evaluated LLM responses subjectively by manual curation (23/23 studies). The performance of LLMs for rare disease diagnosis has not been evaluated systematically. Here, we perform a rigorous and large-scale analysis of the performance of GPT-4 in prioritizing candidate diagnoses, using the largest-ever cohort of rare disease patients. Our computational study used 5267 computational case reports from previously published data. Each case was formatted as a Global Alliance for Genomics and Health (GA4GH) phenopacket, in which clinical anomalies were represented as Human Phenotype Ontology (HPO) terms. We developed software to generate prompts from each phenopacket. Prompts were sent to Generative Pre-trained Transformer 4 (GPT-4), and the rank of the correct diagnosis, if present in the response, was recorded. The mean reciprocal rank (MRR) of the correct diagnosis was 0.24 (with the reciprocal of the MRR corresponding to a rank of 4.2), and the correct diagnosis was placed in rank 1 in 19.2% of the cases, in the first 3 ranks in 28.6%, and in the first 10 ranks in 32.5%. Our study is the largest to be reported to date and provides a realistic estimate of the performance of GPT-4 in rare disease medicine.
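The ranking metrics quoted in this abstract (mean reciprocal rank and top-k placement rates) can be illustrated with a minimal sketch. This is not the authors' software. The function names and the toy rank list are hypothetical, and the sketch assumes that a case whose correct diagnosis is absent from the response contributes a reciprocal rank of 0, a common convention that the abstract does not confirm.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all cases.

    Each entry is the 1-based rank of the correct diagnosis, or None when
    the correct diagnosis is missing from the response (assumed to score 0).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


def top_k_rate(ranks: Sequence[Optional[int]], k: int) -> float:
    """Fraction of cases whose correct diagnosis appears within the first k ranks."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)


# Hypothetical toy data: eight cases, three with no correct diagnosis returned.
toy_ranks = [1, 3, None, 2, None, 10, 1, None]

mrr = mean_reciprocal_rank(toy_ranks)
top1 = top_k_rate(toy_ranks, 1)
top3 = top_k_rate(toy_ranks, 3)
top10 = top_k_rate(toy_ranks, 10)

# The abstract's parenthetical: an MRR of 0.24 corresponds to a rank of about 1/0.24.
rank_equivalent = round(1 / 0.24, 1)
```

On the toy list, the top-1 rate is 2 of 8 cases and the top-10 rate is 5 of 8. The last line reproduces the abstract's arithmetic: 1/0.24 rounds to 4.2.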

Testing and Evaluation of Health Care Applications of Large Language Models

Bedi, Liu, Orr-Ewing et al. 2024. JAMA.

Importance: Large language models (LLMs) can assist in various health care activities, but current evaluation approaches may not adequately identify the most useful application areas.

Objective: To summarize existing evaluations of LLMs in health care in terms of 5 components: (1) evaluation data type, (2) health care task, (3) natural language processing (NLP) and natural language understanding (NLU) tasks, (4) dimension of evaluation, and (5) medical specialty.

Data Sources: A systematic search of PubMed and Web of Science was performed for studies published between January 1, 2022, and February 19, 2024.

Study Selection: Studies evaluating 1 or more LLMs in health care.

Data Extraction and Synthesis: Three independent reviewers categorized studies via keyword searches based on the data used, the health care tasks, the NLP and NLU tasks, the dimensions of evaluation, and the medical specialty.

Results: Of 519 studies reviewed, published between January 1, 2022, and February 19, 2024, only 5% used real patient care data for LLM evaluation. The most common health care tasks were assessing medical knowledge such as answering medical licensing examination questions (44.5%) and making diagnoses (19.5%). Administrative tasks such as assigning billing codes (0.2%) and writing prescriptions (0.2%) were less studied. For NLP and NLU tasks, most studies focused on question answering (84.2%), while tasks such as summarization (8.9%) and conversational dialogue (3.3%) were infrequent. Almost all studies (95.4%) used accuracy as the primary dimension of evaluation; fairness, bias, and toxicity (15.8%), deployment considerations (4.6%), and calibration and uncertainty (1.2%) were infrequently measured. Finally, in terms of medical specialty area, most studies were in generic health care applications (25.6%), internal medicine (16.4%), surgery (11.4%), and ophthalmology (6.9%), with nuclear medicine (0.6%), physical medicine (0.4%), and medical genetics (0.2%) being the least represented.

Conclusions and Relevance: Existing evaluations of LLMs mostly focus on accuracy of question answering for medical examinations, without consideration of real patient care data. Dimensions such as fairness, bias, and toxicity and deployment considerations received limited attention. Future evaluations should adopt standardized applications and metrics, use clinical data, and broaden focus to include a wider range of tasks and specialties.

Large Language Models—Misdiagnosing Diagnostic Excellence?

Ranji 2024. JAMA Netw Open.

When the results of the Goh et al study 1 were presented at a recent National Academies of Medicine meeting, the audience was amazed, and concerned. The randomized clinical trial assessed diagnostic performance by generalist physicians, who were asked to provide diagnoses for 6 simulated cases using either conventional online resources or a large language model (LLM) (ChatGPT Plus [GPT-4]; OpenAI) in addition to standard resources. The study also evaluated the ability of the LLM to solve the cases alone. The authors developed a rubric for measuring diagnostic performance in which blinded experts evaluated participants' overall clinical reasoning process, including their proposed final diagnosis, their differential diagnosis, and factors supporting or …
