Assessing GPT-3.5 and GPT-4 in Generating International Classification of Diseases Billing Codes

Soroush, Ali; Glicksberg, Benjamin S.; Zimlichman, Eyal; Barash, Yiftach; Freeman, Robert; Charney, Alexander W.; Nadkarni, Girish N; Klang, Eyal

doi:10.1101/2023.07.07.23292391

Cited by 6 publications

(1 citation statement)

References 26 publications


“…Performance of these models for specialized healthcare-related tasks like assigning ICD codes to clinical notes remains subpar. Saroush, et al (32) demonstrated that prompting GPT-3.5 and -4 via the ChatGPT interface by providing descriptions of the ICD-10 code predicted the correct ICD-10 codes only 10% (GPT-3.5) and 13% (GPT-4) of the time. Boyle, et al (28) observed similar results.…”

Section: Discussion (mentioning)

confidence: 99%

Fine-Tuning for Accuracy: Evaluation of GPT for Automatic Assignment of ICD Codes to Clinical Documentation

Nawab, Fernbach, Atreya et al. 2024

Preprint


Background: Assignment of International Classification of Diseases (ICD) codes to clinical documentation is a tedious but important task that is mostly done manually. This study evaluated OpenAI’s widely used Generative Pre-trained Transformer (GPT) 3.5 Turbo model for automating the assignment of ICD codes to clinical notes. Methods: We identified the 10 most prevalent ICD-10 codes in the Medical Information Mart for Intensive Care (MIMIC-IV) dataset. We selected 200 notes for each code and randomly split them into two equal groups of 100 for training and testing. We then passed each note to GPT 3.5 Turbo via OpenAI’s API, prompting the model to assign ICD-10 codes to the note, and checked the model’s response for the presence of the target ICD-10 code. After fine-tuning the GPT model on the training data, we repeated the process with the test data and compared the fine-tuned model’s performance against the default model. Results: With the default GPT 3.5 Turbo model, the target ICD-10 code was present among the assigned codes in 29.7% of cases. After fine-tuning with 100 notes for each top code, accuracy improved to 62.6%. Conclusions: Historically, GPT’s performance on healthcare-related tasks has been suboptimal. Fine-tuning, as in this study, shows considerable potential for improved performance and highlights a path toward integrating artificial intelligence (AI) into healthcare to improve the efficiency and accuracy of this administrative task. Future research should focus on expanding the training datasets with specialized data and on exploring the integration of these models into existing healthcare systems to maximize their utility and reliability.
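The evaluation criterion described in the Methods (whether the target ICD-10 code appears among the codes the model assigned) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `extract_codes` and `hit_rate`, the simplified regular expression, and the choice to compare codes with dots removed are all assumptions made for the example.

```python
import re

# Simplified, assumed pattern for ICD-10-style codes such as "I10", "E11.9" or "E119".
# Real ICD-10 validation is stricter; this only approximates the code shape.
ICD10_PATTERN = re.compile(r"\b[A-Z][0-9]{2}(?:\.?[0-9A-Z]{1,4})?\b")


def extract_codes(response: str) -> set[str]:
    """Return the ICD-10-like codes found in a model response, with dots removed."""
    return {match.replace(".", "") for match in ICD10_PATTERN.findall(response)}


def hit_rate(responses: list[str], targets: list[str]) -> float:
    """Fraction of responses whose extracted codes include the target code."""
    if len(responses) != len(targets):
        raise ValueError("responses and targets must have the same length")
    hits = sum(
        target.replace(".", "") in extract_codes(response)
        for response, target in zip(responses, targets)
    )
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical model outputs and target codes, for illustration only.
    demo_responses = ["Codes: I10, E11.9", "Z79.4 only", "no codes"]
    demo_targets = ["E119", "I10", "I10"]
    print(f"hit rate: {hit_rate(demo_responses, demo_targets):.3f}")
```

Under this criterion a response counts as correct even if it also lists other codes, so the reported percentages measure the presence of the target code rather than exact-match coding.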

Coding Historical Causes of Death Data with Large Language Models

Pedersen, Islam, Kristoffersen et al. 2024

Lecture Notes in Computer Science


This paper investigates the feasibility of using pre-trained generative Large Language Models (LLMs) to automate the assignment of ICD-10 codes to historical causes of death. Because of the complex narratives often found in historical causes of death, this task has traditionally been performed manually by coding experts. We evaluate the ability of GPT-3.5, GPT-4, and Llama 2 to accurately assign ICD-10 codes on the HiCaD dataset, which contains causes of death recorded in the civil death register entries of 19,361 individuals from Ipswich, Kilmarnock, and the Isle of Skye in the UK between 1861 and 1901. Our findings show that GPT-3.5, GPT-4, and Llama 2 assign the correct code for 69%, 83%, and 40% of causes, respectively. However, standard machine learning techniques achieve a maximum accuracy of 89%. All LLMs performed better on causes of death that contained terms still in use today than on archaic terms. They also performed better on short causes (1–2 words) than on longer causes. LLMs therefore do not currently perform well enough for historical ICD-10 code assignment tasks. We suggest further fine-tuning or alternative frameworks to achieve adequate performance.


Large language models for structured reporting in radiology: past, present, and future

Busch, Hoffmann, dos Santos et al. 2024

Eur Radiol


Structured reporting (SR) has long been a goal in radiology to standardize and improve the quality of radiology reports. Despite evidence that SR reduces errors, enhances comprehensiveness, and increases adherence to guidelines, its widespread adoption has been limited. Recently, large language models (LLMs) have emerged as a promising solution to automate and facilitate SR. Therefore, this narrative review aims to provide an overview of LLMs for SR in radiology and beyond. We found that the current literature on LLMs for SR is limited, comprising ten studies on the generative pre-trained transformer (GPT)-3.5 (n = 5) and/or GPT-4 (n = 8), while two studies additionally examined the performance of Perplexity and Bing Chat or IT5. All studies reported promising results and acknowledged the potential of LLMs for SR, with six out of ten studies demonstrating the feasibility of multilingual applications. Building upon these findings, we discuss limitations, regulatory challenges, and further applications of LLMs in radiology report processing, encompassing four main areas: documentation, translation and summarization, clinical evaluation, and data mining. In conclusion, this review underscores the transformative potential of LLMs to improve efficiency and accuracy in SR and radiology report processing. Key Points. Question: How can LLMs help make SR in radiology more ubiquitous? Findings: Current literature leveraging LLMs for SR is sparse but shows promising results, including the feasibility of multilingual applications. Clinical relevance: LLMs have the potential to transform radiology report processing and enable the widespread adoption of SR. However, their future role in clinical practice depends on overcoming current limitations and regulatory challenges, including opaque algorithms and training data.
