Harvard Data Science Review 2024
DOI: 10.1162/99608f92.5317da47

How Is ChatGPT’s Behavior Changing Over Time?

Lingjiao Chen,
Matei Zaharia,
James Zou

Abstract: GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) U.S. Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 …
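
As a rough illustration of the comparison the abstract describes, the sketch below sends the same prompt to two dated snapshots of the same model service and prints both replies. It is not the paper's code: the snapshot names, prompt, and sampling settings are assumptions made for illustration.

```python
# Illustrative sketch (not the paper's code): query two dated snapshots of the
# same LLM service with an identical prompt to compare behavior over time.
# Snapshot names below are assumptions; substitute whatever dated versions
# your provider exposes. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
prompt = "Is 17077 a prime number? Think step by step and answer [Yes] or [No]."

# Assumed March 2023 vs. June 2023 snapshot identifiers.
for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    response = client.chat.completions.create(
        model=snapshot,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep sampling as deterministic as possible for the comparison
    )
    print(snapshot, "->", response.choices[0].message.content[:200])
```

Comparing the two transcripts side by side (answer, reasoning style, formatting) is the kind of longitudinal check the study performs at scale across its seven task categories.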

Cited by 76 publications (2 citation statements)
References 19 publications
“…Another important observation is that GPT-4o performs better than GPT-4 and significantly better than GPT-3.5, reflecting the advancements in natural language processing, contextual comprehension, and response generation capabilities in these successive model iterations. This supports previous research, which shows that the behavior of the 'same' LLM service can change substantially in a relatively short period via updates to the models [41]. We employ the following metrics: for "diabetes detection" and "high glucose detection," we use the F1-score; for the "glucose correlation" task, we utilize three types of correlations: Pearson, Spearman, and cross-correlation; and for the "age prediction" task, we use the Mean Absolute Error (MAE).…”
Section: LLMs Performance Evaluation (supporting)
confidence: 90%
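
The metrics named in this citing study (F1-score, Pearson/Spearman/cross-correlation, MAE) are standard; the minimal sketch below shows how they might be computed. The task names in the comments echo the quote, but all arrays here are synthetic placeholders, not data from the cited work.

```python
# Minimal sketch of the metrics named above, on synthetic data.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score, mean_absolute_error

rng = np.random.default_rng(0)

# Binary classification tasks (e.g., "diabetes detection"): F1-score.
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)
print("F1:", f1_score(y_true, y_pred))

# Continuous-signal task (e.g., "glucose correlation"): three correlation measures.
signal_true = rng.normal(size=100)
signal_pred = signal_true + rng.normal(scale=0.5, size=100)
print("Pearson:", pearsonr(signal_true, signal_pred)[0])
print("Spearman:", spearmanr(signal_true, signal_pred)[0])
# Normalized cross-correlation at zero lag.
xcorr = np.correlate(signal_true - signal_true.mean(),
                     signal_pred - signal_pred.mean(), mode="valid")[0]
xcorr /= np.std(signal_true) * np.std(signal_pred) * len(signal_true)
print("Cross-correlation (lag 0):", xcorr)

# Regression task (e.g., "age prediction"): mean absolute error.
age_true = rng.uniform(20, 80, size=100)
age_pred = age_true + rng.normal(scale=5, size=100)
print("MAE:", mean_absolute_error(age_true, age_pred))
```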
“…For example, the June 2023 version of GPT-4 performed significantly poorer on some mathematical tasks than its March 2023 version, whereas for GPT-3.5, it was the opposite. And "both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March" (Chen et al, 2024). Secondly, performance changes can be rather significant (e.g., over 30 percentage points) in such a short period.…”
Section: From ChatGPT Users to Generative AI Researchers (mentioning)
confidence: 99%