Harvard Data Science Review 2024
DOI: 10.1162/99608f92.5317da47

How Is ChatGPT’s Behavior Changing Over Time?

Lingjiao Chen,
Matei Zaharia,
James Zou

Abstract: GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) U.S. Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 …
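
As a rough illustration of the comparison the abstract describes, the sketch below sends the same prompt to two dated snapshots of the same model service and prints both replies. It is not the paper's code: the snapshot names, prompt, and sampling settings are assumptions made for illustration.

```python
# Illustrative sketch (not the paper's code): query two dated snapshots of the
# same LLM service with an identical prompt to compare behavior over time.
# Snapshot names below are assumptions; substitute whatever dated versions
# your provider exposes. Requires OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
prompt = "Is 17077 a prime number? Think step by step and answer [Yes] or [No]."

# Assumed March 2023 vs. June 2023 snapshot identifiers.
for snapshot in ("gpt-4-0314", "gpt-4-0613"):
    response = client.chat.completions.create(
        model=snapshot,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep sampling as deterministic as possible for the comparison
    )
    print(snapshot, "->", response.choices[0].message.content[:200])
```

Comparing the two transcripts side by side (answer, reasoning style, formatting) is the kind of longitudinal check the study performs at scale across its seven task categories.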

Cited by 76 publications (2 citation statements)
References 19 publications
“…Another important observation is that GPT-4o performs better than GPT-4 and significantly better than GPT-3.5, reflecting the advancements in natural language processing, contextual comprehension, and response generation capabilities in these successive model iterations. This supports previous research, which shows that the behavior of the 'same' LLM service can change substantially in a relatively short period via updates to the models [41]. We employ the following metrics: for "diabetes detection" and "high glucose detection," we use the F1-score; for the "glucose correlation" task, we utilize three types of correlations: Pearson, Spearman, and cross-correlation; and for the "age prediction" task, we use the Mean Absolute Error (MAE).…”
Section: LLMs Performance Evaluation (supporting)
confidence: 90%
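
The metrics named in this citing study (F1-score, Pearson/Spearman/cross-correlation, MAE) are standard; the minimal sketch below shows how they might be computed. The task names in the comments echo the quote, but all arrays here are synthetic placeholders, not data from the cited work.

```python
# Minimal sketch of the metrics named above, on synthetic data.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import f1_score, mean_absolute_error

rng = np.random.default_rng(0)

# Binary classification tasks (e.g., "diabetes detection"): F1-score.
y_true = rng.integers(0, 2, size=100)
y_pred = rng.integers(0, 2, size=100)
print("F1:", f1_score(y_true, y_pred))

# Continuous-signal task (e.g., "glucose correlation"): three correlation measures.
signal_true = rng.normal(size=100)
signal_pred = signal_true + rng.normal(scale=0.5, size=100)
print("Pearson:", pearsonr(signal_true, signal_pred)[0])
print("Spearman:", spearmanr(signal_true, signal_pred)[0])
# Normalized cross-correlation at zero lag.
xcorr = np.correlate(signal_true - signal_true.mean(),
                     signal_pred - signal_pred.mean(), mode="valid")[0]
xcorr /= np.std(signal_true) * np.std(signal_pred) * len(signal_true)
print("Cross-correlation (lag 0):", xcorr)

# Regression task (e.g., "age prediction"): mean absolute error.
age_true = rng.uniform(20, 80, size=100)
age_pred = age_true + rng.normal(scale=5, size=100)
print("MAE:", mean_absolute_error(age_true, age_pred))
```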
“…For example, the June 2023 version of GPT-4 performed significantly poorer on some mathematical tasks than its March 2023 version, whereas for GPT-3.5, it was the opposite. And "both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March" (Chen et al, 2024). Secondly, performance changes can be rather significant (e.g., over 30 percentage points) in such a short period.…”
Section: From ChatGPT Users to Generative AI Researchers (mentioning)
confidence: 99%