2023
DOI: 10.2139/ssrn.4441311
Re-Evaluating GPT-4's Bar Exam Performance

Cited by 6 publications (4 citation statements)
References 13 publications
“…In this article, the authors assess 2 of the world's most prominent large language models (LLMs) on the Self-Assessment Neurosurgery (SANS) Exams for the American Board of Neurological Surgery (ABNS) written exam and find that the LLMs perform better than the average human test taker. These findings are consistent with the performance of LLMs on other exams, 1a although GPT-4's superhuman results on the Bar Exam 2a have recently been called into question. The challenge of evaluating claims around these models, which themselves have not been reported in any scientific publications due to “competitive concerns,” raises many questions and underscores the importance of independent evaluations such as this project.…”
Section: Comments (supporting)
confidence: 72%
“…GPT-4, as per the latest research, demonstrates proficiency comparable to human performance across a variety of domains, including medicine (Nori et al, 2023), law (Martínez, 2023), and cognitive psychology (Dhingra et al, 2023). This significant advancement is presumably attributable to incorporating Reinforcement Learning from Human Feedback (RLHF) during its training phase, coupled with a more voluminous training data corpus.…”
Section: Discussion (mentioning)
confidence: 96%
“…However, using a percentile chart from a February 2018 exam administration (which is generally available online), ChatGPT would receive a score below the 10th percentile of test-takers, while GPT-4 would receive a combined score approaching the 90th percentile of test-takers. However, it should be noted that this chart might not be the best approach to the estimation given the skew towards 'retakers' in the February exam administration [78]. While we are not fully convinced of the methodological approach taken in some subsequent analysis [78], we do agree that it would be better to consider the raw 297 UBE as falling within a range between the 68th and 90th percentile (depending on the precise state and timing of the exam administration).…”
mentioning
confidence: 83%
“…However, it should be noted that this chart might not be the best approach to the estimation given the skew towards 'retakers' in the February exam administration [78]. While we are not fully convinced of the methodological approach taken in some subsequent analysis [78], we do agree that it would be better to consider the raw 297 UBE as falling within a range between the 68th and 90th percentile (depending on the precise state and timing of the exam administration). See table 8, electronic supplementary material, for additional information.…”
mentioning
confidence: 83%