An Independent Evaluation of ChatGPT on Mathematical Word Problems (MWP)

Shakarian, Paulo; Koyyalamudi, Abhinav; Ngu, Noel; Mareedu, Lakshmivihari

doi:10.48550/arxiv.2302.13814

Cited by 17 publications

(14 citation statements)

References 14 publications


“…and reported small shifts (most below 5%) in ChatGPT's performance on some common benchmarks. Other papers (Shakarian et al., 2023) also reported shifts in specific problems. Monitoring model performance shifts is an emerging research area for machine-learning-as-a-service (MLaaS) more broadly.…”

Section: Related Work (mentioning)

confidence: 89%

How Is ChatGPT’s Behavior Changing Over Time?

Chen, Zaharia, Zou (2024)

Harvard Data Science Review


GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on several diverse tasks: 1) math problems, 2) sensitive/dangerous questions, 3) opinion surveys, 4) multi-hop knowledge-intensive questions, 5) generating code, 6) U.S. Medical License tests, and 7) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time. For example, GPT-4 (March 2023) was reasonable at identifying prime vs. composite numbers (84% accuracy) but GPT-4 (June 2023) was poor on these same questions (51% accuracy). This is partly explained by a drop in GPT-4's amenity to follow chain-of-thought prompting. Interestingly, GPT-3.5 was much better in June than in March in this task. GPT-4 became less willing to answer sensitive questions and opinion survey questions in June than in March. GPT-4 performed better at multi-hop questions in June than in March, while GPT-3.5's performance dropped on this task. Both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March. We provide evidence that GPT-4's ability to follow user instructions has decreased over time, which is one common factor behind the many behavior drifts. Overall, our findings show that the behavior of the 'same' LLM service can change substantially in a relatively short amount of time, highlighting the need for continuous monitoring of LLMs.


“…Moreover, an instruction of "approximating the decimal place" was not properly comprehended by ChatGPT during the Japanese-to-English translation. As such, calculation problems are reported as one of the areas where LLMs still exhibit relatively low accuracy [24], indicating that calculation problems may be a relatively unsuitable field for current ChatGPT.…”

Section: Discussion (mentioning)

confidence: 99%

Performance of Generative Pretrained Transformer on the National Medical Licensing Examination in Japan

Tanaka, Nakata, Aiga, et al. (2023)

Preprint


The remarkable performance of ChatGPT, launched in November 2022, has significantly impacted the field of natural language processing, inspiring the application of large language models as supportive tools in clinical practice and research worldwide. Although ChatGPT recently scored high on the United States Medical Licensing Examination, its performance on medical licensing examinations of other nations, especially non-English speaking nations, has not been sufficiently evaluated. This study assessed ChatGPT's performance on the National Medical Licensing Examination (NMLE) in Japan and compared it with the actual minimal passing rate for this exam. In particular, the performances of both the GPT-3.5 and GPT-4 models were considered for the comparative analysis. We initially used a model and prompt tuning set of 290 questions without image data from the previous 116th NMLE (held in February 2022) to maximize the performance for delivering correct answers and explanations of the questions. Thereafter, we tested the performance of the best ChatGPT model (GPT-4) with tuned prompts on a dataset of 262 questions without images from the latest 117th NMLE (held in February 2023). The best model with the tuned prompts scored 82.7% for the essential questions and 77.2% for the basic and clinical questions, both of which sufficed the minimum passing rates of 80.0% and 74.6%, respectively. Simultaneously, we identified the three major factors contributing to the generation of the incorrect answers: insufficient medical knowledge, information on Japan-specific medical system and guidelines, and mathematical errors. In conclusion, GPT-4 powered ChatGPT with our optimally tuned prompts achieved a minimum passing rate in the latest 117th NMLE in Japan. 
Although we express strong concerns regarding the use of the current ChatGPT for medical purposes so far, these artificial intelligence models may soon have the potential to serve as one of the best sidekicks for solving medical and healthcare problems.


“…A recent study by Pelton and Pelton (2023) and Shakarian et al (2023) investigated ChatGPT's performance in mathematics and supporting teacher education in mathematics. Their findings suggested that ChatGPT's performance is highly influenced by the requirement to show its work.…”

Section: Role Of Artificial Intelligence In Mathematics Problem-solving (mentioning)

confidence: 99%

Pre-service teachers and ChatGPT in multistrategy problem-solving: Implications for mathematics teaching in primary schools

Getenet (2024)

INT ELECT J MATH ED


This study compared the problem-solving abilities of ChatGPT and 58 pre-service teachers (PSTs) in solving a mathematical word problem using various strategies. PSTs were asked to solve a problem individually. Data was collected from PSTs’ submitted assignments, and their problem-solving strategies were analyzed. ChatGPT was also given the same problem to solve with various prompts, and the correctness of its solutions and problem-solving strategies were assessed alongside those of PSTs. The results indicated that PSTs used diverse strategies and achieved accurate solutions, but not always relevant strategies to children’s level of understanding. ChatGPT employed similar strategies to PSTs but mostly produced incorrect solutions, and its performance needed to be contextualized in the primary school context. The study highlights the potential of ChatGPT in mathematics teaching and informs teacher education programs about the possibility of using it in teaching problem-solving strategies.

