Exploring GPT-3 Model's Capability in Passing the Sally-Anne Test: A Preliminary Study in Two Languages

Dou, Zenan

doi:10.31219/osf.io/8r3ma

Cited by 4 publications

(4 citation statements)

References 13 publications


“…The dis-embodied cognition of GPT models could explain failures in recognizing faux pas, but they may also underlie their success on other tests. One example is the false belief test, one of the most widely used tools so far for testing the performance of LLMs on social cognitive tasks [19, 21-23, 25, 51, 52]. In this test, participants are presented with a story where a character's belief about the world (the location of the item) differs from the participant's own belief.…”

Section: Discussion (mentioning)

confidence: 99%

“…The recent rise of large language models (LLMs), such as generative pre-trained transformer (GPT) models, has shown some promise that artificial theory of mind may not be too distant an idea. Generative LLMs exhibit performance that is characteristic of sophisticated decision-making and reasoning abilities [19, 20] including solving tasks widely used to test theory of mind in humans [21-24]. However, the mixed success of these models [23], along with their vulnerability to small perturbations to the provided prompts, including simple changes in characters' perceptual access [25], raises concerns about the robustness and interpretability of the observed successes.…”

Section: Performance Across Theory Of Mind Tests (mentioning)

confidence: 99%


Testing theory of mind in large language models and humans

Strachan, Albergo, Borghini et al. 2024

Nat Hum Behav

At the core of what defines us as humans is the concept of theory of mind: the ability to track other people’s mental states. The recent development of large language models (LLMs) such as ChatGPT has led to intense debate about the possibility that these models exhibit behaviour that is indistinguishable from human behaviour in theory of mind tasks. Here we compare human and LLM performance on a comprehensive battery of measurements that aim to measure different theory of mind abilities, from understanding false beliefs to interpreting indirect requests and recognizing irony and faux pas. We tested two families of LLMs (GPT and LLaMA2) repeatedly against these measures and compared their performance with those from a sample of 1,907 human participants. Across the battery of theory of mind tests, we found that GPT-4 models performed at, or even sometimes above, human levels at identifying indirect requests, false beliefs and misdirection, but struggled with detecting faux pas. Faux pas, however, was the only test where LLaMA2 outperformed humans. Follow-up manipulations of the belief likelihood revealed that the superiority of LLaMA2 was illusory, possibly reflecting a bias towards attributing ignorance. By contrast, the poor performance of GPT originated from a hyperconservative approach towards committing to conclusions rather than from a genuine failure of inference. These findings not only demonstrate that LLMs exhibit behaviour that is consistent with the outputs of mentalistic inference in humans but also highlight the importance of systematic testing to ensure a non-superficial comparison between human and artificial intelligences.


“…It is used in developmental psychology to measure a person's social cognitive ability to attribute false beliefs to others. See here and [21]. Various versions of the problem can be defined based on whether the boxes are transparent or not.…”

Section: Reasoning (mentioning)

confidence: 99%

A Categorical Archive of ChatGPT Failures

Borji 2023

Preprint

Large language models have been demonstrated to be valuable in different fields. ChatGPT, developed by OpenAI, has been trained using massive amounts of data and simulates human conversation by comprehending context and generating appropriate responses. It has garnered significant attention due to its ability to effectively answer a broad range of human inquiries, with fluent and comprehensive answers surpassing prior public chatbots in both security and usefulness. However, a comprehensive analysis of ChatGPT’s failures is lacking, which is the focus of this study. Eleven categories of failures, including reasoning, factual errors, math, coding, and bias, are presented and discussed. The risks, limitations, and societal implications of ChatGPT are also highlighted. The goal of this study is to assist researchers and developers in enhancing future language models and chatbots. Please refer to here for the list of questions.


“…The recent rise of Large Language Models (LLMs), such as Generative Pre-trained Transformer (GPT) models, has shown some promise that AI Theory of Mind may not be too distant an idea. Generative LLMs exhibit a range of emergent capacities for sophisticated decision-making and reasoning abilities [2, 3] including solving tasks widely used to test Theory of Mind in humans [4-6].…”

Section: Introduction (mentioning)

confidence: 99%

Testing Theory of Mind in GPT Models and Humans

Strachan, Albergo, Borghini et al. 2023

Preprint

Interacting with other people involves reasoning about and prediction of others' mental states, or Theory of Mind. This capacity is a distinguishing feature of human cognition but recent advances in Large Language Models (LLMs) such as ChatGPT suggest that they may possess some emergent capacity for human-like Theory of Mind. Such claims merit a systematic approach to explore the limits of GPT models' emergent Theory of Mind capacity and compare it against humans. We show that while GPT models show impressive Theory of Mind-like capacity in controlled tests, there are key deviations from human performance that call into question how human-like this capacity is. Specifically, across a battery of Theory of Mind tests, we found that GPT models performed at human levels when recognising indirect requests, false beliefs, and higher-order mental states like misdirection, but were specifically impaired at recognising faux pas. Follow-up studies revealed that this was due to GPT's conservatism in drawing conclusions that humans took to be self-evident. Our results suggest that while GPT may demonstrate the competence for sophisticated mentalistic inference, its lack of embodiment within an action-oriented environment makes this capacity qualitatively different from human cognition.

