2024
DOI: 10.1109/tai.2023.3332837

A Culturally Sensitive Test to Evaluate Nuanced GPT Hallucination

Timothy R. McIntosh,
Tong Liu,
Teo Susnjak
et al.

Cited by 48 publications (35 citation statements). References 41 publications.
“…The evaluation of large language models has been rigorously conducted using a variety of benchmark datasets designed to test their capabilities across multiple dimensions [26,27]. Benchmarks like GLUE and SuperGLUE have set the standard for evaluating model performance on tasks such as text classification, sentiment analysis, and natural language inference, and they provide a comprehensive suite of tasks that collectively measure a model's ability to understand and generate human language accurately [26].…”
Section: Related Work
confidence: 99%
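
The excerpt above points to GLUE-style benchmark evaluation. As a concrete illustration, the sketch below scores an off-the-shelf sentiment classifier on the GLUE SST-2 validation split; it is a minimal sketch assuming the Hugging Face `datasets` and `transformers` packages and an illustrative model name, not the evaluation setup used in any of the cited works.

```python
# Minimal sketch: scoring a pretrained classifier on the GLUE SST-2 validation split.
# Assumes the Hugging Face `datasets` and `transformers` packages; the model name is
# illustrative, not one used by the works cited above.
from datasets import load_dataset
from transformers import pipeline

dataset = load_dataset("glue", "sst2", split="validation")
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")

correct = 0
for example in dataset:
    prediction = classifier(example["sentence"])[0]["label"]  # "POSITIVE" or "NEGATIVE"
    predicted_label = 1 if prediction == "POSITIVE" else 0    # map back to GLUE label ids
    correct += int(predicted_label == example["label"])

print(f"SST-2 validation accuracy: {correct / len(dataset):.3f}")
```

The same loop generalizes to other GLUE tasks by swapping the dataset configuration and the label mapping.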
“…Studies on the development of automated and scalable evaluation metrics for visual tasks have laid the groundwork for more reliable and comprehensive assessments of AI model performance in interpreting visual data. The creation of benchmark datasets and standardized testing environments aimed to mirror a range of real-world scenarios, thus providing a robust framework for measuring the accuracy and reliability of visual interpretations by AI systems [23], [24]. Innovative metrics designed to evaluate the qualitative aspects of visual outputs, such as the relevance and fidelity of generated images or the preciseness of image captions, marked a pivotal shift toward nuanced and context-aware evaluations [25], [26].…”
Section: Automated and Scalable Evaluation Metrics for Visual Tasks
confidence: 99%
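
As one concrete instance of the automated caption-quality metrics this excerpt describes, the sketch below scores a generated image caption against reference captions with BLEU, a standard n-gram overlap measure. It assumes NLTK is installed and uses invented captions; it is not the specific metrics proposed in [25], [26], which go further toward relevance- and fidelity-aware evaluation.

```python
# Minimal sketch: n-gram overlap scoring of a generated image caption against
# reference captions, in the spirit of automated caption-quality metrics.
# Uses NLTK's BLEU implementation; the captions are invented for illustration.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "a brown dog runs across the park".split(),
    "a dog is running on the grass".split(),
]
candidate = "a dog runs through the park".split()

smoothing = SmoothingFunction().method1  # avoid zero scores for short captions
score = sentence_bleu(references, candidate, smoothing_function=smoothing)
print(f"BLEU score: {score:.3f}")
```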
“…The challenges in maintaining context and providing accurate references in responses were highlighted, revealing a gap in performance on tasks requiring deep knowledge or technical expertise [8], [10]. Another research article demonstrated the tendency of LLMs to generate plausible but factually incorrect information, underscoring the need for enhanced reasoning capabilities [11], [12]. The issue of data recency and the models' ability to incorporate the latest information was also discussed, emphasizing the static nature of LLM training data [13]- [15].…”
Section: A. Limitations of LLMs in Information Retrieval and Reasoning
confidence: 99%
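
The tendency to produce plausible but factually incorrect answers noted in this excerpt is typically measured by comparing model outputs against gold references. The sketch below is a deliberately naive version of such a check, flagging responses in which a gold fact does not appear; the question/answer pairs are invented, and real hallucination evaluation (including the culturally sensitive test discussed in the cited paper) is considerably more nuanced.

```python
# Minimal sketch: a naive factuality check that flags answers whose gold fact
# does not appear in the model's response. The QA pairs and answers are invented
# for illustration; real hallucination evaluation is far more nuanced.
def contains_fact(response: str, gold_fact: str) -> bool:
    """Return True if the gold fact appears (case-insensitively) in the response."""
    return gold_fact.lower() in response.lower()

qa_pairs = [
    {"question": "Who wrote 'Pride and Prejudice'?", "gold_fact": "Jane Austen"},
    {"question": "In what year did Apollo 11 land on the Moon?", "gold_fact": "1969"},
]

# `model_answers` would come from the LLM under test; hard-coded here for the sketch.
model_answers = [
    "Pride and Prejudice was written by Jane Austen.",
    "Apollo 11 landed on the Moon in 1968.",  # factually wrong on purpose
]

for pair, answer in zip(qa_pairs, model_answers):
    status = "supported" if contains_fact(answer, pair["gold_fact"]) else "possible hallucination"
    print(f"{pair['question']} -> {status}")
```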