2023
DOI: 10.48550/arxiv.2302.04752
Preprint

Benchmarks for Automated Commonsense Reasoning: A Survey

Abstract: More than one hundred benchmarks have been developed to test the commonsense knowledge and commonsense reasoning abilities of artificial intelligence (AI) systems. However, these benchmarks are often flawed and many aspects of common sense remain untested. Consequently, we do not currently have any reliable way of measuring to what extent existing AI systems have achieved these abilities. This paper surveys the development and uses of AI commonsense benchmarks. We discuss the nature of common sense; the role of…

Cited by 5 publications (6 citation statements)
References 124 publications
“…Benchmark datasets are regularly used to both train and test machine commonsense reasoning systems, a good overview of which is provided by Davis [39]. He describes different methods that can be used to develop benchmark datasets and highlights some obstacles, including issues with data quality (incorrect or inconsistent values) and the reliance on 'single' gold standards (derived from humans who can be very inconsistent with each other) as a sole basis for performance measurement.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Benchmark datasets are regularly used to both train and test machine commonsense reasoning systems, a good overview of which is provided by Davis [30]. He describes different methods that can be used to develop benchmark datasets and highlights some obstacles, including issues with data quality (incorrect or inconsistent values) and the reliance on 'single' gold standards (derived from humans who can be very inconsistent with each other) as a sole basis for performance measurement.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…Storks et al. [2019] present a classification of benchmarks that evaluate commonsense reasoning for natural language understanding, organized by the type of task tested. A more recent survey by Davis [2023] lists 139 benchmarks: 102 text-based, 18 image-based, 12 video-based, and 7 set in physical environments.…”
Section: Winograd and the Evolution of Benchmarks
Citation type: unclassified