Benchmarks for Automated Commonsense Reasoning: A Survey

Davis, Ernest

doi:10.48550/arxiv.2302.04752

Cited by 5 publications

(6 citation statements)

References 124 publications

“…Benchmark datasets are regularly used to both train and test machine commonsense reasoning systems, a good overview of which is provided by Davis [39]. He describes different methods that can be used to develop benchmark datasets and highlights some obstacles, including issues with data quality (incorrect or inconsistent values) and the reliance on 'single' gold standards (derived from humans who can be very inconsistent with each other) as a sole basis for performance measurement.…”

Section: Discussion (mentioning)

confidence: 99%

A noise audit of human-labeled benchmarks for machine commonsense reasoning

Kejriwal, Santos, Shen et al. 2024

Sci Rep

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing the predictions of a system against human labels (or a single ‘ground-truth’). However, much recent work in psychology has suggested that most tasks involving significant human judgment can have non-trivial degrees of noise. In his book, Kahneman suggests that noise may be a much more significant component of inaccuracy compared to bias, which has been studied more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two important experimental conditions: one in a smaller-scale but higher-quality labeling setting, and another in a larger-scale, more realistic online crowdsourced setting. Using Kahneman’s framework of noise, our results consistently show non-trivial amounts of level, pattern, and system noise, even in the higher-quality setting, with comparable results in the crowdsourced setting. We find that noise can significantly influence the performance estimates that we obtain of commonsense reasoning systems, even if the ‘system’ is a human; in some cases, by almost 10 percent. Labeling noise also affects performance estimates of systems like ChatGPT by more than 4 percent. Our results suggest that the default practice in the AI community of assuming and using a ‘single’ ground-truth, even on problems requiring seemingly straightforward human judgment, may warrant empirical and methodological re-visiting.

“…Benchmark datasets are regularly used to both train and test machine commonsense reasoning systems, a good overview of which is provided by Davis [30]. He describes different methods that can be used to develop benchmark datasets and highlights some obstacles, including issues with data quality (incorrect or inconsistent values) and the reliance on 'single' gold standards (derived from humans who can be very inconsistent with each other) as a sole basis for performance measurement.…”

Section: Discussion (mentioning)

confidence: 99%

A noise audit of human-labeled benchmarks for machine commonsense reasoning

Kejriwal, Santos, Shen et al. 2023

Preprint

With the advent of large language models, evaluating and benchmarking these systems on important AI problems has taken on newfound importance. Such benchmarking typically involves comparing the predictions of a system against human labels (or a single 'ground-truth'). However, much recent work in psychology has suggested that most tasks involving significant human judgment can have non-trivial degrees of noise. In his book, Kahneman suggests that noise may be a much more significant component of inaccuracy compared to bias, which has been studied more extensively in the AI community. This article proposes a detailed noise audit of human-labeled benchmarks in machine commonsense reasoning, an important current area of AI research. We conduct noise audits under two important experimental conditions: one in a smaller-scale but higher-quality labeling setting, and another in a larger-scale, more realistic online crowdsourced setting. Using Kahneman's framework of noise, our results consistently show non-trivial amounts of level (2.42 to 6.46 percent), pattern (22.22 to 23.40 percent) and system noise (22.32 to 24.23 percent) even in the higher-quality setting, with comparable results in the crowdsourced setting. We find that noise can significantly influence the performance estimates that we obtain of commonsense reasoning systems, even if the 'system' is a human; in some cases, by almost 10 percent. Labeling noise also affects performance estimates of systems like ChatGPT by more than 4 percent. Our results suggest that the default practice in the AI community of assuming and using a 'single' ground-truth, even on problems requiring seemingly straightforward human judgment, may warrant empirical and methodological re-visiting.

“…Storks et al. [2019] present a classification of benchmarks that evaluate commonsense reasoning for natural language understanding, organized by the type of task tested. A more recent survey by Davis [2023] lists 139 benchmarks: 102 for text, 18 for images, 12 for video, and 7 for physical environments.…”

Section: Winograd and the Evolution of Benchmarks (unclassified)

Avaliação do senso comum em modelos de linguagem através de benchmarks: Desafio de Winograd aplicado ao ChatGPT em português brasileiro

do Nascimento, Cortiz 2023

Anais Do XIV Simpósio Brasileiro De Tecnologia Da Informação E Da Linguagem Humana (STIL 2023)

Performance on benchmarks is presented as an effective way of evaluating the limits of language models' understanding. In this regard, the Winograd Schema Challenge, which aims to evaluate common sense through pronoun disambiguation tasks, has given rise to a variety of metrics and datasets. Applying a translation of the Winograd challenge to ChatGPT in Brazilian Portuguese, we found results comparable to those obtained in English. However, these data should be interpreted with caution, since there are biases associated with model training and gaps in the dimensions of reasoning covered by the available evaluation methods.
