Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP (2019)
DOI: 10.18653/v1/w19-2008

CODAH: An Adversarially-Authored Question Answering Dataset for Common Sense

Abstract: Commonsense reasoning is a critical AI capability, but it is difficult to construct challenging datasets that test common sense. Recent neural question answering systems, based on large pre-trained models of language, have already achieved near-human-level performance on commonsense knowledge benchmarks. These systems do not possess human-level common sense, but are able to exploit limitations of the datasets to achieve human-level scores. We introduce the CODAH dataset, an adversarially-constructed evaluation…

Cited by 27 publications (4 citation statements); references 10 publications.
“…Research progress has traditionally been driven by a cyclical process of resource collection and architectural improvements. Similar to Dynabench, recent work seeks to embrace this phenomenon, addressing many of the previously mentioned issues through an iterative human-and-model-in-the-loop annotation process (Yang et al., 2017; Dinan et al., 2019; Chen et al., 2019; Bartolo et al., 2020), to find "unknown unknowns" (Attenberg et al., 2015) or in a never-ending or life-long learning setting (Silver et al., 2013; Mitchell et al., 2018). The Adversarial NLI (ANLI) dataset, for example, was collected with an adversarial setting over multiple rounds to yield "a 'moving post' dynamic target for NLU systems, rather than a static benchmark that will eventually saturate".…”
Section: Adversarial Training and Testing
confidence: 99%
“…With the tremendous success and growing societal impact of DNNs, understanding and interpreting the behavior of DNNs has become an urgent necessity. In NLP, while DNNs are reported to have achieved human-level performance on many tasks, including QA (Chen et al., 2019), sentence-level RE, and NLI (Devlin et al., 2018), the decision rules found by feature attribution (FA) methods differ from those of humans in many cases. For example, in argument detection, the widely adopted language model BERT succeeds in finding the most correct arguments only by detecting the presence of "not" (Niven and Kao, 2019).…”
Section: Related Work
confidence: 99%
“…Commonsense Reasoning Benchmarks Many benchmarks measuring the commonsense reasoning abilities of state-of-the-art models have been released in recent years. Starting with the well-known Winograd Schema Challenge (WSC; Levesque et al., 2011), these benchmarks have attempted to test the commonsense reasoning ability of models using different task formats, such as pronoun resolution (Levesque et al., 2011; Rudinger et al., 2018; Eisenschlos et al., 2023), question answering (Talmor et al., 2019; Zellers et al., 2018; Chen et al., 2019; Reddy et al., 2019), plausible inference (Roemmele et al., 2011; Wang et al., 2019b; Singh et al., 2021; Gao et al., 2022) and natural language generation (Lin et al., 2020b). Benchmarks have also been created to evaluate commonsense reasoning across different dimensions of commonsense knowledge, including social (Rashkin et al., 2018a,b; Sap et al., 2019b), physical (Dalvi et al., 2018; Storks et al., 2021), temporal (Qin et al., 2021; Zhou et al., 2019) and numerical reasoning (Lin et al., 2020a).…”
Section: Related Work
confidence: 99%