Commonsense Reasoning Benchmarks Many benchmarks measuring the commonsense reasoning abilities of state-of-the-art models have been released in recent years. Starting with the well-known Winograd Schema Challenge (WSC; Levesque et al., 2011), these benchmarks have attempted to test the commonsense reasoning ability of models using different task formats, such as pronoun resolution (Levesque et al., 2011; Rudinger et al., 2018; Eisenschlos et al., 2023), question answering (Talmor et al., 2019; Zellers et al., 2018; Chen et al., 2019; Reddy et al., 2019), plausible inference (Roemmele et al., 2011; Wang et al., 2019b; Singh et al., 2021; Gao et al., 2022), and natural language generation (Lin et al., 2020b). Benchmarks have also been created to evaluate commonsense reasoning across different dimensions of commonsense knowledge, including social (Rashkin et al., 2018a,b; Sap et al., 2019b), physical (Dalvi et al., 2018; Storks et al., 2021), temporal (Qin et al., 2021; Zhou et al., 2019), and numerical reasoning (Lin et al., 2020a).