Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.303

Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI

Abstract: Recently, language models (LMs) have achieved strong performance on many NLU tasks, which has spurred widespread interest in their possible applications in scientific and social domains. However, LMs have faced much criticism over whether they are truly capable of reasoning in NLU. In this work, we propose a diagnostic method for first-order logic (FOL) reasoning with a newly proposed benchmark, LogicNLI. LogicNLI is an NLI-style dataset that effectively disentangles the target FOL reasoning from commonsens…

Cited by 12 publications (13 citation statements) · References 35 publications
“…A great proportion of NLP tasks require logical reasoning. Prior work contextualizes the problem of logical reasoning by proposing reasoning-dependent datasets and studying how to solve the tasks with neural models (Johnson et al., 2017; Sinha et al., 2019; Yu et al., 2020; Liu et al., 2020; Tian et al., 2021). However, most studies focus on solving a single task, and the datasets either are designed for a specific domain (Johnson et al., 2017; Sinha et al., 2019) or have confounding factors such as language variance (Yu et al., 2020); they cannot be used to strictly or comprehensively study the logical reasoning abilities of models.…”
Section: Related Work
confidence: 99%
“…These studies have found that datasets may contain many spurious artefacts and that the performance of PLMs is inflated by heavy reliance on those artefacts (Habernal et al., 2018; Niven and Kao, 2019; McCoy et al., 2019; Bender and Koller, 2020). Another line of work observed that many PLMs which showed promising results on GLUE fall short of expectations on more difficult tasks that require linguistic knowledge (Bhatt et al., 2021) or logical reasoning (Tian et al., 2021). As a result, the importance of well-designed, more difficult evaluation datasets has been highlighted, and new datasets such as CHECKLIST (Ribeiro et al., 2020) and LOGICNLI (Tian et al., 2021) have been proposed.…”
Section: Introduction
confidence: 99%
“…Another line of work observed that many PLMs which showed promising results on GLUE fall short of expectations on more difficult tasks that require linguistic knowledge (Bhatt et al., 2021) or logical reasoning (Tian et al., 2021). As a result, the importance of well-designed, more difficult evaluation datasets has been highlighted, and new datasets such as CHECKLIST (Ribeiro et al., 2020) and LOGICNLI (Tian et al., 2021) have been proposed. However, most of them support only specific languages like English, and building such higher-difficulty evaluation suites for other low-resource languages requires substantial effort.…”
Section: Introduction
confidence: 99%
“…It is also known that neural LMs have difficulty understanding argument order (Kassner et al., 2020), which is arguably a prerequisite for any logical reasoning. Clark et al. (2020) and Tian et al. (2021) showed that RoBERTa (Liu et al., 2019), in contrast to BERT, performs well at encoding instructional texts that involve conditionals. Good performance on conditionals in LMs is surprising, since humans typically find reasoning about conditionals challenging because it requires accommodating degrees of belief (Politzer, 2007).…”
Section: Introduction
confidence: 99%
“…Finally, regarding universal quantification, which implicitly involves encoding a hidden conditional statement (e.g. ∀x. P(x) → Q(x)), BERT's performance has been shown to vary substantially (Kim et al., 2019b; Tian et al., 2021).…”
Section: Introduction
confidence: 99%
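The "hidden conditional" reading of universal quantification mentioned above can be made concrete with a minimal sketch. This is a hypothetical illustration, not code from the LogicNLI paper: over a finite domain, ∀x. P(x) → Q(x) holds exactly when no entity satisfies P without also satisfying Q. The function name and the toy domain below are invented for the example.

```python
def satisfies_universal_rule(domain, P, Q):
    """Model-check the FOL statement ∀x. P(x) → Q(x) over a finite domain.

    The universal quantifier hides a conditional: for every entity x,
    either P(x) is false, or Q(x) must be true (material implication).
    """
    return all((not P(x)) or Q(x) for x in domain)


# Toy knowledge base: "everyone who is tired is sleepy".
domain = ["alice", "bob", "carol"]
tired = {"alice", "bob"}
sleepy = {"alice", "bob", "carol"}

# Every tired entity is also sleepy, so the rule holds.
print(satisfies_universal_rule(domain, tired.__contains__, sleepy.__contains__))  # prints True
```

This makes visible why the task is hard for an LM evaluated on text alone: judging the rule requires checking the implication for every entity, including the vacuous cases where P(x) is false, rather than matching surface patterns.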