Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.303

Diagnosing the First-Order Logical Reasoning Ability Through LogicNLI

Abstract: Recently, language models (LMs) have achieved strong performance on many NLU tasks, which has spurred widespread interest in their possible applications in scientific and social domains. However, LMs have faced much criticism over whether they are truly capable of reasoning in NLU. In this work, we propose a diagnostic method for first-order logic (FOL) reasoning with a newly proposed benchmark, LogicNLI. LogicNLI is an NLI-style dataset that effectively disentangles the target FOL reasoning from commonsens…

Cited by 12 publications (13 citation statements) · References 35 publications
“…A great proportion of NLP tasks require logical reasoning. Prior work contextualizes the problem of logical reasoning by proposing reasoning-dependent datasets and studying how to solve the tasks with neural models (Johnson et al., 2017; Sinha et al., 2019; Yu et al., 2020; Liu et al., 2020; Tian et al., 2021). However, most studies focus on solving a single task, and the datasets either are designed for a specific domain (Johnson et al., 2017; Sinha et al., 2019) or have confounding factors such as language variance (Yu et al., 2020); they cannot be used to strictly or comprehensively study the logical reasoning abilities of models.…”
Section: Related Work
confidence: 99%
“…These studies have found that datasets may contain many spurious artefacts and that the performance of PLMs is inflated by heavy reliance on those artefacts (Habernal et al., 2018; Niven and Kao, 2019; McCoy et al., 2019; Bender and Koller, 2020). Another line of work observed that many PLMs which showed promising results on GLUE fall short of expectations on more difficult tasks that require linguistic knowledge (Bhatt et al., 2021) or logical reasoning (Tian et al., 2021). As a result, the importance of well-designed, more difficult evaluation datasets has been highlighted, and new datasets such as CHECKLIST (Ribeiro et al., 2020) and LOGICNLI (Tian et al., 2021) have been proposed.…”
Section: Introduction
confidence: 99%
“…Another line of work observed that many PLMs which showed promising results on GLUE fall short of expectations on more difficult tasks that require linguistic knowledge (Bhatt et al., 2021) or logical reasoning (Tian et al., 2021). As a result, the importance of well-designed, more difficult evaluation datasets has been highlighted, and new datasets such as CHECKLIST (Ribeiro et al., 2020) and LOGICNLI (Tian et al., 2021) have been proposed. However, most of them support only specific languages like English, and building such higher-difficulty evaluation suites for other low-resource languages requires substantial effort.…”
Section: Introduction
confidence: 99%
“…It is also known that neural LMs have difficulty understanding argument order (Kassner et al., 2020), which is arguably a prerequisite for any logical reasoning. Clark et al. (2020) and Tian et al. (2021) showed that RoBERTa (Liu et al., 2019), in contrast to BERT, performs well at encoding instructional texts that involve conditionals. Good performance on conditionals in LMs is surprising, since humans typically find reasoning about conditionals challenging because it requires accommodating degrees of belief (Politzer, 2007).…”
Section: Introduction
confidence: 99%
“…Finally, regarding universal quantification, which implicitly involves encoding a hidden conditional statement (e.g. ∀x. P(x) → Q(x)), BERT's performance has been shown to vary substantially (Kim et al., 2019b; Tian et al., 2021).…”
Section: Introduction
confidence: 99%
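The "hidden conditional" reading of universal quantification mentioned above can be made concrete with a minimal sketch. This is a hypothetical illustration, not code from the LogicNLI paper: over a finite domain, ∀x. P(x) → Q(x) holds exactly when no entity satisfies P without also satisfying Q. The function name and the toy domain below are invented for the example.

```python
def satisfies_universal_rule(domain, P, Q):
    """Model-check the FOL statement ∀x. P(x) → Q(x) over a finite domain.

    The universal quantifier hides a conditional: for every entity x,
    either P(x) is false, or Q(x) must be true (material implication).
    """
    return all((not P(x)) or Q(x) for x in domain)


# Toy knowledge base: "everyone who is tired is sleepy".
domain = ["alice", "bob", "carol"]
tired = {"alice", "bob"}
sleepy = {"alice", "bob", "carol"}

# Every tired entity is also sleepy, so the rule holds.
print(satisfies_universal_rule(domain, tired.__contains__, sleepy.__contains__))  # prints True
```

This makes visible why the task is hard for an LM evaluated on text alone: judging the rule requires checking the implication for every entity, including the vacuous cases where P(x) is false, rather than matching surface patterns.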