On the Ability and Limitations of Transformers to Recognize Formal Languages

Bhattamishra, Satwik; Ahuja, Kabir; Goyal, Navin

doi:10.18653/v1/2020.emnlp-main.576

Cited by 32 publications

(56 citation statements)

References 25 publications

“…In contrast to us, they studied models trained autoregressively only. Bhattamishra et al (2020) studies how autoregressive Transformer architecture learns a subset of formal languages, including Dyck language and its generalisations. In contrast to our study, they examine Shuffle-Dyck languages, which allows constructions like "([)]" and provide theoretical and experimental evidence that the Transformer is capable of learning such a language.…”

Section: Related Work (mentioning)

confidence: 99%

Can the Transformer Learn Nested Recursion with Symbol Masking?

Bernardy

Maraev

2021

Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

We investigate if, given a simple symbol masking strategy, self-attention models are capable of learning nested structures and generalise over their depth. We do so in the simplest setting possible, namely languages consisting of nested parentheses of several kinds. We use encoder-only models, which we train to predict randomly masked symbols, in a BERT-like fashion. We find that the accuracy is well above random baseline, with accuracy consistently above 50% both when increasing nesting depth and distances between training and testing. However, we find that the predictions made correspond to a simple parenthesis counting strategy, rather than a push-down automaton. This suggests that self-attention models are not suitable for tasks which require generalisation to more complex instances of recursive structures than those found in the training set.
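The difference between a per-type counting strategy and a push-down (stack) strategy can be illustrated with a minimal sketch. The function names and the two-bracket alphabet are assumptions made for illustration; they are not taken from any of the cited papers. The counter-based check accepts exactly the Shuffle-Dyck strings (each bracket type balanced independently), so it accepts "([)]", while the stack-based check enforces proper nesting and rejects it.

```python
# Illustrative only: two bracket types, names are hypothetical.
PAIRS = {")": "(", "]": "["}
OPENERS = set(PAIRS.values())


def is_dyck(s):
    """Stack-based (push-down) check: brackets must be properly nested."""
    stack = []
    for ch in s:
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack


def is_shuffle_dyck(s):
    """Counter-based check: each bracket type is balanced on its own."""
    counts = {opener: 0 for opener in OPENERS}
    for ch in s:
        if ch in OPENERS:
            counts[ch] += 1
        elif ch in PAIRS:
            counts[PAIRS[ch]] -= 1
            # A closer may never outnumber its own opener so far.
            if counts[PAIRS[ch]] < 0:
                return False
        else:
            return False  # symbol outside the alphabet
    return all(c == 0 for c in counts.values())
```

Under these assumptions, `is_dyck("([)]")` is false while `is_shuffle_dyck("([)]")` is true; both functions accept "([])".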

“…Note that, this is a relatively stringent metric as a correct prediction is obtained only when the model's output is correct at every step as opposed to standard classification tasks. Refer to Bhattamishra et al (2020) for a discussion on the choice of character prediction task and its relation with other tasks such as standard classification and language modeling. Details of the dataset and parameters relevant for reproducibility can be found in section C in Appendix.…”

Section: Expressiveness Results (mentioning)

confidence: 99%

On the Practical Ability of Recurrent Neural Networks to Recognize Hierarchical Languages

Bhattamishra

Ahuja

Goyal

2020

Proceedings of the 28th International Conference on Computational Linguistics

Self Cite

While recurrent models have been effective in NLP tasks, their performance on context-free languages (CFLs) has been found to be quite weak. Given that CFLs are believed to capture important phenomena such as hierarchical structure in natural languages, this discrepancy in performance calls for an explanation. We study the performance of recurrent models on Dyck-n languages, a particularly important and well-studied class of CFLs. We find that while recurrent models generalize nearly perfectly if the lengths of the training and test strings are from the same range, they perform poorly if the test strings are longer. At the same time, we observe that recurrent models are expressive enough to recognize Dyck words of arbitrary lengths in finite precision if their depths are bounded. Hence, we evaluate our models on samples generated from Dyck languages with bounded depth and find that they are indeed able to generalize to much higher lengths. Since natural language datasets have nested dependencies of bounded depth, this may help explain why they perform well in modeling hierarchical dependencies in natural language data despite prior works indicating poor generalization performance on Dyck languages. We perform probing studies to support our results and provide comparisons with Transformers.

“…Hahn (2020) proves that (even with positional encodings) hard-attention Transformers cannot model Dyck-k, and soft-attention Transformers with bounded Lipschitz continuity cannot model Dyck-k with perfect cross entropy. Bhattamishra et al (2020a) prove a soft-attention network with positional masking (but no positional encodings) can solve Dyck-1 but not Dyck-2. Despite the expressivity issues theoretically posed by the above work, empirical findings have shown Transformers can learn Dyck-k from finite samples and outperform LSTM (Ebrahimi et al, 2020).…”

Section: Related Work (mentioning)

confidence: 89%

“…In particular, it was recently shown that self-attention networks cannot process various kinds of formal languages (Hahn, 2020; Bhattamishra et al, 2020a), among which particularly notable is Dyck-k, the language of well-balanced brackets of k types. By the Chomsky-Schützenberger Theorem (Chomsky and Schützenberger, 1959), any context-free language can be obtained from a Dyck-k language through intersections with regular languages and homomorphisms.…”

Section: Introduction (mentioning)

confidence: 99%

Self-Attention Networks Can Process Bounded Hierarchical Languages

Yao

Peng

Papadimitriou

et al.

2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck-k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention networks can process Dyck-(k,D), the subset of Dyck-k with depth bounded by D, which arguably better captures the bounded hierarchical structure of natural language. Specifically, we construct a hard-attention network with D + 1 layers and O(log k) memory size (per token per layer) that recognizes Dyck-(k,D), and a soft-attention network with two layers and O(log k) memory size that generates Dyck-(k,D). Experiments show that self-attention networks trained on Dyck-(k,D) generalize to longer inputs with near-perfect accuracy, and also verify the theoretical memory advantage of self-attention networks over recurrent networks.
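As a rough illustration of the bounded-depth language described in this abstract, the sketch below checks membership with an ordinary stack capped at a maximum depth. It is a plain reference recognizer, not the self-attention construction from the paper; the function name, the `pairs` mapping, and the `max_depth` parameter are hypothetical names chosen for this example.

```python
def is_bounded_dyck(s, pairs, max_depth):
    """Return True if s is well nested over `pairs` with depth <= max_depth.

    `pairs` maps each closing bracket to its opening bracket,
    e.g. {")": "(", "]": "["} for two bracket types.
    """
    openers = set(pairs.values())
    stack = []
    for ch in s:
        if ch in openers:
            # Opening another bracket would exceed the depth bound.
            if len(stack) == max_depth:
                return False
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recently opened bracket.
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack
```

For example, with `pairs = {")": "(", "]": "["}`, the string "(([]))" has nesting depth 3, so it is rejected for `max_depth=2` and accepted for `max_depth=3`.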
