Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) 2020
DOI: 10.18653/v1/2020.emnlp-main.576

On the Ability and Limitations of Transformers to Recognize Formal Languages

Abstract: Transformers have supplanted recurrent models in a large number of NLP tasks. However, the differences in their abilities to model different syntactic properties remain largely unknown. Past works suggest that LSTMs generalize very well on regular languages and have close connections with counter languages. In this work, we systematically study the ability of Transformers to model such languages as well as the role of its individual components in doing so. We first provide a construction of Transformers for a …

Cited by 32 publications (56 citation statements). References 25 publications.
“…In contrast to us, they studied models trained autoregressively only. Bhattamishra et al (2020) study how the autoregressive Transformer architecture learns a subset of formal languages, including the Dyck language and its generalisations. In contrast to our study, they examine Shuffle-Dyck languages, which allow constructions like "([)]", and they provide theoretical and experimental evidence that the Transformer is capable of learning such a language.…”
Section: Related Work
confidence: 99%
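The distinction drawn in this citation statement can be made concrete: Shuffle-Dyck relaxes Dyck's nesting requirement to one independent counter per bracket type, which is why "([)]" is valid in Shuffle-Dyck-2 but not in Dyck-2. The following is our own illustrative sketch (the function names and bracket alphabet are our choices, not taken from the cited papers):

```python
def in_shuffle_dyck(s, pairs=(("(", ")"), ("[", "]"))):
    """Shuffle-Dyck-k membership: one independent counter per bracket
    type, since the language is a shuffle of k Dyck-1 languages and
    different bracket types need not nest properly."""
    counts = {open_b: 0 for open_b, _ in pairs}
    close_to_open = {close_b: open_b for open_b, close_b in pairs}
    for ch in s:
        if ch in counts:
            counts[ch] += 1
        elif ch in close_to_open:
            counts[close_to_open[ch]] -= 1
            if counts[close_to_open[ch]] < 0:
                return False  # a closer with no matching opener so far
        else:
            return False  # symbol outside the bracket alphabet
    return all(c == 0 for c in counts.values())


def in_dyck(s, pairs=(("(", ")"), ("[", "]"))):
    """Dyck-k membership: the classic stack check, which additionally
    enforces proper nesting across bracket types."""
    close_to_open = {c: o for o, c in pairs}
    opens = set(close_to_open.values())
    stack = []
    for ch in s:
        if ch in opens:
            stack.append(ch)
        elif ch in close_to_open:
            if not stack or stack.pop() != close_to_open[ch]:
                return False
        else:
            return False
    return not stack
```

With these checkers, `in_shuffle_dyck("([)]")` holds while `in_dyck("([)]")` does not, matching the example quoted above.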
“…Note that this is a relatively stringent metric, as a prediction counts as correct only when the model's output is right at every step, as opposed to standard classification tasks. Refer to Bhattamishra et al (2020) for a discussion of the choice of the character prediction task and its relation to other tasks such as standard classification and language modeling. Details of the dataset and parameters relevant for reproducibility can be found in section C in the Appendix.…”
Section: Expressiveness Results
confidence: 99%
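To illustrate why this per-step metric is stringent, consider how the gold labels for the character prediction task could be generated for Dyck-1: at every prefix position the model must output the full set of legal next characters, and a single wrong step invalidates the whole sequence. This is our own sketch of such label generation (the exact task setup is specified in the cited paper):

```python
def dyck1_next_char_labels(s):
    """For each prefix of a valid Dyck-1 prefix string `s`, return the
    set of characters that may legally come next: '(' is always legal,
    and ')' is legal only when some bracket is currently open."""
    labels = []
    depth = 0
    for ch in s:
        depth += 1 if ch == "(" else -1  # assumes input over {'(', ')'}
        legal = {"("}
        if depth > 0:
            legal.add(")")
        labels.append(legal)
    return labels
```

For example, `dyck1_next_char_labels("(())")` yields `[{'(', ')'}, {'(', ')'}, {'(', ')'}, {'('}]`; a model is scored correct on the string only if it reproduces every one of these sets.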
“…Hahn (2020) proves that (even with positional encodings) hard-attention Transformers cannot model Dyck-k, and soft-attention Transformers with bounded Lipschitz continuity cannot model Dyck-k with perfect cross entropy. Bhattamishra et al (2020a) prove that a soft-attention network with positional masking (but no positional encodings) can solve Dyck-1 but not Dyck-2. Despite the expressivity issues theoretically posed by the above work, empirical findings have shown that Transformers can learn Dyck-k from finite samples and outperform LSTMs (Ebrahimi et al, 2020).…”
Section: Related Work
confidence: 89%
“…In particular, it was recently shown that self-attention networks cannot process various kinds of formal languages (Hahn, 2020; Bhattamishra et al, 2020a), among which particularly notable is Dyck-k, the language of well-balanced brackets of k types. By the Chomsky-Schützenberger Theorem (Chomsky and Schützenberger, 1959), any context-free language can be obtained from a Dyck-k language through intersections with regular languages and homomorphisms.…”
Section: Introduction
confidence: 99%