2020
DOI: 10.1162/tacl_a_00306
Theoretical Limitations of Self-Attention in Neural Sequence Models

Abstract: Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational …
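The soft/hard distinction in the abstract is easy to make concrete. Below is a minimal NumPy sketch (illustrative only, not the paper's formal construction): soft attention returns a softmax-weighted average over all positions, while hard attention commits to the single highest-scoring position. The paper's limitations apply to both variants.

```python
import numpy as np

def soft_attention(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Soft attention: a softmax-weighted average of all value vectors."""
    weights = np.exp(scores - scores.max())  # shift for numerical stability
    weights /= weights.sum()
    return weights @ values

def hard_attention(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Hard attention: attend only to the single highest-scoring position."""
    return values[int(np.argmax(scores))]

# Example: 4 positions with 3-dimensional value vectors.
rng = np.random.default_rng(0)
scores = rng.normal(size=4)
values = rng.normal(size=(4, 3))
print(soft_attention(scores, values))  # a blend of all four rows
print(hard_attention(scores, values))  # exactly one row of `values`
```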

Cited by 104 publications (109 citation statements: 4 supporting, 105 mentioning, 0 contrasting)
References 40 publications
“…Recently, it has been shown that Transformers are Turing-complete (Pérez et al., 2019; Bhattamishra et al., 2020) and are universal approximators of sequence-to-sequence functions given arbitrary precision (Yun et al., 2020). Hahn (2020) shows that Transformers cannot recognize the languages Parity and Dyck-2. However, these results only apply to very long words, and their applicability to practical-sized inputs is not clear (indeed, we will see different behavior for practical-sized input).…”
Section: Related Work (mentioning)
confidence: 99%
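To make the two languages in this statement concrete: Parity is the set of bit strings with an even number of 1s, and Dyck-2 is the set of well-nested strings over two bracket pairs. A minimal Python sketch of their membership tests (illustrative only; the theoretical question is whether fixed-depth self-attention can compute these tests, not how to write them):

```python
# Membership tests for the two languages named in the citation above.
# Parity needs counting mod 2 over the whole input; Dyck-2 needs a stack.

def is_parity(s: str) -> bool:
    """Parity: bit strings containing an even number of 1s."""
    return s.count("1") % 2 == 0

def is_dyck2(s: str) -> bool:
    """Dyck-2: well-nested strings over two bracket pairs, ( ) and [ ]."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack  # every opener must be closed

assert is_parity("1001") and not is_parity("10")
assert is_dyck2("([])[]") and not is_dyck2("([)]")
```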
“…Formal languages are abstract models of the syntax of programming and natural languages; they also relate to cognitive linguistics, e.g., Jäger and Rogers (2012); Hahn (2020) and references therein. Counter Languages.…”
Section: Formal Languages (mentioning)
confidence: 99%
“…Theoretical studies on language modeling have mostly targeted simple grammars from the Chomsky hierarchy. In particular, Hahn (2019) proves that Transformer networks suffer limitations in modeling regular periodic languages (such as a^n b^n) as well as hierarchical (context-free) structures, unless their depth or self-attention heads increase with the input length. On the other hand, Merrill (2019) proves that LSTM networks can recognize a subset of periodic languages.…”
Section: Related Work (mentioning)
confidence: 99%
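For concreteness, a^n b^n (n a's followed by exactly n b's) is the canonical example of a language recognizable with a single counter and no stack, the kind of mechanism Merrill (2019) connects to LSTMs. A minimal sketch (illustrative only, not a construction from either paper):

```python
# A one-counter recognizer for a^n b^n: increment on 'a', decrement on 'b',
# reject out-of-order symbols, and accept iff the counter ends at zero.

def is_anbn(s: str) -> bool:
    """a^n b^n for n >= 0: n a's followed by exactly n b's."""
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:       # an 'a' after a 'b' breaks the a*b* shape
                return False
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:    # more b's than a's so far
                return False
        else:
            return False     # symbol outside the {a, b} alphabet
    return count == 0        # counter back at zero => equal counts

assert is_anbn("") and is_anbn("aaabbb")
assert not is_anbn("aabbb") and not is_anbn("ba")
```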
“…However, clinical researchers are more interested in the potential limitations that may arise when attention mechanisms are applied, and in how these differ from conventional statistics, than in the details of how robust and sophisticated attention mechanisms are developed. A few studies have introduced the potential limitations of attention mechanisms [18, 19]. However, these studies have been theoretical, making it difficult for clinical researchers to understand and accept the results.…”
Section: Introduction (mentioning)
confidence: 99%