2020
DOI: 10.1162/tacl_a_00306
Theoretical Limitations of Self-Attention in Neural Sequence Models

Abstract: Transformers are emerging as the new workhorse of NLP, showing great success across tasks. Unlike LSTMs, transformers process input sequences entirely through self-attention. Previous work has suggested that the computational capabilities of self-attention to process hierarchical structures are limited. In this work, we mathematically investigate the computational power of self-attention to model formal languages. Across both soft and hard attention, we show strong theoretical limitations of the computational …
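The soft/hard distinction in the abstract is easy to make concrete. Below is a minimal NumPy sketch (illustrative only, not the paper's formal construction): soft attention returns a softmax-weighted average over all positions, while hard attention commits to the single highest-scoring position. The paper's limitations apply to both variants.

```python
import numpy as np

def soft_attention(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Soft attention: a softmax-weighted average of all value vectors."""
    weights = np.exp(scores - scores.max())  # shift for numerical stability
    weights /= weights.sum()
    return weights @ values

def hard_attention(scores: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Hard attention: attend only to the single highest-scoring position."""
    return values[int(np.argmax(scores))]

# Example: 4 positions with 3-dimensional value vectors.
rng = np.random.default_rng(0)
scores = rng.normal(size=4)
values = rng.normal(size=(4, 3))
print(soft_attention(scores, values))  # a blend of all four rows
print(hard_attention(scores, values))  # exactly one row of `values`
```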

Cited by 104 publications (109 citation statements: 4 supporting, 105 mentioning, 0 contrasting)
References 40 publications
“…Recently, it has been shown that Transformers are Turing-complete (Pérez et al., 2019; Bhattamishra et al., 2020) and are universal approximators of sequence-to-sequence functions given arbitrary precision (Yun et al., 2020). Hahn (2020) shows that Transformers cannot recognize the languages Parity and Dyck-2. However, these results only apply to very long words, and their applicability to practical-sized inputs is not clear (indeed, we will see different behavior for practical-sized input).…”
Section: Related Work (mentioning)
confidence: 99%
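To make the two languages in this statement concrete: Parity is the set of bit strings with an even number of 1s, and Dyck-2 is the set of well-nested strings over two bracket pairs. A minimal Python sketch of their membership tests (illustrative only; the theoretical question is whether fixed-depth self-attention can compute these tests, not how to write them):

```python
# Membership tests for the two languages named in the citation above.
# Parity needs counting mod 2 over the whole input; Dyck-2 needs a stack.

def is_parity(s: str) -> bool:
    """Parity: bit strings containing an even number of 1s."""
    return s.count("1") % 2 == 0

def is_dyck2(s: str) -> bool:
    """Dyck-2: well-nested strings over two bracket pairs, ( ) and [ ]."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in s:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
        else:
            return False  # symbol outside the alphabet
    return not stack  # every opener must be closed

assert is_parity("1001") and not is_parity("10")
assert is_dyck2("([])[]") and not is_dyck2("([)]")
```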
“…Formal languages are abstract models of the syntax of programming and natural languages; they also relate to cognitive linguistics, e.g., Jäger and Rogers (2012); Hahn (2020) and references therein. Counter Languages.…”
Section: Formal Languages (mentioning)
confidence: 99%
“…Theoretical studies on language modeling have mostly targeted simple grammars from the Chomsky hierarchy. In particular, Hahn (2019) proves that Transformer networks suffer limitations in modeling regular periodic languages (such as a^n b^n) as well as hierarchical (context-free) structures, unless their depth or self-attention heads increase with the input length. On the other hand, Merrill (2019) proves that LSTM networks can recognize a subset of periodic languages.…”
Section: Related Work (mentioning)
confidence: 99%
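For concreteness, a^n b^n (n a's followed by exactly n b's) is the canonical example of a language recognizable with a single counter and no stack, the kind of mechanism Merrill (2019) connects to LSTMs. A minimal sketch (illustrative only, not a construction from either paper):

```python
# A one-counter recognizer for a^n b^n: increment on 'a', decrement on 'b',
# reject out-of-order symbols, and accept iff the counter ends at zero.

def is_anbn(s: str) -> bool:
    """a^n b^n for n >= 0: n a's followed by exactly n b's."""
    count = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:       # an 'a' after a 'b' breaks the a*b* shape
                return False
            count += 1
        elif ch == "b":
            seen_b = True
            count -= 1
            if count < 0:    # more b's than a's so far
                return False
        else:
            return False     # symbol outside the {a, b} alphabet
    return count == 0        # counter back at zero => equal counts

assert is_anbn("") and is_anbn("aaabbb")
assert not is_anbn("aabbb") and not is_anbn("ba")
```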
“…However, clinical researchers are more interested in the potential limitations that may arise when attention mechanisms are applied, and in how these differ from conventional statistics, than in the details of how robust and sophisticated attention mechanisms are developed. A few studies have introduced the potential limitations of attention mechanisms [18, 19]. However, these studies have been theoretical, making it difficult for clinical researchers to understand and accept the results.…”
Section: Introduction (mentioning)
confidence: 99%