2012
DOI: 10.1080/09540091.2011.641939
Processing of nested and cross-serial dependencies: an automaton perspective on SRN behaviour

Abstract: Language processing involves the identification and establishment of both nested (stack-like) and cross-serial (queue-like) dependencies. This paper analyses the behaviour of simple recurrent networks (SRNs) trained to handle these types of dependency individually and simultaneously. We provide new converging evidence that SRNs store sequences in a fractal data structure similar to a binary expansion. We provide evidence that the process of recalling a stored string by an SRN depletes the stored data structure,…
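
To make the abstract's central claim concrete, here is a minimal sketch of the binary-expansion ("fractal") stack encoding it refers to. This is an assumed toy illustration in Python, not the authors' trained SRN; the names `push`/`pop` and the exact affine maps are expository choices.

```python
# Toy illustration (assumed, not the paper's SRN): a stack of bits stored as one
# number in [0, 1) via its binary expansion. Pushing contracts the state into a
# half of the interval; recalling (popping) expands it back, so reading out a
# stored string literally depletes the stored data structure, as the abstract says.

def push(state: float, bit: int) -> float:
    """Make `bit` the most significant digit of the expansion (contraction)."""
    return 0.5 * state + 0.5 * bit

def pop(state: float) -> tuple[int, float]:
    """Read the most recently pushed bit and return the depleted remainder."""
    bit = 1 if state >= 0.5 else 0
    return bit, 2.0 * state - bit          # expansion undoes the contraction

s = 0.0
for b in (1, 1, 0):                        # store the sequence 1, 1, 0
    s = push(s, b)

recalled = []
while s > 0:
    b, s = pop(s)                          # each recall step consumes the structure
    recalled.append(b)

print(recalled)                            # [0, 1, 1]: last-in, first-out, i.e. nested
                                           # (stack-like) rather than cross-serial order
```

Reading digits from the other end of the same expansion instead yields first-in, first-out order, the cross-serial (queue-like) case the paper also studies.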

Cited by 13 publications (11 citation statements) · References 24 publications
“…bitstrings can be predicted with perfect accuracy and cross-entropy, independent of the input length. Furthermore, infinite-precision RNNs and LSTMs can model stacks (Tabor, 2000; Grüning, 2006; Kirov and Frank, 2012) and thus are theoretically capable of modeling 2DYCK and other deterministic context-free languages perfectly. The results presented here thus theoretically confirm the intuition that models entirely built on self-attention may have restricted expressivity when compared to recurrent architectures (Tran et al., 2018; Dehghani et al., 2019; Shen et al., 2018a; Chen et al., 2018; Hao et al., 2019).…”
Section: Discussion
confidence: 99%
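
As a concrete (assumed) illustration of the quoted point that an unbounded-precision recurrent state can implement a stack, and hence handle 2DYCK, the sketch below checks two-bracket Dyck strings using a single rational number as the stack; `is_dyck2` and the base-4 digit encoding are choices made here, not constructions taken from the cited papers.

```python
# Assumed toy construction, not code from the cited papers: the stack of open
# brackets is one rational number; each push stores a base-4 digit (contraction),
# each pop reads and removes the most recent digit (expansion). Fraction stands
# in for the "infinite precision" the quoted passage assumes.
from fractions import Fraction

PUSH = {"(": 1, "[": 2}   # digit written when an opening bracket is pushed
POP  = {")": 1, "]": 2}   # digit that must be on top for a closing bracket

def is_dyck2(string: str) -> bool:
    stack = Fraction(0)                       # empty stack
    for ch in string:
        if ch in PUSH:
            stack = (stack + PUSH[ch]) / 4    # push: contract into a new top digit
        elif ch in POP:
            top = int(stack * 4)              # read the top digit
            if top != POP[ch]:
                return False                  # mismatched bracket or empty stack
            stack = stack * 4 - top           # pop: expand, discarding that digit
        else:
            return False                      # symbol outside the 2DYCK alphabet
    return stack == 0                         # accept iff every bracket was closed

print(is_dyck2("([()[]])"))   # True: properly nested
print(is_dyck2("([)]"))       # False: crossing brackets are not nested
```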
“…Recent NLP work has also found that neural networks do not readily transfer knowledge across tasks; e.g., pretrained models often perform worse than non-pretrained models (Wang et al., 2019). This lack of generalization across tasks might be due to the tendency of multi-task neural networks to create largely independent representations for different tasks even when a shared representation could be used (Kirov and Frank, 2012). Therefore, to make cross-phenomenon generalizations, neural networks may need to be given an explicit bias for sharing processing across phenomena.…”
Section: Will Models Generalize Across
confidence: 99%
“…This property amounts to a short‐shrifting of the encoding resources used for more deeply embedded causal states relative to less deeply embedded causal states: If there is noise in the encodings, the noise distorts deeper embeddings more than shallow ones. This short‐shrifting is plausibly related to the well‐known limited ability of humans to process deep center‐embeddings; see (Christiansen & Chater; Kirov & Frank, 2012). We would like, therefore, to objectively determine whether the system is exhibiting contraction for pushes and expansion for pops.…”
Section: Fractal Learning Neural Network
confidence: 97%
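
The contraction/expansion argument in this last excerpt can be checked with a small simulation. The setup below is an assumed toy model, not the cited fractal learning network: one dose of noise is added to a contracted binary-expansion stack, every pop doubles the residual noise, and recovery accuracy therefore drops for the most deeply embedded symbols, echoing the limited human ability to process deep centre-embeddings.

```python
# Assumed toy model (not the cited network): a bit stack stored as a contracted
# binary expansion is perturbed once, then popped back out. Each pop is an
# expansion that doubles the leftover noise, so the deepest (earliest-pushed)
# symbols are recovered least reliably.
import random

def push(state, bit):
    return 0.5 * state + 0.5 * bit             # contraction

def recall_accuracy(bits, noise=1e-3, trials=2000):
    """Per-position recovery rate, listed in push order (deepest first)."""
    correct = [0] * len(bits)
    for _ in range(trials):
        s = 0.0
        for b in bits:
            s = push(s, b)
        s += random.uniform(-noise, noise)      # one shot of encoding noise
        for k in range(len(bits)):              # pop in reverse push order
            b_hat = 1 if s >= 0.5 else 0
            s = 2.0 * s - b_hat                 # expansion: residual noise doubles
            idx = len(bits) - 1 - k             # position of the bit just popped
            correct[idx] += (b_hat == bits[idx])
    return [c / trials for c in correct]

random.seed(0)
bits = [random.randint(0, 1) for _ in range(12)]
print(recall_accuracy(bits))
# The tail of the list (shallow, recently pushed bits) stays near 1.0; the head
# (deeply embedded bits, popped only after many expansions) degrades toward chance.
```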