Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-long.379
R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling

Abstract: Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. This paper proposes a recursive Transformer model based on differentiable CKY style binary trees to emulate the composition process. We extend the bidirectional language model pre-training objective to this architecture, a…

Cited by 11 publications (13 citation statements). References 24 publications.
“…Such remarks can also be made about the many probabilistic or non-probabilistic bottom-up, top-down, or left-corner parsing algorithms which have been studied over the years as models of sentence processing (Earley, 1970; Rosenkrantz and Lewis, 1970; Marcus, 1978; Abney and Johnson, 1991; Berwick and Weinberg, 1982; Roark, 2001; Nivre, 2008; Stabler, 2013; Graf et al., 2017). Likewise for transformer- or RNN-based parsing models (e.g., Costa, 2003; Jin and Schuler, 2020; Yang and Deng, 2020; Hu et al., 2021, 2022) or causal language models (Hochreiter and Schmidhuber, 1997; Radford et al., 2018, 2019; Dai et al., 2019; Brown et al., 2020). The amount of work required by these algorithms to integrate or predict the next word scales in quantities such as the size of the vocabulary and the length of the input, but never directly as a function of the probability of the next word.…”
Section: Algorithms That Do Not Scale In Surprisal (mentioning)
confidence: 99%
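
The surprisal this excerpt refers to is the standard information-theoretic quantity: the negative log conditional probability of the next word under the model,

    $s(w_t) = -\log P(w_t \mid w_1, \dots, w_{t-1}).$

The excerpt's point is that the per-word cost of these parsing and language-modeling algorithms does not grow with this quantity.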
“…Thirdly, it has a pretraining mechanism to improve representation performance. Since Fast-R2D2 (Hu et al., 2022) satisfies all the above conditions and also has good inference speed, we choose Fast-R2D2 as our backbone.…”
Section: Essential Properties Of Structured Language Models (mentioning)
confidence: 99%
“…Because it has been verified in prior work (Hu et al., 2022) that models could achieve better downstream performance and domain-adaptivity by training along with the self-supervised objective L_self(Φ), we design the final loss as follows:…”
Section: Training Objective (mentioning)
confidence: 99%
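
The quoted sentence breaks off before the formula itself. A minimal sketch of one plausible form, assuming a downstream task loss L_task(Φ) and a weighting hyperparameter λ (both names are placeholders, not notation taken from the cited paper), is

    $\mathcal{L}(\Phi) \;=\; \mathcal{L}_{\text{task}}(\Phi) \;+\; \lambda \, \mathcal{L}_{\text{self}}(\Phi).$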
“…g can be seen as a structure encoder. With linguistic structures, for example, it usually takes the form of a graph neural network that encodes the structure ẑ. TreeLSTMs (Tai et al., 2015) have been widely used (Maillard et al., 2019; Choi et al., 2018), including variants that use different composition functions (Hu et al., 2021). With the advent of graph convolutional networks (Kipf and Welling, 2017), more works have based themselves on this architecture and its variants (Corro and Titov, 2019b; Wu et al., 2021).…”
Section: Latent Structure Prediction (mentioning)
confidence: 99%
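
The composition function cited here can be illustrated with a binary TreeLSTM cell in the spirit of Tai et al. (2015). The sketch below is a minimal illustration assuming PyTorch, a single shared hidden size, and no word input at internal nodes; it is not the implementation used in any of the cited papers.

    # Minimal sketch of a binary TreeLSTM composition cell (Tai et al., 2015 style).
    # Assumptions: PyTorch, no token input at internal nodes, one shared hidden size.
    import torch
    import torch.nn as nn

    class BinaryTreeLSTMCell(nn.Module):
        def __init__(self, hidden_size: int):
            super().__init__()
            # One projection yields the input, output, and update gates plus
            # separate forget gates for the left and right children.
            self.proj = nn.Linear(2 * hidden_size, 5 * hidden_size)

        def forward(self, h_l, c_l, h_r, c_r):
            i, o, u, f_l, f_r = self.proj(torch.cat([h_l, h_r], dim=-1)).chunk(5, dim=-1)
            i, o, f_l, f_r = map(torch.sigmoid, (i, o, f_l, f_r))
            c = i * torch.tanh(u) + f_l * c_l + f_r * c_r  # compose child memories
            h = o * torch.tanh(c)                          # parent representation
            return h, c

    # Usage: compose two child constituents into their parent span representation.
    cell = BinaryTreeLSTMCell(hidden_size=8)
    h_l, c_l, h_r, c_r = (torch.randn(1, 8) for _ in range(4))
    h_parent, c_parent = cell(h_l, c_l, h_r, c_r)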
“…Choi et al. (2018) latently induced a constituency tree by greedily choosing a pair of neighboring text spans to merge at each layer, using straight-through Gumbel-softmax for sampling. Hu et al. (2021) used this method to design a heuristic pruning procedure that improves the O(n³) complexity of the CKY algorithm to O(n). This efficiency also allowed them to pretrain this model with a language modeling objective.…”
Section: Linguistic Structures (mentioning)
confidence: 99%
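
The greedy merge step described in this excerpt can be sketched with straight-through Gumbel-softmax. The toy version below is an assumption-laden illustration (placeholder scorer, additive composition standing in for a learned one such as the TreeLSTM sketched above), not the cited authors' code.

    # Minimal sketch of one greedy merge step with straight-through Gumbel-softmax,
    # in the spirit of Choi et al. (2018). The scorer and the additive composition
    # are placeholders, not details from the cited papers.
    import torch
    import torch.nn.functional as F

    def merge_step(spans: torch.Tensor, scorer: torch.nn.Module) -> torch.Tensor:
        """spans: (n, d) representations of adjacent text spans.
        Returns an (n - 1, d) sequence after merging one neighboring pair."""
        left, right = spans[:-1], spans[1:]                            # all adjacent pairs
        scores = scorer(torch.cat([left, right], dim=-1)).squeeze(-1)  # (n - 1,)
        # Hard (one-hot) choice in the forward pass, soft gradient in the backward pass.
        choice = F.gumbel_softmax(scores, tau=1.0, hard=True)          # (n - 1,)
        merged = (choice.unsqueeze(-1) * (left + right)).sum(0, keepdim=True)
        k = int(choice.argmax())                                       # index of the merged pair
        return torch.cat([spans[:k], merged, spans[k + 2:]], dim=0)

    # Usage: n spans reduce to a single root after n - 1 such merges.
    scorer = torch.nn.Linear(2 * 8, 1)
    spans = torch.randn(5, 8)
    spans = merge_step(spans, scorer)  # now 4 spans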