Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, 2021
DOI: 10.18653/v1/2021.acl-long.379
R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling

Abstract: Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. This paper proposes a recursive Transformer model based on differentiable CKY style binary trees to emulate the composition process. We extend the bidirectional language model pre-training objective to this architecture, a…

Cited by 11 publications (13 citation statements). References 24 publications.
“…Such remarks can also be made about the many probabilistic or non-probabilistic bottom-up, top-down, or left-corner parsing algorithms which have been studied over the years as models of sentence processing (Earley, 1970; Rosenkrantz and Lewis, 1970; Marcus, 1978; Abney and Johnson, 1991; Berwick and Weinberg, 1982; Roark, 2001; Nivre, 2008; Stabler, 2013; Graf et al., 2017). Likewise for transformer- or RNN-based parsing models (e.g., Costa, 2003; Jin and Schuler, 2020; Yang and Deng, 2020; Hu et al., 2021, 2022) or causal language models (Hochreiter and Schmidhuber, 1997; Radford et al., 2018, 2019; Dai et al., 2019; Brown et al., 2020). The amount of work required by these algorithms to integrate or predict the next word scales in quantities such as the size of the vocabulary and the length of the input, but never directly as a function of the probability of the next word.…”
Section: Algorithms That Do Not Scale In Surprisal (mentioning)
confidence: 99%
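
The surprisal this excerpt refers to is the standard information-theoretic quantity: the negative log conditional probability of the next word under the model,

    $s(w_t) = -\log P(w_t \mid w_1, \dots, w_{t-1}).$

The excerpt's point is that the per-word cost of these parsing and language-modeling algorithms does not grow with this quantity.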
“…Thirdly, it has a pretraining mechanism to improve representation performance. Since Fast-R2D2 (Hu et al., 2022) satisfies all the above conditions and also has good inference speed, we choose Fast-R2D2 as our backbone.…”
Section: Essential Properties Of Structured Language Models (mentioning)
confidence: 99%
“…Because it has been verified in prior work (Hu et al., 2022) that models could achieve better downstream performance and domain-adaptivity by training along with the self-supervised objective L_self(Φ), we design the final loss as follows:…”
Section: Training Objective (mentioning)
confidence: 99%
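
The quoted sentence breaks off before the formula itself. A minimal sketch of one plausible form, assuming a downstream task loss L_task(Φ) and a weighting hyperparameter λ (both names are placeholders, not notation taken from the cited paper), is

    $\mathcal{L}(\Phi) \;=\; \mathcal{L}_{\text{task}}(\Phi) \;+\; \lambda \, \mathcal{L}_{\text{self}}(\Phi).$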
“…g can be seen as a structure encoder. With linguistic structures, for example, it usually takes the form of a graph neural network that encodes the structure ẑ. TreeLSTMs (Tai et al., 2015) have been widely used (Maillard et al., 2019; Choi et al., 2018), including variants that use different composition functions (Hu et al., 2021). With the advent of graph convolutional networks (Kipf and Welling, 2017), more works have based themselves on this architecture and its variants (Corro and Titov, 2019b; Wu et al., 2021).…”
Section: Latent Structure Prediction (mentioning)
confidence: 99%
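
The composition function cited here can be illustrated with a binary TreeLSTM cell in the spirit of Tai et al. (2015). The sketch below is a minimal illustration assuming PyTorch, a single shared hidden size, and no word input at internal nodes; it is not the implementation used in any of the cited papers.

    # Minimal sketch of a binary TreeLSTM composition cell (Tai et al., 2015 style).
    # Assumptions: PyTorch, no token input at internal nodes, one shared hidden size.
    import torch
    import torch.nn as nn

    class BinaryTreeLSTMCell(nn.Module):
        def __init__(self, hidden_size: int):
            super().__init__()
            # One projection yields the input, output, and update gates plus
            # separate forget gates for the left and right children.
            self.proj = nn.Linear(2 * hidden_size, 5 * hidden_size)

        def forward(self, h_l, c_l, h_r, c_r):
            i, o, u, f_l, f_r = self.proj(torch.cat([h_l, h_r], dim=-1)).chunk(5, dim=-1)
            i, o, f_l, f_r = map(torch.sigmoid, (i, o, f_l, f_r))
            c = i * torch.tanh(u) + f_l * c_l + f_r * c_r  # compose child memories
            h = o * torch.tanh(c)                          # parent representation
            return h, c

    # Usage: compose two child constituents into their parent span representation.
    cell = BinaryTreeLSTMCell(hidden_size=8)
    h_l, c_l, h_r, c_r = (torch.randn(1, 8) for _ in range(4))
    h_parent, c_parent = cell(h_l, c_l, h_r, c_r)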
“…Choi et al. (2018) latently induced a constituency tree by greedily choosing a pair of neighboring text spans to merge at each layer, using straight-through Gumbel-softmax for sampling. Hu et al. (2021) used this method to design a heuristic pruning procedure that improves the O(n³) complexity of the CKY algorithm to O(n). This efficiency also allowed them to pretrain this model with a language modeling objective.…”
Section: Linguistic Structures (mentioning)
confidence: 99%
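
The greedy merge step described in this excerpt can be sketched with straight-through Gumbel-softmax. The toy version below is an assumption-laden illustration (placeholder scorer, additive composition standing in for a learned one such as the TreeLSTM sketched above), not the cited authors' code.

    # Minimal sketch of one greedy merge step with straight-through Gumbel-softmax,
    # in the spirit of Choi et al. (2018). The scorer and the additive composition
    # are placeholders, not details from the cited papers.
    import torch
    import torch.nn.functional as F

    def merge_step(spans: torch.Tensor, scorer: torch.nn.Module) -> torch.Tensor:
        """spans: (n, d) representations of adjacent text spans.
        Returns an (n - 1, d) sequence after merging one neighboring pair."""
        left, right = spans[:-1], spans[1:]                            # all adjacent pairs
        scores = scorer(torch.cat([left, right], dim=-1)).squeeze(-1)  # (n - 1,)
        # Hard (one-hot) choice in the forward pass, soft gradient in the backward pass.
        choice = F.gumbel_softmax(scores, tau=1.0, hard=True)          # (n - 1,)
        merged = (choice.unsqueeze(-1) * (left + right)).sum(0, keepdim=True)
        k = int(choice.argmax())                                       # index of the merged pair
        return torch.cat([spans[:k], merged, spans[k + 2:]], dim=0)

    # Usage: n spans reduce to a single root after n - 1 such merges.
    scorer = torch.nn.Linear(2 * 8, 1)
    spans = torch.randn(5, 8)
    spans = merge_step(spans, scorer)  # now 4 spans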