Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.338

Context Analysis for Pre-trained Masked Language Models

Abstract: Pre-trained language models that learn contextualized word representations from a large unannotated corpus have become a standard component for many state-of-the-art NLP systems. Despite their successful applications in various downstream NLP tasks, the extent of contextual impact on the word representation has not been explored. In this paper, we present a detailed analysis of contextual impact in Transformer- and BiLSTM-based masked language models. We follow two different approaches to evaluate the impact of…
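As a rough illustration of the kind of probe the abstract describes, here is a minimal sketch (not the authors' code) of measuring how a masked LM's prediction changes as the visible context grows. The HuggingFace transformers library, the bert-base-uncased checkpoint, and the whole probe design are assumptions for illustration:

```python
# Minimal sketch: how much does surrounding context help a masked LM
# recover a word? Grow the context window and watch the probability.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def masked_prob(words, target_idx, window):
    """Probability of the target word when only `window` words of
    context are kept on each side of the masked position."""
    lo, hi = max(0, target_idx - window), min(len(words), target_idx + window + 1)
    ctx = words[lo:hi]
    ctx[target_idx - lo] = tokenizer.mask_token  # replace target with [MASK]
    inputs = tokenizer(" ".join(ctx), return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    # Assumes the target word is a single wordpiece in BERT's vocabulary.
    target_id = tokenizer.convert_tokens_to_ids(words[target_idx])
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return torch.softmax(logits, dim=-1)[target_id].item()

sent = "the quick brown fox jumps over the lazy dog".split()
for w in (1, 2, 4):  # widen the visible context
    print(w, masked_prob(sent, target_idx=4, window=w))
```

Plotting such probabilities against window size for many tokens gives one simple, hedged view of how far a model's usable context extends, in the spirit of the paper's comparison between Transformer- and BiLSTM-based models.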


Cited by 11 publications (10 citation statements)
References 30 publications
“…Sun et al. (2021) reveal that Longformer and Routing Transformers can only reduce the perplexity of LMs on a small set of tokens. More related to our work, Lai et al. (2020) show that BERT can make use of a larger scope of context than a BiLSTM.…”
Section: Benchmarks and Analysis (supporting)
confidence: 53%
“…show that Compressive Transformer improves the performance of infrequent tokens. Our work also relates to that of Lai et al. (2020), who investigate the impact of context for pretrained masked LMs. More recently, Press et al. (2020) also observe negligible benefits of long-term context; we step further in this direction by exploring larger models with more fine-grained analysis.…”
Section: Related Work (mentioning)
confidence: 96%
“…The intuition is, for example, if a pretrained encoder has learned to discard the input information, we cannot expect the encoder to perform well when transferred to any tasks. Also, existing studies show that neural language models assign more importance to local context when they make predictions (Khandelwal et al., 2018; Lai et al., 2020). Can we observe that encoders pretrained with artificial languages exhibit similar patterns to natural languages regarding how they encode the contextual information?…”
Section: Results (mentioning)
confidence: 74%
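The local-context effect mentioned in the last statement can also be sketched: shuffle the words far from the masked position and compare the model's confidence before and after. This is an illustrative approximation in the spirit of Khandelwal et al. (2018), not their exact procedure; the model, sentence, and window size are assumptions:

```python
# Hedged sketch of a local-vs-distant context probe: shuffle words far
# from the masked position and see how much the prediction degrades.
import random
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def prob_with_context(words, target_idx):
    """Probability of the original word at target_idx when it is masked."""
    ctx = list(words)
    ctx[target_idx] = tokenizer.mask_token
    inputs = tokenizer(" ".join(ctx), return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0, 0]
    target_id = tokenizer.convert_tokens_to_ids(words[target_idx])
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    return torch.softmax(logits, dim=-1)[target_id].item()

words = "in the early morning the quick brown fox jumps over the lazy dog".split()
target, near = 8, 3  # predict "jumps"; keep 3 words on each side intact
far = [i for i in range(len(words)) if abs(i - target) > near]
shuffled = list(words)
vals = [shuffled[i] for i in far]
random.shuffle(vals)  # perturb only the distant context
for i, v in zip(far, vals):
    shuffled[i] = v
print("intact:  ", prob_with_context(words, target))
print("shuffled:", prob_with_context(shuffled, target))
```

If the two probabilities are close, the model is, on this example, leaning mostly on local context, which is the pattern those studies report.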