Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1439

Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling

Abstract: Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019), which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data…

Cited by 60 publications (56 citation statements: 4 supporting, 52 mentioning, 0 contrasting)
References 33 publications
“…We use the GLUE benchmark tasks (Wang et al., 2018b) for training all the models. Such tasks are considered important for general linguistic intelligence, have lots of supervised data for many tasks, and have been useful for transfer learning (Phang et al., 2018; Wang et al., 2018a). We consider the following tasks for training: MNLI (m/mm), SST-2, QNLI, QQP, MRPC, RTE, and the SNLI dataset (Bowman et al., 2015).…”
Section: Training Tasks
Citation type: mentioning, confidence: 99%
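The training mixture named in this citation statement (the GLUE tasks plus SNLI) can be assembled with the Hugging Face datasets library. The following is a minimal sketch under that assumption, not the citing paper's actual data-loading code; task and dataset identifiers are the public Hugging Face names.

```python
# Sketch: load the GLUE + SNLI training mixture named in the citation statement.
# Not the citing paper's code; uses the public Hugging Face `datasets` identifiers.
from datasets import load_dataset

GLUE_TASKS = ["mnli", "sst2", "qnli", "qqp", "mrpc", "rte"]

def load_training_tasks():
    """Return a dict mapping task name -> training split."""
    tasks = {name: load_dataset("glue", name, split="train") for name in GLUE_TASKS}
    tasks["snli"] = load_dataset("snli", split="train")  # Bowman et al., 2015
    return tasks

if __name__ == "__main__":
    for name, ds in load_training_tasks().items():
        print(f"{name}: {len(ds)} training examples")
```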
“…Implementation Details: Since dataset sizes can be imbalanced, this can affect multi-task and meta-learning performance. Wang et al. (2018a) analyze this in detail for multi-task learning. We explored sampling tasks with uniform probability, proportional to size, and proportional to the square root of the size of the task.…”
Section: Evaluation and Baselines
Citation type: mentioning, confidence: 99%
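The three sampling schemes mentioned (uniform, proportional to task size, and proportional to the square root of task size) reduce to a choice of per-task weights that are then normalized. The sketch below is an illustrative implementation, not the citing paper's code; the example sizes are the public GLUE training-split sizes, used here only for demonstration.

```python
# Sketch of the three task-sampling schemes: uniform, proportional to size,
# and proportional to sqrt(size). Illustrative only.
import numpy as np

def task_sampling_probs(task_sizes: dict, scheme: str = "sqrt") -> dict:
    """Return per-task sampling probabilities for multi-task training."""
    names = list(task_sizes)
    sizes = np.array([task_sizes[n] for n in names], dtype=float)
    if scheme == "uniform":
        weights = np.ones_like(sizes)
    elif scheme == "proportional":
        weights = sizes
    elif scheme == "sqrt":
        weights = np.sqrt(sizes)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return dict(zip(names, weights / weights.sum()))

# Example with public GLUE training-set sizes:
sizes = {"mnli": 392_702, "sst2": 67_349, "rte": 2_490}
print(task_sampling_probs(sizes, scheme="sqrt"))
```

Square-root sampling sits between the other two schemes: large tasks are still seen more often, but small tasks are not drowned out as they would be under purely proportional sampling.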
“…Recent NLP work has also found that neural networks do not readily transfer knowledge across tasks; e.g., pretrained models often perform worse than non-pretrained models (Wang et al., 2019). This lack of generalization across tasks might be due to the tendency of multi-task neural networks to create largely independent representations for different tasks even when a shared representation could be used (Kirov and Frank, 2012).…”
Section: Will Models Generalize Across
Citation type: mentioning, confidence: 99%
“…To pursue this goal, we adopted an architecture (Wang et al., 2019) that accommodates pretrained semantic representations, comprising three levels: the input layer, the shared encoder layers, and the task-specific model. For the top layer with task-specific information in each downstream task, we used the respective layer from GLUE.…”
Section: Training and Evaluation
Citation type: mentioning, confidence: 99%
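The three-level design described here (an input layer of pretrained embeddings, shared encoder layers, and a task-specific model per downstream task) can be sketched as a small PyTorch module. Class and parameter names below are illustrative assumptions, not the cited implementation.

```python
# Sketch of the three-level architecture: input layer -> shared encoder -> per-task head.
# Names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, embeddings: nn.Embedding, encoder: nn.Module,
                 encoder_dim: int, task_num_labels: dict):
        super().__init__()
        self.embeddings = embeddings  # input layer (pretrained word embeddings)
        self.encoder = encoder        # shared sentence encoder
        # one task-specific classifier per downstream (e.g. GLUE) task
        self.heads = nn.ModuleDict({
            task: nn.Linear(encoder_dim, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, token_ids: torch.Tensor, task: str) -> torch.Tensor:
        embedded = self.embeddings(token_ids)  # (batch, seq, emb_dim)
        encoded = self.encoder(embedded)       # assumed to return a pooled (batch, encoder_dim) vector
        return self.heads[task](encoded)       # task-specific logits
```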
“…To obtain the middle layer, encoding sentence semantics, we used a 2-layered biLSTM with dimensionality 1024. Instead of random initialization, the sentence encoder is first trained with one of the best performing pretraining tasks reported in Wang et al. (2019), namely STS-B. For the input layer, we experimented with each of the pretrained word embeddings discussed above in Sections 4 and 5: for each downstream task, different models were trained with the different pretrained word embeddings, and also with the baseline consisting of embeddings with random vectors.…”
Section: Training and Evaluation
Citation type: mentioning, confidence: 99%
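For the middle layer itself, a hedged sketch of a 2-layer biLSTM encoder with a 1024-dimensional output, initialized from pretrained word embeddings, might look like the following. The class name and the max-pooling choice are assumptions not specified in the citation statement.

```python
# Sketch of a 2-layer biLSTM sentence encoder whose concatenated forward/backward
# states give a 1024-dimensional representation, fed with pretrained word embeddings.
import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    def __init__(self, pretrained_vectors: torch.Tensor, output_dim: int = 1024):
        super().__init__()
        # input layer initialized from pretrained word embeddings (not random)
        self.embeddings = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.bilstm = nn.LSTM(
            input_size=pretrained_vectors.size(1),
            hidden_size=output_dim // 2,  # 512 per direction -> 1024 concatenated
            num_layers=2,
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embeddings(token_ids)  # (batch, seq, emb_dim)
        states, _ = self.bilstm(embedded)      # (batch, seq, 1024)
        return states.max(dim=1).values        # max-pool over time -> (batch, 1024)
```

Per the statement above, such an encoder would first be trained on STS-B before the task-specific layers are attached, rather than being randomly initialized.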