Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1439

Can You Tell Me How to Get Past Sesame Street? Sentence-Level Pretraining Beyond Language Modeling

Abstract: Natural language understanding has recently seen a surge of progress with the use of sentence encoders like ELMo (Peters et al., 2018a) and BERT (Devlin et al., 2019), which are pretrained on variants of language modeling. We conduct the first large-scale systematic study of candidate pretraining tasks, comparing 19 different tasks both as alternatives and complements to language modeling. Our primary results support the use of language modeling, especially when combined with pretraining on additional labeled-data…

Cited by 60 publications (56 citation statements: 4 supporting, 52 mentioning, 0 contrasting)
References 33 publications
“…We use the GLUE benchmark tasks (Wang et al., 2018b) for training all the models. Such tasks are considered important for general linguistic intelligence, have lots of supervised data for many tasks, and have been useful for transfer learning (Phang et al., 2018; Wang et al., 2018a). We consider the following tasks for training: MNLI (m/mm), SST-2, QNLI, QQP, MRPC, RTE, and the SNLI dataset (Bowman et al., 2015).…”
Section: Training Tasks
Citation type: mentioning, confidence: 99%
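The training mixture named in this citation statement (the GLUE tasks plus SNLI) can be assembled with the Hugging Face datasets library. The following is a minimal sketch under that assumption, not the citing paper's actual data-loading code; task and dataset identifiers are the public Hugging Face names.

```python
# Sketch: load the GLUE + SNLI training mixture named in the citation statement.
# Not the citing paper's code; uses the public Hugging Face `datasets` identifiers.
from datasets import load_dataset

GLUE_TASKS = ["mnli", "sst2", "qnli", "qqp", "mrpc", "rte"]

def load_training_tasks():
    """Return a dict mapping task name -> training split."""
    tasks = {name: load_dataset("glue", name, split="train") for name in GLUE_TASKS}
    tasks["snli"] = load_dataset("snli", split="train")  # Bowman et al., 2015
    return tasks

if __name__ == "__main__":
    for name, ds in load_training_tasks().items():
        print(f"{name}: {len(ds)} training examples")
```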
“…Implementation Details: Since dataset sizes can be imbalanced, this can affect multi-task and meta-learning performance. Wang et al. (2018a) analyze this in detail for multi-task learning. We explored sampling tasks with uniform probability, proportional to size, and proportional to the square root of the size of the task.…”
Section: Evaluation and Baselines
Citation type: mentioning, confidence: 99%
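The three sampling schemes mentioned (uniform, proportional to task size, and proportional to the square root of task size) reduce to a choice of per-task weights that are then normalized. The sketch below is an illustrative implementation, not the citing paper's code; the example sizes are the public GLUE training-split sizes, used here only for demonstration.

```python
# Sketch of the three task-sampling schemes: uniform, proportional to size,
# and proportional to sqrt(size). Illustrative only.
import numpy as np

def task_sampling_probs(task_sizes: dict, scheme: str = "sqrt") -> dict:
    """Return per-task sampling probabilities for multi-task training."""
    names = list(task_sizes)
    sizes = np.array([task_sizes[n] for n in names], dtype=float)
    if scheme == "uniform":
        weights = np.ones_like(sizes)
    elif scheme == "proportional":
        weights = sizes
    elif scheme == "sqrt":
        weights = np.sqrt(sizes)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return dict(zip(names, weights / weights.sum()))

# Example with public GLUE training-set sizes:
sizes = {"mnli": 392_702, "sst2": 67_349, "rte": 2_490}
print(task_sampling_probs(sizes, scheme="sqrt"))
```

Square-root sampling sits between the other two schemes: large tasks are still seen more often, but small tasks are not drowned out as they would be under purely proportional sampling.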
“…Recent NLP work has also found that neural networks do not readily transfer knowledge across tasks; e.g., pretrained models often perform worse than non-pretrained models (Wang et al., 2019). This lack of generalization across tasks might be due to the tendency of multi-task neural networks to create largely independent representations for different tasks even when a shared representation could be used (Kirov and Frank, 2012).…”
Section: Will Models Generalize Across
Citation type: mentioning, confidence: 99%
“…To pursue this goal, we adopted an architecture (Wang et al., 2019) that accommodates pretrained semantic representations, comprising three levels: the input layer, the shared encoder layers, and the task-specific model. For the top layer with task-specific information in each downstream task, we used the respective layer from GLUE.…”
Section: Training and Evaluation
Citation type: mentioning, confidence: 99%
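The three-level design described here (an input layer of pretrained embeddings, shared encoder layers, and a task-specific model per downstream task) can be sketched as a small PyTorch module. Class and parameter names below are illustrative assumptions, not the cited implementation.

```python
# Sketch of the three-level architecture: input layer -> shared encoder -> per-task head.
# Names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    def __init__(self, embeddings: nn.Embedding, encoder: nn.Module,
                 encoder_dim: int, task_num_labels: dict):
        super().__init__()
        self.embeddings = embeddings  # input layer (pretrained word embeddings)
        self.encoder = encoder        # shared sentence encoder
        # one task-specific classifier per downstream (e.g. GLUE) task
        self.heads = nn.ModuleDict({
            task: nn.Linear(encoder_dim, n_labels)
            for task, n_labels in task_num_labels.items()
        })

    def forward(self, token_ids: torch.Tensor, task: str) -> torch.Tensor:
        embedded = self.embeddings(token_ids)  # (batch, seq, emb_dim)
        encoded = self.encoder(embedded)       # assumed to return a pooled (batch, encoder_dim) vector
        return self.heads[task](encoded)       # task-specific logits
```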
“…To obtain the middle layer, encoding sentence semantics, we used a 2-layered biLSTM with dimensionality 1024. Instead of random initialization, the sentence encoder is first trained with one of the best performing pretraining tasks reported in Wang et al. (2019), namely STS-B. For the input layer, we experimented with each of the pretrained word embeddings discussed above in Sections 4 and 5: for each downstream task, different models were trained with the different pretrained word embeddings, and also with the baseline consisting of embeddings with random vectors.…”
Section: Training and Evaluation
Citation type: mentioning, confidence: 99%
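For the middle layer itself, a hedged sketch of a 2-layer biLSTM encoder with a 1024-dimensional output, initialized from pretrained word embeddings, might look like the following. The class name and the max-pooling choice are assumptions not specified in the citation statement.

```python
# Sketch of a 2-layer biLSTM sentence encoder whose concatenated forward/backward
# states give a 1024-dimensional representation, fed with pretrained word embeddings.
import torch
import torch.nn as nn

class BiLSTMSentenceEncoder(nn.Module):
    def __init__(self, pretrained_vectors: torch.Tensor, output_dim: int = 1024):
        super().__init__()
        # input layer initialized from pretrained word embeddings (not random)
        self.embeddings = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)
        self.bilstm = nn.LSTM(
            input_size=pretrained_vectors.size(1),
            hidden_size=output_dim // 2,  # 512 per direction -> 1024 concatenated
            num_layers=2,
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embeddings(token_ids)  # (batch, seq, emb_dim)
        states, _ = self.bilstm(embedded)      # (batch, seq, 1024)
        return states.max(dim=1).values        # max-pool over time -> (batch, 1024)
```

Per the statement above, such an encoder would first be trained on STS-B before the task-specific layers are attached, rather than being randomly initialized.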