2020
DOI: 10.48550/arxiv.2010.02194
Preprint

Self-training Improves Pre-training for Natural Language Understanding

Abstract: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach …
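
The retrieval step described in the abstract can be sketched roughly as follows: embed the task's labeled sentences, average them into a task-specific query embedding, and rank a large bank of unlabeled sentences by cosine similarity. This is a minimal illustration only; the encoder (a generic sentence-transformers model), the tiny in-memory bank, and the top-k cutoff are assumptions standing in for the paper's own sentence encoder and web-scale bank of billions of sentences.

```python
# Minimal sketch of embedding-based sentence retrieval for data augmentation.
# Assumptions: a generic sentence-transformers encoder and a tiny in-memory
# sentence bank; the paper's SentAugment uses its own sentence encoder and a
# bank of billions of sentences crawled from the web.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

labeled_texts = ["the film was wonderful", "a dull and lifeless movie"]  # task training data
sentence_bank = [
    "one of the best performances this year",
    "the recipe calls for two cups of flour",
    "i fell asleep halfway through the film",
]  # stand-in for the unlabeled web-sentence bank

# Task-specific query embedding: mean of the labeled-sentence embeddings.
labeled_emb = encoder.encode(labeled_texts, normalize_embeddings=True)
query = labeled_emb.mean(axis=0)
query /= np.linalg.norm(query)

# Cosine similarity (embeddings are L2-normalized, so a dot product suffices).
bank_emb = encoder.encode(sentence_bank, normalize_embeddings=True)
scores = bank_emb @ query

# Keep the top-k most similar sentences as in-domain augmentation data.
top_k = 2
retrieved = [sentence_bank[i] for i in np.argsort(-scores)[:top_k]]
print(retrieved)
```

The retrieved sentences are then pseudo-labeled by a teacher model and used to train a student, as described in the citation statements below.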

Cited by 34 publications (12 citation statements) · References 35 publications

“…We present a simple and intuitive approach to semi-supervised learning on (potentially) infinite streams of unlabeled data. Our approach integrates insights from different bodies of work including self-training [19,85], pseudolabelling [41,4,35], continual/iterated learning [38,39,74,75,67], and few-shot learning [44,28]. We demonstrate a number of surprising conclusions: (1) Unlabeled domain-agnostic internet streams can be used to significantly improve models for specialized tasks and data domains, including surface normal prediction, semantic segmentation, and few-shot fine-grained image classification spanning diverse domains including medical, satellite, and agricultural imagery.…”
Section: Discussion
confidence: 99%
“…Self-Training and Semi-Supervised Learning: A large variety of self-training [19,85] and semi-supervised approaches [58,65,78,94,91,93] use unlabeled images in conjunction with labeled images to learn a better representation (Fig. 2-(b)).…”
Section: Go-to
confidence: 99%
“…As a general method to improve Transformer-based model performance on downstream tasks, [36] and [16] propose further language model pretraining in the target domain before the final fine-tuning. [11] proposes to use self-training as another way to leverage unlabeled data, where a teacher model is first trained on labeled data and is then used to label a large amount of in-domain unlabeled data for the student model to learn from. Recent developments in language model pretraining have also advanced state-of-the-art results on a wide range of NLP tasks.…”
Section: Improving BERT for Text Classification
confidence: 99%
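
The teacher-student procedure paraphrased in the statement above can be sketched as a short loop: fit a teacher on the labeled data, pseudo-label in-domain unlabeled sentences, keep only confident predictions, and fit a student on the union. The TF-IDF + logistic-regression models and the 0.9 confidence threshold below are illustrative assumptions; the cited work fine-tunes pre-trained language models instead.

```python
# Sketch of one round of self-training (teacher pseudo-labels, student learns).
# Stand-in models: TF-IDF + logistic regression instead of fine-tuned language
# models; the confidence threshold is an assumed hyperparameter.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_texts = ["great movie", "terrible plot", "loved every minute", "boring film"]
labels = np.array([1, 0, 1, 0])
unlabeled_texts = ["an absolute delight", "a waste of time", "fine, i guess"]

# 1. Train the teacher on the labeled data only.
teacher = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
teacher.fit(labeled_texts, labels)

# 2. Pseudo-label the unlabeled, in-domain sentences and keep confident ones.
probs = teacher.predict_proba(unlabeled_texts)
pseudo_labels = probs.argmax(axis=1)
keep = probs.max(axis=1) >= 0.9  # assumed confidence threshold

# 3. Train the student on labeled data plus confident pseudo-labels.
student_texts = labeled_texts + [t for t, k in zip(unlabeled_texts, keep) if k]
student_labels = np.concatenate([labels, pseudo_labels[keep]])
student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
student.fit(student_texts, student_labels)
```
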
“…The approach has gained popularity in many applications. For example, in conjunction with pre-trained language models (Devlin et al., 2019), self-training has demonstrated superior performance on tasks such as natural language understanding (Du et al., 2020), named entity recognition (Liang et al., 2020), and question answering (Sachan and Xing, 2018).…”
Section: Introduction
confidence: 99%
“…Self-training is an effective complement to these pre-trained models. For example, combining self-training with pre-trained language models achieves superior performance in tasks such as text classification (Meng et al., 2020; Mukherjee and Awadallah, 2020; Du et al., 2020), named entity recognition (Liang et al., 2020), reading comprehension (Niu et al., 2020) and dialogue systems (Mi et al., 2021).…”
Section: Introduction
confidence: 99%