Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021
DOI: 10.18653/v1/2021.naacl-main.426

Self-training Improves Pre-training for Natural Language Understanding

Abstract: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach …
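The abstract describes an embedding-based retrieval step: labeled task data is turned into a query embedding that is matched against a large bank of unlabeled web sentences. Below is a minimal sketch of that idea, assuming an off-the-shelf sentence encoder in place of the paper's own embeddings and a tiny in-memory sentence bank; the model choice and function names are illustrative, not the released SentAugment code.

```python
# Sketch of SentAugment-style retrieval (illustrative, not the authors' code).
# Assumes a generic sentence encoder and a small in-memory "bank" rather than
# billions of web sentences.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed stand-in encoder

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice

def build_task_query(labeled_texts):
    """Average the embeddings of labeled task sentences into one query vector."""
    emb = encoder.encode(labeled_texts, convert_to_numpy=True)
    query = emb.mean(axis=0)
    return query / np.linalg.norm(query)

def retrieve(query, sentence_bank, top_k=5):
    """Rank bank sentences by cosine similarity to the task query."""
    bank = encoder.encode(sentence_bank, convert_to_numpy=True)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    scores = bank @ query
    top = np.argsort(-scores)[:top_k]
    return [(sentence_bank[i], float(scores[i])) for i in top]

# Usage: build a task-averaged query, then pull the nearest bank sentences.
labeled_texts = ["the movie was wonderful", "a tedious, boring film"]
sentence_bank = ["great acting and a moving plot",
                 "stock prices fell on Tuesday",
                 "i could not stop yawning during the film"]
print(retrieve(build_task_query(labeled_texts), sentence_bank, top_k=2))
```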

Cited by 71 publications (66 citation statements)
References 28 publications
“…Self-training. We use the fine-tuned models with labeled data as teachers to generate pseudo soft labels on unlabeled data, following Du et al. (2021). The pseudo-labeled data are combined with the original labeled data to train student models by optimizing the objective function in Equation 2.…”
Section: Methods
confidence: 99%
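The quoted passage describes the standard teacher-student loop: a fine-tuned teacher produces soft pseudo-labels on unlabeled text, which are then mixed with the gold-labeled data to train a student. The sketch below illustrates that loop under simplifying assumptions: both models are tiny linear classifiers over pre-computed features, and the unlabeled loss is a KL divergence to the teacher's distribution, one common choice rather than the citing paper's exact Equation 2.

```python
# Sketch of self-training with soft pseudo-labels (teacher -> student).
# Assumes `teacher` has already been fine-tuned on the labeled set; here both
# models are toy linear classifiers over fixed features, purely for illustration.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, feat_dim = 3, 16
teacher = torch.nn.Linear(feat_dim, num_classes)   # stands in for a fine-tuned model
student = torch.nn.Linear(feat_dim, num_classes)
opt = torch.optim.Adam(student.parameters(), lr=1e-2)

labeled_x = torch.randn(32, feat_dim)
labeled_y = torch.randint(0, num_classes, (32,))
unlabeled_x = torch.randn(128, feat_dim)            # e.g. retrieved sentences

# Teacher produces soft pseudo-labels on the unlabeled data.
with torch.no_grad():
    soft_targets = F.softmax(teacher(unlabeled_x), dim=-1)

for step in range(100):
    opt.zero_grad()
    # Supervised loss on gold labels plus KL to the teacher's soft distribution.
    ce = F.cross_entropy(student(labeled_x), labeled_y)
    kl = F.kl_div(F.log_softmax(student(unlabeled_x), dim=-1),
                  soft_targets, reduction="batchmean")
    loss = ce + kl
    loss.backward()
    opt.step()
```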
“…He et al. (2020) injected noise into the input space as a noisy version of self-training for neural sequence generation and obtained state-of-the-art performance on tasks such as neural machine translation. Du et al. (2021) used information retrieval to retrieve task-specific in-domain data from a large bank of web sentences for self-training. Beyond these applications of self-training, Wei et al. (2021) further proved theoretically that self-training and input-consistency regularization achieve high accuracy with respect to ground-truth labels under certain assumptions.…”
Section: Related Work
confidence: 99%
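The input-consistency regularization mentioned in this passage penalizes disagreement between a model's predictions on an input and on a perturbed copy of it. The sketch below uses simple Gaussian perturbation of input features as the noise; the cited works use richer input noise (e.g. paraphrasing or token-level corruption), so this is only an illustrative stand-in.

```python
# Sketch of input-consistency regularization in the spirit of noisy self-training.
# The noise model (Gaussian on features) is an assumption for illustration.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 3)          # placeholder classifier
x = torch.randn(64, 16)                 # unlabeled inputs (feature view)

def consistency_loss(model, x, noise_std=0.1):
    """Penalize disagreement between predictions on clean and noised inputs."""
    with torch.no_grad():
        clean = F.softmax(model(x), dim=-1)            # target: clean prediction
    noised = F.log_softmax(model(x + noise_std * torch.randn_like(x)), dim=-1)
    return F.kl_div(noised, clean, reduction="batchmean")

print(consistency_loss(model, x))
```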
“…Huang et al. (2019) explored a multi-task sentence encoding model for semantic retrieval in QA systems. Du et al. (2021) introduced SentAugment, a data augmentation method that computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Yang et al. (2020) use the Universal Sentence Encoder (USE) for semantic similarity and semantic retrieval in a multilingual setting.…”
Section: Sentence-level Novelty Detection
confidence: 99%
“…Currently, the best sentence-embedding approaches are trained with supervision on large labeled datasets (Conneau et al., 2017; Cer et al., 2018; Reimers and Gurevych, 2019; Du et al., 2021; Wieting et al., 2020; Huang et al., 2021), such as NLI datasets (Bowman et al., 2015; Williams et al., 2018) or paraphrase corpora (Dolan and Brockett, 2005). [Footnote 1: Code available at https://github.com/marco-digio/Twitter4SSE] Round-trip translation has also been exploited, where semantically similar pairs of sentences are generated by translating the non-English side of NMT pairs, as in ParaNMT (Wieting and Gimpel, 2018) and Opusparcus (Creutz, 2018).…”
Section: Introduction and Related Work
confidence: 99%
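The round-trip translation idea mentioned above produces paraphrase pairs by passing text through another language and back. The cited works translate the non-English side of existing parallel pairs; the sketch below instead round-trips a monolingual English sentence, which yields the same kind of paraphrase pair. The MarianMT checkpoints are off-the-shelf choices for illustration, not those used by the cited papers.

```python
# Sketch of round-trip translation for generating a paraphrase pair.
# Model names are illustrative off-the-shelf MarianMT checkpoints.
from transformers import pipeline

en_to_fr = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
fr_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-fr-en")

def round_trip(sentence: str):
    """Return (original, back-translated) as a candidate paraphrase pair."""
    french = en_to_fr(sentence)[0]["translation_text"]
    back = fr_to_en(french)[0]["translation_text"]
    return sentence, back

print(round_trip("Self-training improves pre-training for language understanding."))
```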