2020
DOI: 10.48550/arxiv.2011.01403
Preprint

Supervised Contrastive Learning for Pre-trained Language Model Fine-tuning

Abstract: State-of-the-art natural language understanding classification models follow two stages: pre-training a large language model on an auxiliary task, and then fine-tuning the model on a task-specific labeled dataset using cross-entropy loss. Cross-entropy loss has several shortcomings that can lead to sub-optimal generalization and instability. Driven by the intuition that good generalization requires capturing the similarity between examples in one class and contrasting them with examples in other classes, we propo…
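The abstract describes augmenting the standard cross-entropy fine-tuning objective with a supervised contrastive term. Below is a minimal PyTorch sketch of that kind of objective, assuming a batch-wise supervised contrastive loss in the form popularized by Khosla et al. (2020); the temperature, weighting scheme, and function names are illustrative and not taken from the paper's released code.

import torch
import torch.nn.functional as F

def supervised_contrastive_loss(features, labels, temperature=0.3):
    # features: (B, D) encoder outputs (e.g. the [CLS] embedding); labels: (B,) class ids.
    # Examples with no same-label partner in the batch contribute zero to the loss.
    features = F.normalize(features, dim=1)
    sim = features @ features.t() / temperature                      # (B, B) similarities
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=features.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & not_self
    sim = sim.masked_fill(~not_self, float("-inf"))                  # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)       # row-wise log-softmax
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)   # sum over positives only
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    return -(pos_log_prob / pos_counts).mean()

def fine_tuning_loss(logits, features, labels, lambda_scl=0.9):
    # Weighted sum of cross-entropy and the contrastive term, mirroring the
    # two-part objective sketched in the abstract; lambda_scl is a placeholder value.
    return (1.0 - lambda_scl) * F.cross_entropy(logits, labels) + \
           lambda_scl * supervised_contrastive_loss(features, labels)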

Cited by 41 publications (59 citation statements)
References 18 publications
“…Our experiments show a consistent performance improvement over the baseline when using SoftTriple loss, which is highest for the small and medium-sized datasets and decreases for the large and extra-large datasets. This is a significant improvement over previous related work, where the performance improvement for supervised classification tasks was achieved only in few-shot learning settings (Gunel et al., 2020). We also conclude that the smaller the dataset, the higher our new goal function's performance gain over the baseline. The performance comparison between the baseline and our method across dataset sizes is depicted in Figure 1.…”
Section: Discussion (mentioning)
confidence: 49%
“…Our implementation is a development of the earlier work (Gunel et al., 2020), where the contrastive loss was applied only to the embedding corresponding to the first [CLS] token of the input vector x_i. We apply SoftTriple loss to the embeddings corresponding to all tokens of the input vector x_i, which ensures better generalization of the fine-tuning process but requires more computing power.…”
Section: SoftTriple Loss (mentioning)
confidence: 99%
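As a rough illustration of the distinction drawn in this statement, the sketch below (using the Hugging Face transformers API with an arbitrary BERT checkpoint; variable names and label handling are illustrative, not the cited authors' code) shows the two choices of embeddings an embedding-level loss such as contrastive or SoftTriple could be applied to: only the first [CLS] position, or every token position with each token inheriting its sentence's label.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["an example sentence", "another input"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([0, 1])

hidden = encoder(**batch).last_hidden_state             # (B, seq_len, D)

# Option 1: [CLS] embedding only, as in Gunel et al. (2020) -- one vector per example.
cls_embeddings = hidden[:, 0, :]                         # (B, D)

# Option 2: all token embeddings, as in the cited work -- seq_len vectors per example,
# each inheriting its sentence's label, at a proportionally higher compute cost.
token_embeddings = hidden.reshape(-1, hidden.size(-1))   # (B * seq_len, D)
token_labels = labels.repeat_interleave(hidden.size(1))  # (B * seq_len,)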
“…Among all possible similarity learning methods, contrastive learning (Chopra et al., 2005; Hadsell et al., 2006; Oord et al., 2018) has become one of the most prominent supervised (Khosla et al., 2020; Gunel et al., 2020) and self-supervised (Bachman et al., 2019; Tian et al., 2020a; He et al., 2020; Chen et al., 2020) ML techniques for learning representations of high-dimensional data, producing impressive results in several fields (Le-Khac et al., 2020; Jaiswal et al., 2021). Despite its success, contrastive learning usually requires huge datasets, often created using data augmentation techniques.…”
Section: Related Work (mentioning)
confidence: 99%
“…Bayesian Personalized Ranking [1], [2] was one of the initial methods that used the triplet loss for personalized recommender systems. The triplet loss [3] is similar to the contrastive loss [4], [5], [6] but operates on a triplet of embeddings rather than a pair. In this work, we use the triplet loss in a multi-task learning method to learn named entity recognition.…”
Section: Introduction (mentioning)
confidence: 99%
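For reference, a minimal PyTorch sketch of the two losses being contrasted in this statement: the pairwise contrastive loss in its classic Hadsell et al. (2006) form and the margin-based triplet loss. Margin values and function names are illustrative.

import torch
import torch.nn.functional as F

def contrastive_loss(x1, x2, same_label, margin=1.0):
    # Pairwise form: two embeddings plus a 0/1 label; same-label pairs are pulled
    # together, different-label pairs are pushed at least `margin` apart.
    d = F.pairwise_distance(x1, x2)
    return (same_label * d.pow(2) +
            (1.0 - same_label) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Triplet form: the anchor should be closer to the positive than to the
    # negative by at least `margin` (equivalent to torch.nn.TripletMarginLoss).
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()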