Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining 2018
DOI: 10.1145/3159652.3159695
Improving Negative Sampling for Word Representation using Self-embedded Features

Abstract: Although the word-popularity based negative sampler has shown superb performance in the skip-gram model, the theoretical motivation behind oversampling popular (non-observed) words as negative samples is still not well understood. In this paper, we start from an investigation of the gradient vanishing issue in the skip-gram model without a proper negative sampler. By performing an insightful analysis from the stochastic gradient descent (SGD) learning perspective, we demonstrate that, both theoretically and in…
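The abstract's gradient-vanishing argument can be made concrete with a small numerical sketch. The NumPy snippet below is my own illustration, not the paper's code (all vectors and names are invented): it computes the per-sample gradients of the standard skip-gram negative-sampling loss L = -log σ(u_c·v_w) - Σ_n log σ(-u_n·v_w) and shows that an "easy" negative, whose score against the target word is strongly negative, contributes a near-zero update, while a high-scoring negative does not.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_gradients(v_w, u_c, u_negs):
    # Gradients of L = -log sigma(u_c.v_w) - sum_n log sigma(-u_n.v_w) w.r.t. v_w.
    grad_pos = -(1.0 - sigmoid(u_c @ v_w)) * u_c              # pull toward the observed context
    grad_negs = [sigmoid(u_n @ v_w) * u_n for u_n in u_negs]  # push away from each negative
    return grad_pos, grad_negs

rng = np.random.default_rng(0)
d = 50
v_w = rng.normal(scale=0.3, size=d)                 # target word vector
u_c = rng.normal(scale=0.3, size=d)                 # observed context vector
easy_neg = -v_w                                     # strongly negative score: sigma(u_n.v_w) ~ 0
hard_neg = v_w + 0.05 * rng.normal(size=d)          # high score: sigma(u_n.v_w) close to 1

_, (g_easy, g_hard) = sgns_gradients(v_w, u_c, [easy_neg, hard_neg])
print(np.linalg.norm(g_easy), np.linalg.norm(g_hard))  # the easy negative's gradient nearly vanishes

The gap between the two gradient norms is exactly the issue a score-aware (self-embedded) negative sampler is meant to address.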

Cited by 24 publications (34 citation statements). References 32 publications.
“…• SG: This is the original skip-gram model with SGD and negative sampling (Mikolov et al., 2013a,b). • SGA: This is the skip-gram model with an adaptive sampler (Chen et al., 2018). • GloVe: This method applies biased MF on the positive samples of the word co-occurrence matrix (Pennington et al., 2014).…”
Section: Methods
confidence: 99%
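The contrast this quote draws between SG and SGA can be sketched as follows. The snippet below is an illustration under my own assumptions (toy vocabulary, invented frequencies): it draws negatives the standard way, from the unigram distribution raised to the 3/4 power, and then in a score-aware way that favours candidates whose current embeddings already score highly against the target word. The cited adaptive sampler is rank-based and controlled by a parameter λ, so the softmax-over-scores used here is only a stand-in for "score-aware".

import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary with invented corpus frequencies.
vocab = ["the", "cat", "sat", "mat", "quantum"]
freq = np.array([5000.0, 300.0, 200.0, 150.0, 5.0])

# Standard skip-gram sampler: unigram distribution raised to the 3/4 power.
p_pop = freq ** 0.75
p_pop /= p_pop.sum()
negs_popularity = rng.choice(len(vocab), size=5, p=p_pop)

# Score-aware sampler: weight candidates by their current score against the target word.
d = 20
target = rng.normal(size=d)
context_embs = rng.normal(size=(len(vocab), d))
scores = context_embs @ target
p_score = np.exp(scores - scores.max())
p_score /= p_score.sum()
negs_adaptive = rng.choice(len(vocab), size=5, p=p_score)

print([vocab[i] for i in negs_popularity])
print([vocab[i] for i in negs_adaptive])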
“…for each model are given from left to right as follows. SG: subsampling of frequent words, window size and the number of negative samples; SGA: λ (Chen et al., 2018), which controls the distribution of the rank, with the other parameters the same as for SG; GloVe: xmax, window size and symmetric window; LexVec: subsampling of frequent words and the number of negative samples; AllVec: the negative weight α0 and δ. Boldface denotes the highest total accuracy. Figure 2(a) shows the impact of the overall weight α0 by setting δ to 0.75 (inspired by the setting of skip-gram).…”
Section: Impact of α−c
confidence: 99%
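The hyperparameter lists enumerated in this quote can be laid out as a small configuration sketch. The values below are purely hypothetical placeholders (only δ = 0.75 is stated in the quote); the keys simply mirror which knobs each model exposes.

# Hypothetical settings; the quote names the hyperparameters but not their values.
hyperparams = {
    "SG":     {"subsample": 1e-5, "window": 5, "negatives": 5},
    "SGA":    {"subsample": 1e-5, "window": 5, "negatives": 5, "lambda": 10},  # λ controls the rank distribution
    "GloVe":  {"x_max": 100, "window": 5, "symmetric_window": True},
    "LexVec": {"subsample": 1e-5, "negatives": 5},
    "AllVec": {"alpha_0": 75, "delta": 0.75},  # δ = 0.75 is the value stated in the quote
}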
“…Contrary to conventional generic sampling losses such as Sampled Softmax or Noise Contrastive Estimation, we advocate that negative sampling schemes should not try to approximate the softmax function but rather act as a way to induce an informative bias on the model of negatives. To this end, the authors in [7] provide an insightful analysis of negative sampling and show that negative samples that have high inner product scores with the context word provide more informative gradients. PU learning approaches have already shown some success in the context of Recommender Systems, as shown in [20].…”
Section: Negative Sampling From a PU Learning Perspective
confidence: 99%
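The observation attributed to [7] here, that negatives with high inner-product scores against the context word yield more informative gradients, translates directly into a "hard" negative-selection step. The sketch below is my own (invented names such as sample_informative_negatives, and not the exact rank-based scheme of the cited paper): it pre-samples a cheap uniform candidate pool and keeps the top-k candidates by inner product with the target word.

import numpy as np

def sample_informative_negatives(target_vec, cand_ids, cand_embs, k):
    # Score the candidate pool by inner product with the target vector and keep the
    # top-k as negatives: higher score => larger sigma(u_n.v_w) => larger gradient.
    scores = cand_embs @ target_vec
    top = np.argsort(-scores)[:k]
    return [cand_ids[i] for i in top]

rng = np.random.default_rng(0)
vocab_size, d = 1000, 50
context_embs = rng.normal(scale=0.1, size=(vocab_size, d))
target_vec = rng.normal(scale=0.1, size=d)

pool = rng.integers(0, vocab_size, size=64)   # cheap uniform pre-sample keeps scoring tractable
negatives = sample_informative_negatives(target_vec, pool, context_embs[pool], k=5)
print(negatives)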