Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining 2018
DOI: 10.1145/3159652.3159695
Improving Negative Sampling for Word Representation using Self-embedded Features

Abstract: Although the word-popularity based negative sampler has shown superb performance in the skip-gram model, the theoretical motivation behind oversampling popular (non-observed) words as negative samples is still not well understood. In this paper, we start from an investigation of the gradient vanishing issue in the skip-gram model without a proper negative sampler. By performing an insightful analysis from the stochastic gradient descent (SGD) learning perspective, we demonstrate that, both theoretically and in…
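The abstract's gradient-vanishing argument can be made concrete with a small numerical sketch. The NumPy snippet below is my own illustration, not the paper's code (all vectors and names are invented): it computes the per-sample gradients of the standard skip-gram negative-sampling loss L = -log σ(u_c·v_w) - Σ_n log σ(-u_n·v_w) and shows that an "easy" negative, whose score against the target word is strongly negative, contributes a near-zero update, while a high-scoring negative does not.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_gradients(v_w, u_c, u_negs):
    # Gradients of L = -log sigma(u_c.v_w) - sum_n log sigma(-u_n.v_w) w.r.t. v_w.
    grad_pos = -(1.0 - sigmoid(u_c @ v_w)) * u_c              # pull toward the observed context
    grad_negs = [sigmoid(u_n @ v_w) * u_n for u_n in u_negs]  # push away from each negative
    return grad_pos, grad_negs

rng = np.random.default_rng(0)
d = 50
v_w = rng.normal(scale=0.3, size=d)                 # target word vector
u_c = rng.normal(scale=0.3, size=d)                 # observed context vector
easy_neg = -v_w                                     # strongly negative score: sigma(u_n.v_w) ~ 0
hard_neg = v_w + 0.05 * rng.normal(size=d)          # high score: sigma(u_n.v_w) close to 1

_, (g_easy, g_hard) = sgns_gradients(v_w, u_c, [easy_neg, hard_neg])
print(np.linalg.norm(g_easy), np.linalg.norm(g_hard))  # the easy negative's gradient nearly vanishes

The gap between the two gradient norms is exactly the issue a score-aware (self-embedded) negative sampler is meant to address.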

Cited by 24 publications (34 citation statements). References 32 publications.
“…• SG: This is the original skip-gram model with SGD and negative sampling (Mikolov et al., 2013a,b). • SGA: This is the skip-gram model with an adaptive sampler (Chen et al., 2018). • GloVe: This method applies biased MF on the positive samples of the word co-occurrence matrix (Pennington et al., 2014).…”
Section: Methods
confidence: 99%
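The contrast this quote draws between SG and SGA can be sketched as follows. The snippet below is an illustration under my own assumptions (toy vocabulary, invented frequencies): it draws negatives the standard way, from the unigram distribution raised to the 3/4 power, and then in a score-aware way that favours candidates whose current embeddings already score highly against the target word. The cited adaptive sampler is rank-based and controlled by a parameter λ, so the softmax-over-scores used here is only a stand-in for "score-aware".

import numpy as np

rng = np.random.default_rng(1)

# Toy vocabulary with invented corpus frequencies.
vocab = ["the", "cat", "sat", "mat", "quantum"]
freq = np.array([5000.0, 300.0, 200.0, 150.0, 5.0])

# Standard skip-gram sampler: unigram distribution raised to the 3/4 power.
p_pop = freq ** 0.75
p_pop /= p_pop.sum()
negs_popularity = rng.choice(len(vocab), size=5, p=p_pop)

# Score-aware sampler: weight candidates by their current score against the target word.
d = 20
target = rng.normal(size=d)
context_embs = rng.normal(size=(len(vocab), d))
scores = context_embs @ target
p_score = np.exp(scores - scores.max())
p_score /= p_score.sum()
negs_adaptive = rng.choice(len(vocab), size=5, p=p_score)

print([vocab[i] for i in negs_popularity])
print([vocab[i] for i in negs_adaptive])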
“…for each model are given from left to right as follows. SG: subsampling of frequent words, window size and the number of negative samples; SGA: λ (Chen et al., 2018), which controls the distribution of the rank, with the other parameters the same as for SG; GloVe: xmax, window size and symmetric window; LexVec: subsampling of frequent words and the number of negative samples; AllVec: the negative weight α0 and δ. Boldface denotes the highest total accuracy. Figure 2(a) shows the impact of the overall weight α0 by setting δ to 0.75 (inspired by the setting of skip-gram).…”
Section: Impact of α−c
confidence: 99%
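The hyperparameter lists enumerated in this quote can be laid out as a small configuration sketch. The values below are purely hypothetical placeholders (only δ = 0.75 is stated in the quote); the keys simply mirror which knobs each model exposes.

# Hypothetical settings; the quote names the hyperparameters but not their values.
hyperparams = {
    "SG":     {"subsample": 1e-5, "window": 5, "negatives": 5},
    "SGA":    {"subsample": 1e-5, "window": 5, "negatives": 5, "lambda": 10},  # λ controls the rank distribution
    "GloVe":  {"x_max": 100, "window": 5, "symmetric_window": True},
    "LexVec": {"subsample": 1e-5, "negatives": 5},
    "AllVec": {"alpha_0": 75, "delta": 0.75},  # δ = 0.75 is the value stated in the quote
}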
“…Contrary to conventional generic sampling losses such as Sampled Softmax or Noise Contrastive Estimation, we advocate that negative sampling schemes should not try to approximate the softmax function but rather act as a way to induce an informative bias on the model of negatives. To this end, the authors in [7] provide an insightful analysis of negative sampling and show that negative samples that have high inner product scores with the context word provide more informative gradients. PU learning approaches have already shown some success in the context of Recommender Systems, as shown in [20].…”
Section: Negative Sampling From a PU Learning Perspective
confidence: 99%
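The observation attributed to [7] here, that negatives with high inner-product scores against the context word yield more informative gradients, translates directly into a "hard" negative-selection step. The sketch below is my own (invented names such as sample_informative_negatives, and not the exact rank-based scheme of the cited paper): it pre-samples a cheap uniform candidate pool and keeps the top-k candidates by inner product with the target word.

import numpy as np

def sample_informative_negatives(target_vec, cand_ids, cand_embs, k):
    # Score the candidate pool by inner product with the target vector and keep the
    # top-k as negatives: higher score => larger sigma(u_n.v_w) => larger gradient.
    scores = cand_embs @ target_vec
    top = np.argsort(-scores)[:k]
    return [cand_ids[i] for i in top]

rng = np.random.default_rng(0)
vocab_size, d = 1000, 50
context_embs = rng.normal(scale=0.1, size=(vocab_size, d))
target_vec = rng.normal(scale=0.1, size=d)

pool = rng.integers(0, vocab_size, size=64)   # cheap uniform pre-sample keeps scoring tractable
negatives = sample_informative_negatives(target_vec, pool, context_embs[pool], k=5)
print(negatives)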