2017
DOI: 10.48550/arxiv.1703.05160
Preprint
A New Unbiased and Efficient Class of LSH-Based Samplers and Estimators for Partition Function Computation in Log-Linear Models

Abstract: Log-linear models are arguably the most successful class of graphical models for large-scale applications because of their simplicity and tractability. Learning and inference with these models require calculating the partition function, which is a major bottleneck and intractable for large state spaces. Importance Sampling (IS) and MCMC-based approaches are attractive alternatives. However, the condition of having a "good" proposal distribution is often not satisfied in practice. In this paper, we add a new dimension to effi…
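The abstract's central claim, an unbiased estimator built from an LSH sampler whose inclusion probabilities are known in closed form, can be made concrete with a short sketch. The Python below illustrates the general idea only, not the paper's exact construction: the SimHash family, the (K, L) table parameters, and all function names are assumptions made for this example.

```python
import numpy as np

# Minimal sketch (assumed construction, not the paper's exact scheme):
# a SimHash-based sampler whose per-item inclusion probability has a
# closed form, so inverse-probability weighting yields an unbiased
# estimate of the partition function Z = sum_x w(x).

K, L = 8, 16                    # hash bits per table, number of tables (assumed)
rng = np.random.default_rng(0)

def simhash(planes, v):
    """K-bit signed-random-projection hash of vector v."""
    return tuple((planes @ v > 0).astype(int))

def inclusion_prob(x, q):
    """P[x collides with q in at least one of L independent K-bit tables]."""
    cos = np.clip(x @ q / (np.linalg.norm(x) * np.linalg.norm(q)), -1.0, 1.0)
    p1 = (1.0 - np.arccos(cos) / np.pi) ** K   # one K-bit SimHash table
    return 1.0 - (1.0 - p1) ** L

def estimate_partition(states, weights, q):
    """Horvitz-Thompson estimate of Z from the union of colliding items."""
    d = q.shape[0]
    sampled = set()
    for _ in range(L):
        planes = rng.standard_normal((K, d))
        buckets = {}
        for i, x in enumerate(states):
            buckets.setdefault(simhash(planes, x), set()).add(i)
        sampled |= buckets.get(simhash(planes, q), set())
    # E[sum of w_i / p_i over sampled i] = sum_i w_i = Z  (unbiased)
    return sum(weights[i] / inclusion_prob(states[i], q) for i in sampled)
```

Because each state's inclusion probability is computable exactly, inverse-probability weighting removes the bias an ad-hoc proposal would introduce, which is the failure mode of plain importance sampling the abstract points to. Intuitively, if q plays the role of the parameter vector of a log-linear model, high-weight states collide more often, so sampling adapts to where the mass is.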

Cited by 13 publications (21 citation statements) · References 6 publications
“…In this paper, we change this. Our work provides a truly constant time adaptive sampling scheme utilizing the recent advances in Locality Sensitive Sampling [14,15]. More impressively, we provide an efficient implementation of our proposal on CPU, which outperforms TensorFlow's implementation of softmax and other negative sampling strategies on some of the best available GPUs (V100) in terms of wall-clock training time.…”
Section: Negative Sampling
Confidence: 93%
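As a companion to the quoted claim, here is a hedged sketch of what constant-time LSH-based negative sampling can look like: class embeddings are preindexed into hash buckets, and the classes colliding with an input's hash are returned as adaptive ("hard") negatives. The single-table design, the class names, and the parameter K are illustrative assumptions, not details of the cited system.

```python
import numpy as np

# Hedged sketch of LSH-based adaptive negative sampling: classes whose
# embeddings collide with the input's hash are returned as negatives.
# Bucket lookup cost is independent of the number of classes.

K = 6                            # hash bits (assumed)
rng = np.random.default_rng(1)

class LSHNegativeSampler:
    def __init__(self, class_embeddings):
        d = class_embeddings.shape[1]
        self.planes = rng.standard_normal((K, d))
        self.buckets = {}
        for cid, w in enumerate(class_embeddings):
            self.buckets.setdefault(self._hash(w), []).append(cid)

    def _hash(self, v):
        return tuple((self.planes @ v > 0).astype(int))

    def negatives(self, input_embedding, true_class):
        # Colliding classes have large inner product with the input,
        # i.e. large softmax scores: informative "hard" negatives.
        bucket = self.buckets.get(self._hash(input_embedding), [])
        return [c for c in bucket if c != true_class]
```

Usage is one preprocessing pass plus a constant-cost lookup per example, e.g. `sampler = LSHNegativeSampler(W)` followed by `sampler.negatives(x, y)` inside the training loop.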
“…In this section, we briefly describe the recent development of using locality sensitive hashing for sampling and estimation [14,15,16,17]. Locality Sensitive Hashing [18,19] is a widely used paradigm for large scale similarity search and nearest neighbor search.…”
Section: LSH-Based Hash Tables
Confidence: 99%
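For readers new to the paradigm this quote references, the standard textbook definition (not specific to the cited papers) is:

```latex
% A hash family \mathcal{H} is (R, cR, p_1, p_2)-sensitive if for all x, y:
d(x,y) \le R  \;\Rightarrow\; \Pr_{h\sim\mathcal{H}}[h(x)=h(y)] \ge p_1, \qquad
d(x,y) \ge cR \;\Rightarrow\; \Pr_{h\sim\mathcal{H}}[h(x)=h(y)] \le p_2,
\quad \text{with } p_1 > p_2.
```

Concatenating K hashes per table and repeating over L independent tables amplifies the gap: the probability of colliding in at least one table is $1 - (1 - p^K)^L$, the quantity the sampling sketches above and below exploit.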
“…Most existing works for reducing the softmax inference complexity are based on post-approximation of a fixed softmax that has been trained in a standard procedure. Locality Sensitive Hashing (LSH) has been demonstrated as a powerful technique under this category [9,15,16,17]. Small world graphs are another powerful technique for this problem [11].…”
Section: Related Work
Confidence: 99%
“…These methods operate offline since efficient adaptive sampling on streaming data is a challenging problem. Recently, locality-sensitive hashing has been used as a fast adaptive sampler for the KDE problem [5,35]. In particular, the hashing-based estimator (HBE) introduced by [5] has strong theoretical guarantees for KDE, even in high dimensions.…”
Section: Related Work
Confidence: 99%
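To make the quoted idea concrete, below is a hedged single-table sketch in the spirit of a hashing-based estimator: sample one point from the query's bucket and importance-weight it by its collision probability. This illustrates why such estimators are unbiased; it is not the HBE construction from [5], and the SimHash family, K, and all names are assumptions.

```python
import numpy as np

# Hedged HBE-style sketch for KDE(q) = (1/n) * sum_i kernel(x_i, q):
# one LSH table, one sampled colliding point, importance-weighted
# by its closed-form collision probability.

K = 4                            # hash bits (assumed)
rng = np.random.default_rng(2)

def make_simhash(d):
    planes = rng.standard_normal((K, d))
    return lambda v: tuple((planes @ v > 0).astype(int))

def collision_prob(x, q):
    cos = np.clip(x @ q / (np.linalg.norm(x) * np.linalg.norm(q)), -1.0, 1.0)
    return (1.0 - np.arccos(cos) / np.pi) ** K

def hbe_estimate(data, q, kernel):
    n = len(data)
    h = make_simhash(q.shape[0])
    bucket = [x for x in data if h(x) == h(q)]
    if not bucket:
        return 0.0               # empty bucket contributes zero; still unbiased
    x = bucket[rng.integers(len(bucket))]
    # E[(|B|/n) * kernel(x,q) / p(x,q)] = (1/n) * sum_i kernel(x_i, q)
    return (len(bucket) / n) * kernel(x, q) / collision_prob(x, q)
```

In the actual HBE line of work the hash family is chosen so that the collision probability roughly tracks the kernel, which is what controls variance; unbiasedness, as the final comment indicates, holds for any family with computable collision probabilities.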
“…While LSH was originally introduced for the high-dimensional nearest-neighbor search problem, the technique has also recently been applied to unbiased statistical estimation via adaptive sampling for a variety of functions [5,35]. Our KDE method will use the RACE algorithm, which views LSH as a slightly different kind of statistical estimator [25].…”
Section: Repeated Array-of-Counts Estimator (RACE)
Confidence: 99%
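Finally, a hedged sketch of the array-of-counts idea named in the quote: each of R repetitions hashes every point into a small counter array, and the mean counter value at the query's hash position estimates the sum of collision probabilities, which acts as a kernel sum when the collision probability plays the role of the kernel. The sizes, names, and SimHash choice below are assumptions for illustration, not the exact RACE of [25].

```python
import numpy as np

# Hedged RACE-style sketch: R independent K-bit hashes, each indexing
# a row of counters. query(q) estimates sum_i Pr[h(x_i) = h(q)].

R, K = 50, 2                     # repetitions, hash bits (assumed; width 2**K)
rng = np.random.default_rng(3)

class RACE:
    def __init__(self, d):
        self.planes = rng.standard_normal((R, K, d))
        self.counts = np.zeros((R, 2 ** K))

    def _index(self, r, v):
        bits = (self.planes[r] @ v > 0).astype(int)
        return int(bits @ (2 ** np.arange(K)))   # K bits -> column index

    def add(self, x):            # one counter increment per row: O(R) per point
        for r in range(R):
            self.counts[r, self._index(r, x)] += 1

    def query(self, q):          # averaging over rows concentrates the estimate
        return np.mean([self.counts[r, self._index(r, q)] for r in range(R)])
```

After a single streaming pass of `add` over the data, `race.query(q) / n` approximates the average collision-probability kernel at q, using only R small counter arrays rather than the raw points.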