b-bit minwise hashing in practice

Li, Ping; Shrivastava, Anshumali; König, Arnd Christian

doi:10.1145/2532443.2532446

Cited by 5 publications

(5 citation statements)

References 32 publications

“…SimHash generates a single-bit output (only the signs), whereas MinHash generates an integer value. The recently proposed b-bit minwise hashing [22] provides a simple strategy to generate an informative single-bit output from MinHash, by using the parity of MinHash values:…”

Section: b-Bit Minwise Hashing (mentioning)

confidence: 99%
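
The parity idea in the statement above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration rather than something taken from the cited papers: the helper names (`mix64`, `minhash_signature`, `lowest_bits`, `estimate_resemblance`), the use of a SplitMix64-style mixer in place of true random permutations, and the chance-collision correction of 1/2^b, which is the approximation usually quoted for sets that are tiny relative to the universe.

```python
import random

MASK64 = (1 << 64) - 1

def mix64(x):
    """SplitMix64-style finalizer: a bijective mixer on 64-bit integers."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def minhash_signature(items, seeds):
    """One MinHash value per seed: the smallest mixed hash over the set."""
    return [min(mix64(x ^ s) for x in items) for s in seeds]

def lowest_bits(signature, b=1):
    """Keep only the lowest b bits of each MinHash value (b=1 is the parity)."""
    mask = (1 << b) - 1
    return [v & mask for v in signature]

def estimate_resemblance(bits_a, bits_b, b=1):
    """Correct the raw match rate for chance collisions (sparse-data assumption)."""
    match_rate = sum(x == y for x, y in zip(bits_a, bits_b)) / len(bits_a)
    chance = 2.0 ** -b
    return (match_rate - chance) / (1.0 - chance)

rng = random.Random(0)
seeds = [rng.getrandbits(64) for _ in range(1000)]
set_a = set(range(0, 200))
set_b = set(range(100, 300))  # true Jaccard resemblance is 100 / 300
bits_a = lowest_bits(minhash_signature(set_a, seeds), b=1)
bits_b = lowest_bits(minhash_signature(set_b, seeds), b=1)
print(estimate_resemblance(bits_a, bits_b, b=1))
```

With a single bit per hash, two unrelated sets still agree about half the time, which is why the raw match rate is rescaled before being read as a resemblance estimate.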

“…For example, the paper on Conditional Random Sampling (CRS) [19] showed that random projections can be very inaccurate especially in binary data, for the task of inner product estimation (which is not the same as near neighbor search). A more recent paper [26] empirically demonstrated that b-bit minwise hashing [22] outperformed SimHash and spectral hashing [30].…”

Section: Introduction (mentioning)

confidence: 99%

In Defense of MinHash Over SimHash

Shrivastava,

2014

Preprint

Self Cite

MinHash and SimHash are the two widely adopted Locality Sensitive Hashing (LSH) algorithms for large-scale data processing applications. Deciding which LSH to use for a particular problem at hand is an important question, which has no clear answer in the existing literature. In this study, we provide a theoretical answer (validated by experiments) that MinHash virtually always outperforms SimHash when the data are binary, as is common in practice, for example in search. We believe the results in this paper will provide valuable guidelines for search in practice, especially when the data are sparse.
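
For readers comparing the two schemes on binary data, the following Python sketch computes the quantity each one is sensitive to. It relies on the standard textbook collision probabilities (the Jaccard resemblance for MinHash, and one minus the angle over pi for SimHash), which the abstract itself does not state; the function names are hypothetical, and the sketch does not reproduce the paper's argument.

```python
import math

def jaccard(a, b):
    """Resemblance of two non-empty sets: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b)

def cosine_binary(a, b):
    """Cosine similarity of the 0/1 indicator vectors of two non-empty sets."""
    return len(a & b) / math.sqrt(len(a) * len(b))

def minhash_collision_prob(a, b):
    """Probability that one MinHash value agrees for the two sets."""
    return jaccard(a, b)

def simhash_collision_prob(a, b):
    """Probability that one SimHash sign bit agrees: 1 - angle / pi."""
    cos = min(1.0, cosine_binary(a, b))  # guard against rounding above 1
    return 1.0 - math.acos(cos) / math.pi

a = {1, 2, 3, 4}
b = {3, 4, 5, 6}
print(minhash_collision_prob(a, b), simhash_collision_prob(a, b))
```

The two probabilities are not directly comparable as similarity scores; each is a different monotone function of a different similarity measure, which is part of why the choice between the schemes needs a dedicated analysis.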

“…In practice, one would not store the entire matrix of signs nor all the random permutations. In an implementation, hash functions [Carter and Wegman, 1979] would be used to create the matrix S deterministically, though it is beyond the scope of this paper to go into the details; see Li et al [2013] for more information and further computational improvements. With this approach, S would be created row-by-row, and only a single observation from X would need to be kept in memory at any one time.…”

Section: Construction Of S With B-bit Min-wise Hashing and Binary Var... (mentioning)

confidence: 99%
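
The row-by-row construction mentioned in this statement might look roughly like the Python sketch below. It is an illustration under stated assumptions, not the construction from either paper: the generator `sign_rows`, the SplitMix64-style mixer `mix64`, and the rule that turns the parity of each minimum hash into a ±1 entry are all hypothetical choices. Each observation is assumed to arrive as an iterable of its non-zero column indices (non-negative integers below 2^64).

```python
import random

MASK64 = (1 << 64) - 1

def mix64(x):
    """SplitMix64-style finalizer used as a deterministic hash function."""
    x = (x + 0x9E3779B97F4A7C15) & MASK64
    x = ((x ^ (x >> 30)) * 0xBF58476D1CE4E5B9) & MASK64
    x = ((x ^ (x >> 27)) * 0x94D049BB133111EB) & MASK64
    return x ^ (x >> 31)

def sign_rows(observations, num_hashes, seed=0):
    """Yield one row of +1/-1 entries per observation, one observation at a time.

    Only the seeds are stored; no permutation or sign matrix is materialised.
    """
    rng = random.Random(seed)
    seeds = [rng.getrandbits(64) for _ in range(num_hashes)]
    for nonzero_cols in observations:
        cols = list(nonzero_cols)
        if not cols:
            # An all-zero observation has no minimum; emit a zero row instead.
            yield [0] * num_hashes
            continue
        # Parity of the minimum hash decides the sign of each entry.
        yield [1 if min(mix64(j ^ s) for j in cols) & 1 else -1 for s in seeds]

stream = iter([[0, 5, 9], [2], [0, 5, 9], []])
for row in sign_rows(stream, num_hashes=8, seed=1):
    print(row)
```

Because the generator consumes its input lazily, only the current observation and the list of seeds need to be in memory, and rerunning with the same seed reproduces the same rows.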

“…The empirical performance of regression and classification procedures following b-bit min-wise hashing [Li et al, , 2013] is particularly impressive. Existing theory on b-bit min-wise hashing has focused on the variance and bias in the approximation of the kernel.…”

Section: Introduction (mentioning)

confidence: 99%

On b-bit min-wise hashing for large-scale regression and classification with sparse data

Shah,

Meinshausen

2013

Preprint

Large-scale regression problems where both the number of variables, p, and the number of observations, n, may be large and on the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns and then work with this compressed data. b-bit min-wise hashing [Li and ...] is a promising dimension reduction scheme for sparse matrices which produces a set of random features such that regression on the resulting design matrix approximates a kernel regression with the resemblance kernel. In this work, we derive bounds on the prediction error of such regressions. For both linear and logistic models, we show that the average prediction error vanishes asymptotically as long as $q\|\beta^*\|_2^2/n \to 0$, where q is the average number of non-zero entries in each row of the design matrix and $\beta^*$ is the coefficient of the linear predictor. We also show that ordinary least squares or ridge regression applied to the reduced data can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied in order for the signal to be linear in the predictors.
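
One way to see how b-bit hashes can serve as random features for a linear model is the one-hot expansion sketched below in Python. This is an encoding commonly described in the b-bit minwise hashing literature; whether it matches the exact construction analysed in this paper is an assumption, and the names `expand_one_hot` and `dot` are hypothetical. The point of the sketch is that the inner product of two expanded vectors counts the positions where the b-bit values agree, which is the quantity that tracks the resemblance.

```python
def expand_one_hot(bbit_values, b):
    """Map k b-bit hash values to a 0/1 vector of length k * 2**b.

    Position j contributes a block of 2**b entries with a single 1
    at the offset given by the j-th b-bit value.
    """
    width = 1 << b
    features = [0] * (width * len(bbit_values))
    for j, v in enumerate(bbit_values):
        features[j * width + v] = 1
    return features

def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))

# Two hypothetical 2-bit signatures of length 4; they agree at positions 0 and 2.
sig_u = [0, 3, 1, 2]
sig_v = [0, 1, 1, 3]
print(dot(expand_one_hot(sig_u, 2), expand_one_hot(sig_v, 2)))
```

Each expanded row has exactly k non-zero entries, so the reduced design matrix stays sparse even though it has k * 2^b columns.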

“…Many methods of document representation based on TF-IDF can construct a Vector Space Model (VSM) of a text corpus. Similarly, many methods of document representation exploit statistical term measures, such as BoW (Bag-of-Words) [3] and minwise hashing [4]. For document representation, these methods are perceived as statistical methods of feature extraction.…”

Section: Introduction (mentioning)

confidence: 99%