Proceedings of the 2013 Conference on the Theory of Information Retrieval (2013)
DOI: 10.1145/2499178.2499189

Efficient Nearest-Neighbor Search in the Probability Simplex

Abstract: Document similarity tasks arise in many areas of information retrieval and natural language processing. A fundamental question when comparing documents is which representation to use. Topic models, which have served as versatile tools for exploratory data analysis and visualization, represent documents as probability distributions over latent topics. Systems comparing topic distributions thus use measures of probability divergence such as Kullback-Leibler, Jensen-Shannon, or Hellinger. This paper presents novel…

Cited by 14 publications (8 citation statements)
References 32 publications
“…But only few methods handle density metrics in a simplex space. A first approach transformed the He divergence into an Euclidean distance so that existing ANN techniques, such as LSH and k-d tree, could be applied [25]. But this solution does not consider the special attributions of probability distributions, such as Non-negative and Sum-equal-one.…”
Section: Hashing Topic Distributions (mentioning)
confidence: 99%
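
The transformation this statement describes can be illustrated with a minimal sketch. It assumes the common definition of the Hellinger distance with a 1/√2 normalization, under which the distance between two distributions equals the Euclidean distance between their element-wise square roots, scaled by 1/√2. The function names (`hellinger`, `sqrt_embed`, `euclidean`, `nearest`) are hypothetical and not taken from the paper, and the brute-force search only stands in for an ANN index such as LSH or a k-d tree.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (1/sqrt(2) normalization assumed)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))

def sqrt_embed(p):
    """Map a distribution to its element-wise square root, a point on the unit sphere."""
    return [math.sqrt(a) for a in p]

def euclidean(x, y):
    """Ordinary Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def nearest(query, corpus):
    """Index of the corpus entry closest to the query in the embedded space.

    Brute force is used here for clarity; any Euclidean ANN structure
    could index the embedded vectors instead.
    """
    q = sqrt_embed(query)
    embedded = [sqrt_embed(d) for d in corpus]
    return min(range(len(corpus)), key=lambda i: euclidean(q, embedded[i]))
```

Because the embedding changes distances only by a constant factor, the nearest neighbor under Euclidean distance in the embedded space is also the nearest neighbor under Hellinger distance.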
“…if the corpus size is equal to 1000 elements, only the top 5 most similar documents are considered relevant for a given document). This value has been considered after reviewing datasets used in similar experiments [25,31]. In those experiments, the reference data is obtained from existing categories, and the minimum average between corpus size and categorized documents is around 0.5%.…”
Section: Retrieving Similar Documents (mentioning)
confidence: 99%
“…Jensen Divergence works with exponential kernel which is convex function, and it has a close relationship with Hellinger distance and Triangle Measures [29][30][31]. Jensen Divergence (Js ), Hellinger (He ) and Triangle (T) measure are given in Eqs.…”
Section: The Discrete Wavelet Transform For Feature Extraction (mentioning)
confidence: 99%
“…Therefore the relationship between Jensen-Shannon, Hellinger and Triangle measures [32] can be expressed by: …”
Section: The Discrete Wavelet Transform For Feature Extraction (mentioning)
confidence: 99%
“…Additionally, several hashing schemes have been defined for alternative measures of probability distributions such as the Chi-squared distance (12), and the Hellinger distance (13). Also, Mu and Yan (25) proposed a family of LSH functions for dealing with non-metric distances, but they considered a symmetric version of the KL divergence.…”
Section: Locality Sensitive Hashing For Related Distances (mentioning)
confidence: 99%