Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 2013
DOI: 10.1145/2505515.2505526

Effective measures for inter-document similarity

Abstract: While supervised learning-to-rank algorithms have largely supplanted unsupervised query-document similarity measures for search, the exploration of query-document measures by many researchers over many years produced insights that might be exploited in other domains. For example, the BM25 measure substantially and consistently outperforms cosine across many tested environments, and potentially provides retrieval effectiveness approaching that of the best learning-to-rank methods over equivalent feature sets. …

Cited by 45 publications (11 citation statements). References 23 publications.
“…VSM was employed by many scholarly article retrieval systems (e.g., [17]). However, cosine similarity did not always perform well [18,19]. Latent Semantic Analysis (LSA) was a typical technique that employed singular value decomposition to improve the vector representation of scholarly articles [20,21].…”
Section: Content-based Similarity Measures
confidence: 99%
“…OK was developed based on BM25 [22], which was one of the best techniques for finding related scholarly articles [19]. It was shown to be one of the best techniques to identify related articles as well [18]. OK employs Equation (4) to estimate the similarity between two articles, a 1 and a 2 , where k 1 and b are two parameters, |a| is the number of terms in article a (i.e., length of a), avgal is the average number of terms in an article (i.e., average length of articles), TF(t,a) is the frequency of term t appearing in article a, and IDF(t) is the inverse document frequency of term t, which measures how rarely t appears in a large collection of articles.…”
Section: Content-based Similarity Measures
confidence: 99%
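The quoted description of OK — a symmetric, BM25-based similarity over the shared terms of two articles, with parameters k1 and b, article lengths, and average length — can be sketched as below. This is an illustrative reconstruction under the stated definitions, not a reproduction of the citing paper's Equation (4); the function names (`bm25_tf`, `ok_similarity`) and the exact IDF variant are assumptions.

```python
import math

def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
    # Saturated, length-normalized term frequency (BM25-style component).
    return tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

def idf(term, collection):
    # Inverse document frequency: penalizes terms common across the collection.
    n = sum(1 for doc in collection if term in doc)
    return math.log((len(collection) - n + 0.5) / (n + 0.5) + 1)

def ok_similarity(a1, a2, collection, k1=1.2, b=0.75):
    # Symmetric BM25-style similarity between two articles (token lists):
    # for each term shared by a1 and a2, multiply its IDF by the saturated
    # term frequency in each article, and sum over shared terms.
    avg_len = sum(len(d) for d in collection) / len(collection)
    score = 0.0
    for t in set(a1) & set(a2):
        score += (idf(t, collection)
                  * bm25_tf(a1.count(t), len(a1), avg_len, k1, b)
                  * bm25_tf(a2.count(t), len(a2), avg_len, k1, b))
    return score
```

Because both articles pass through the same length normalization, the measure is symmetric: `ok_similarity(a1, a2, c)` equals `ok_similarity(a2, a1, c)`, unlike the asymmetric query-document form of BM25.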
“…Probably the most common text similarity measure is the cosine similarity of weighted term vectors, which has also been used frequently in the Web table recovery literature. Besides this measure, we also consider two alternative measures, proposed by (Whissell and Clarke, 2013), which represent symmetric variants of popular retrieval functions. In detail, we consider the following text similarity measures in our study: The cosine similarity Cosine TF-IDF represents a similarity measure frequently used in connection with the vector space model, which models texts or documents as vectors of term weights, one for each term in a dictionary (Salton and Buckley, 1988).…”
Section: Similarity Metrics
confidence: 99%
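The Cosine TF-IDF measure described in the quote — documents modeled as vectors of term weights, compared by the cosine of the angle between them — can be sketched as below. The raw-TF × log-IDF weighting is one common variant of the Salton-Buckley family, chosen here for brevity; the function names are assumptions.

```python
import math
from collections import Counter

def tfidf_vector(doc, collection):
    # TF-IDF weight for each term in one document (token list),
    # with IDF computed over a collection of token lists.
    n_docs = len(collection)
    tf = Counter(doc)
    return {t: f * math.log(n_docs / sum(1 for d in collection if t in d))
            for t, f in tf.items()}

def cosine(v1, v2):
    # Cosine of the angle between two sparse weight vectors (dicts).
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0
```

Note that a term appearing in every document gets IDF = log(1) = 0 and drops out of the comparison, which is the mechanism by which TF-IDF discounts uninformative terms.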