Pictures of relevance: A geometric analysis of similarity measures

Jones, William; Furnas, George W.

doi:10.1002/(sici)1097-4571(198711)38:6<420::aid-asi3>3.0.co;2-s

Cited by 201 publications

(90 citation statements)

References 13 publications


“…The advantage of the cosine being not a statistic but a similarity measure then disappears. Formally, these two measures are equivalent, with the exception that Pearson normalizes for the arithmetic mean while the cosine does not use this mean as a parameter (Jones & Furnas, 1987). The cosine normalizes for the geometrical mean.…”
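The relationship described in this statement can be checked numerically. The following is a minimal sketch, not taken from any of the cited papers: the function names and the citation counts are hypothetical. It treats the Pearson correlation as the cosine of the two vectors after each has been centred on its arithmetic mean, which is the sense in which Pearson "normalizes for the arithmetic mean".

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    """Pearson r, computed as the cosine of the mean-centred vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return cosine([a - mean_x for a in x], [b - mean_y for b in y])

# Hypothetical citation counts for two authors across five citing papers.
author_a = [1, 2, 3, 4, 5]
author_b = [2, 1, 4, 3, 5]
# Adding a constant shifts the mean of author_b without changing its shape.
shifted_b = [b + 10 for b in author_b]

print(cosine(author_a, author_b), cosine(author_a, shifted_b))
print(pearson(author_a, author_b), pearson(author_a, shifted_b))
```

In this sketch the Pearson value is unaffected by the constant shift, while the cosine changes, because only Pearson subtracts the mean before comparing the vectors.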

Section: Introduction (mentioning)

confidence: 99%

On the normalization and visualization of author co‐citation data: Salton's Cosine versus the Jaccard index

Leydesdorff

2007

J. Am. Soc. Inf. Sci.



The debate about which similarity measure one should use for the normalization in the case of Author Co-citation Analysis (ACA) is further complicated when one distinguishes between the symmetrical co-citation (or, more generally, co-occurrence) matrix and the underlying asymmetrical citation (occurrence) matrix. In the Web environment, the approach of retrieving original citation data is often not feasible. In that case, one should use the Jaccard index, but preferably after adding the number of total citations (i.e., occurrences) on the main diagonal. Unlike Salton's cosine and the Pearson correlation, the Jaccard index abstracts from the shape of the distributions and focuses only on the intersection and the sum of the two sets. Since the correlations in the co-occurrence matrix may be spurious, this property of the Jaccard index can be considered an advantage in this case.
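The use of the Jaccard index on a co-occurrence matrix can be sketched as follows. This assumes the standard set definition of the index (intersection divided by union), with the total occurrences of each item stored on the main diagonal as the abstract recommends; the abstract itself does not spell out the formula, and the function name and the matrix values are hypothetical.

```python
def jaccard_from_cooccurrence(matrix, i, j):
    """Jaccard index for items i and j of a symmetric co-occurrence matrix.

    Assumes matrix[i][i] holds the total number of occurrences of item i
    and matrix[i][j] holds the number of co-occurrences of items i and j.
    """
    intersection = matrix[i][j]
    union = matrix[i][i] + matrix[j][j] - intersection
    return intersection / union if union else 0.0

# Hypothetical co-citation counts for three authors, totals on the diagonal.
cooc = [
    [10, 4, 0],
    [4, 8, 2],
    [0, 2, 5],
]

for i, j in [(0, 1), (0, 2), (1, 2)]:
    print(i, j, jaccard_from_cooccurrence(cooc, i, j))
```

Only the two totals and the overlap enter the calculation, so the result does not depend on how each author's citations are distributed over the remaining authors.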


“…Van Rijsbergen, 1979; Jones & Furnas, 1987; Ellis et al., 1993; Rorvig, 1999), and the choice of a specific measure may influence the outcome of the calculations. Van Rijsbergen (1979) advised against the use of any measure that is not normalised by the length of the document vectors, something that was experimentally verified by Willett (1983).…”
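The effect of length normalisation mentioned in this statement can be illustrated with a small sketch. It is an assumption-laden example rather than the procedure of any cited study: the term-weight vectors and function names are hypothetical, and the cosine is used as one example of a length-normalised measure.

```python
import math

def inner_product(x, y):
    """Unnormalised similarity: grows with the length of either vector."""
    return sum(a * b for a, b in zip(x, y))

def length_normalised(x, y):
    """Inner product divided by the Euclidean lengths of both vectors."""
    return inner_product(x, y) / (
        math.sqrt(inner_product(x, x)) * math.sqrt(inner_product(y, y))
    )

# Hypothetical term weights; doc_twice is doc concatenated with itself,
# so every term weight doubles while the content stays the same.
query = [1, 1, 0, 0]
doc = [2, 1, 0, 1]
doc_twice = [2 * w for w in doc]

print(inner_product(query, doc), inner_product(query, doc_twice))
print(length_normalised(query, doc), length_normalised(query, doc_twice))
```

Here the unnormalised score doubles for the longer document, whereas the length-normalised score stays the same.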

Section: Introduction (mentioning)

confidence: 99%

Query-Sensitive Similarity Measures for Information Retrieval

Tombros

Rijsbergen

2004

Knowledge and Information Systems


Abstract. The application of document clustering to information retrieval has been motivated by the potential effectiveness gains postulated by the cluster hypothesis. The hypothesis states that relevant documents tend to be highly similar to each other, and therefore tend to appear in the same clusters. In this paper we propose an axiomatic view of the hypothesis, by suggesting that documents relevant to the same query (co-relevant documents) display an inherent similarity to each other which is dictated by the query itself. Because of this inherent similarity, the cluster hypothesis should be valid for any document collection. Our research describes an attempt to devise means by which this similarity can be detected. We propose the use of query-sensitive similarity measures that bias interdocument relationships towards pairs of documents that jointly possess attributes that are expressed in a query. We experimentally tested three query-sensitive measures against conventional ones that do not take the context of the query into account, and we also examined the comparative effectiveness of the three query-sensitive measures. We calculated interdocument relationships for varying numbers of top-ranked documents for six document collections. Our results show a consistent and significant increase in the number of relevant documents that become nearest neighbours of any given relevant document when query-sensitive measures are used. These results suggest that the effectiveness of a cluster-based IR system has the potential to increase through the use of query-sensitive similarity measures.


“…These measures are useful to compute the neighborhood of a point and neighborhood-based measures but not for calculating similarity between a pair of data instances. In the area of information retrieval, Jones et al. [12] and Noreault et al. [13] have studied several similarity measures.…”

Section: Related Work (mentioning)

confidence: 99%

DISC: Data-Intensive Similarity Measure for Categorical Data

Desai

Singh

Pudi

2011

Advances in Knowledge Discovery and Data Mining


Abstract. The concept of similarity is fundamentally important in almost every scientific field. Clustering, distance-based outlier detection, classification, regression and search are major data mining techniques which compute the similarities between instances, and hence the choice of a particular similarity measure can turn out to be a major cause of success or failure of the algorithm. The notion of similarity or distance for categorical data is not as straightforward as for continuous data and hence is a major challenge. This is because the different values taken by a categorical attribute are not inherently ordered, so a direct comparison between two categorical values is not possible. In addition, the notion of similarity can differ depending on the particular domain, dataset, or task at hand. In this paper we present a new similarity measure for categorical data, DISC (Data-Intensive Similarity Measure for Categorical Data). DISC captures the semantics of the data without any help from a domain expert for defining the similarity. In addition, it is generic and simple to implement. These desirable features make it a very attractive alternative to existing approaches. Our experimental study compares it with 14 other similarity measures on 24 standard real datasets, of which 12 are used for classification and 12 for regression, and shows that it is more accurate than all its competitors.
