Pairwise-adaptive dissimilarity measure for document clustering

D’hondt, Joris; Vertommen, Joris; Verhaegen, Paul-Armand; Cattrysse, Dirk; Duflou, Joost R.

doi:10.1016/j.ins.2010.02.021

Cited by 30 publications

(12 citation statements)

References 33 publications

“…However, the function quickly approaches an asymptote, limiting the impact of a single term. Although document retrieval and clustering are not identical tasks, there is now enough clustering research to suggest BM25 might aid in document clustering (Bashier and Rauber 2009; de Vries and Geva 2008; Whissell et al 2009; D'hondt et al 2010; Kutty et al 2010). This, coupled with the fact that no thorough analysis on the specific benefits of BM25 in document clustering exists, led us to use BM25 in a clustering experiment similar to our initial experiment discussed Sect.…”

Section: BM25 Based Feature Weighting (mentioning)

confidence: 99%

“…A novel contribution of this paper is our investigation of Okapi BM25 (BM25) feature weighting. Only recently has BM25 been seriously considered in document clustering (de Vries and Geva 2008; Bashier and Rauber 2009; Whissell et al 2009; D'hondt et al 2010; Kutty et al 2010); with works that do use BM25 still being a small minority. Bashier and Rauber (2009) investigate relevance feedback using clustering.…”

Section: Introduction (mentioning)

confidence: 99%

Improving document clustering using Okapi BM25 feature weighting

Whissell

Clarke

2011

Inf Retrieval

We investigate the effect of feature weighting on document clustering, including a novel investigation of Okapi BM25 feature weighting. Using eight document datasets and 17 well-established clustering algorithms we show that the benefit of tf-idf weighting over tf weighting is heavily dependent on both the dataset being clustered and the algorithm used. In addition, binary weighting is shown to be consistently inferior to both tf-idf weighting and tf weighting. We investigate clustering using both BM25 term saturation in isolation and BM25 term saturation with idf, confirming that both are superior to their non-BM25 counterparts under several common clustering quality measures. Finally, we investigate estimation of the k1 BM25 parameter when clustering. Our results indicate that typical values of k1 from other IR tasks are not appropriate for clustering; k1 needs to be higher.

Keywords: Document clustering · Feature weighting · Okapi BM25
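The term saturation mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes the standard BM25 term-frequency component with document-length normalization switched off (b = 0); the function name `bm25_saturation` and the default `k1 = 1.2` are illustrative choices, not values taken from the paper.

```python
def bm25_saturation(tf: float, k1: float = 1.2) -> float:
    """Saturating term-frequency weight of BM25, without length normalization.

    Assumption: this is the textbook form tf * (k1 + 1) / (tf + k1).
    The weight grows with tf but never reaches the asymptote k1 + 1.
    """
    if tf < 0 or k1 < 0:
        raise ValueError("tf and k1 must be non-negative")
    if tf == 0:
        return 0.0
    return tf * (k1 + 1.0) / (tf + k1)


def bm25_weight(tf: float, idf: float, k1: float = 1.2) -> float:
    """Saturated term frequency combined with a caller-supplied idf value."""
    return bm25_saturation(tf, k1) * idf


if __name__ == "__main__":
    # Compare how quickly the weight flattens for a small and a large k1.
    for k1 in (1.2, 20.0):
        weights = [round(bm25_saturation(tf, k1), 3) for tf in (1, 2, 5, 10, 50)]
        print(f"k1={k1}: {weights}")
```

A larger k1 delays saturation, so repeated occurrences of a term keep adding weight for longer, which is the direction the abstract reports as preferable for clustering.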

“…A term or feature can be a single word, multiple words, a phrase or other indexing units [9,10]. The weight of a term represents the importance of it in the relevant document and is assigned by a term weighting scheme [11]. Term frequency (tf) [3], inverse document frequency (idf) [12], or multiplication of tf and idf (tf-idf) [13][14][15] are commonly used term weighting schemes.…”

Section: Page 2 Of 23 (mentioning)

confidence: 99%
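The tf, idf, and tf-idf schemes named in the statement above can be sketched as follows. This is a minimal illustration that assumes raw counts for tf and the unsmoothed `log(N / df)` form for idf; the citing paper may use a different variant, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Return one {term: weight} mapping per tokenized document.

    Assumptions: tf is the raw count of a term in the document and
    idf is log(N / df), where df is the number of documents containing it.
    """
    n_docs = len(docs)
    # Document frequency: count each term once per document.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append(
            {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        )
    return weighted


if __name__ == "__main__":
    corpus = [["cluster", "text", "cluster"], ["text", "retrieval"]]
    for weights in tf_idf(corpus):
        print(weights)
```

Under this variant a term that occurs in every document receives a weight of zero, which is why smoothed idf forms are often preferred in practice.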

“…For instance, Euclidean distance is a geometric measure used to measure the distance between two vectors [18,19]. Cosine similarity compares two documents with respect to the angle between their vectors [11]. Similar to two previous measures, Manhattan distance is also a geometric measure [20,21].…”

Section: Page 2 Of 23 (mentioning)

confidence: 99%
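The three geometric measures named in the statement above can be written out in a few lines. This is a minimal sketch over plain Python lists of equal length; the function names are illustrative.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: list[float], b: list[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    u, v = [1.0, 2.0, 0.0], [2.0, 4.0, 0.0]
    print(euclidean_distance(u, v), manhattan_distance(u, v), cosine_similarity(u, v))
```

Cosine similarity ignores vector length, so a document and a copy with every term count doubled are treated as identical, whereas the Euclidean and Manhattan distances between them are non-zero.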

Pairwise document similarity measure based on present term set

Oghbaie

Zanjireh

2018

J Big Data

Introduction: In text mining, a similarity (or distance) measure is the quintessential way to calculate the similarity between two text documents, and is widely used in various Machine Learning (ML) methods, including clustering and classification. ML methods help learn from enormous collections, known as big data [1,2]. In big data, which includes masses of unstructured data, Information Retrieval (IR) is the dominant form of information access [3]. Among ML methods, classification and clustering help discover patterns and correlations and extract information from large-scale collections [1]. These two techniques also offer benefits to different IR applications. For example, document clustering can be applied to the document collection to improve search speed, precision, and recall or to the search results to provide more effective information presentation to the user [3]. Document classification is also used in vertical search engines [4] and sentiment detection [5].

In large-scale collections, one of the challenging issues is to identify documents with high similarity values, known as near-duplicate documents (or near-duplicates) [6][7][8]. Integration of heterogeneous collections, storing multiple copies of the same document, and plagiarism are the main causes for the existence of near-duplicates. These documents increase processing overheads and storage. Detecting and filtering near-duplicates…

Abstract: Measuring pairwise document similarity is an essential operation in various text mining tasks. Most of the similarity measures judge the similarity between two documents based on the term weights and the information content that two documents share in common. However, they are insufficient when there exist several documents with an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms appearing in at least one of the two documents. The effectiveness of our measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicates detection. The performance of our measure is compared with that of some popular measures. The experimental results showed that our proposed similarity measure yields more accurate results.

“…Pairwise-adaptive similarity dynamically selects the number of features prior to every similarity measurement. Based on this method, a relevant subset of terms is selected that will contribute to the measured distance between both related vectors [30].…”

Section: Related Work (mentioning)

confidence: 99%
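One way to read the statement above is sketched below in Python. The excerpt does not say how the term subset is chosen or which base measure is used, so two assumptions are made here and should be treated as such: the subset is the union of each document's `k` highest-weighted terms, and the dissimilarity is one minus the cosine similarity restricted to that subset. The function names and the parameter `k` are hypothetical.

```python
import math


def top_k_terms(doc: dict[str, float], k: int) -> set[str]:
    """The k highest-weighted terms of a document (ties broken alphabetically)."""
    ranked = sorted(doc, key=lambda term: (-doc[term], term))
    return set(ranked[:k])


def pairwise_adaptive_dissimilarity(
    a: dict[str, float], b: dict[str, float], k: int
) -> float:
    """1 - cosine similarity, computed only on terms selected for this pair.

    Assumption: the selected subset is the union of each document's top-k terms.
    """
    terms = top_k_terms(a, k) | top_k_terms(b, k)
    dot = sum(a.get(t, 0.0) * b.get(t, 0.0) for t in terms)
    norm_a = math.sqrt(sum(a.get(t, 0.0) ** 2 for t in terms))
    norm_b = math.sqrt(sum(b.get(t, 0.0) ** 2 for t in terms))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # No usable overlap: treat the pair as maximally dissimilar.
    return 1.0 - dot / (norm_a * norm_b)


if __name__ == "__main__":
    doc_a = {"cluster": 3.0, "text": 1.0, "noise": 0.1}
    doc_b = {"cluster": 3.0, "text": 1.0, "other": 0.1}
    print(pairwise_adaptive_dissimilarity(doc_a, doc_b, k=2))
    print(pairwise_adaptive_dissimilarity(doc_a, doc_b, k=3))
```

Because the term subset is recomputed for every pair, low-weight terms outside both documents' top `k` never contribute to the measured distance for that pair.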