2019
DOI: 10.48550/arxiv.1902.03402
Preprint

A new simple and effective measure for bag-of-word inter-document similarity measurement

Sunil Aryal,
Kai Ming Ting,
Takashi Washio
et al.

Abstract: To measure the similarity of two documents in the bag-of-words (BoW) vector representation, different term weighting schemes are used to improve the performance of cosine similarity, the most widely used inter-document similarity measure in text mining. In this paper, we identify the shortcomings of the underlying assumptions of term weighting in the inter-document similarity measurement task and provide a more fit-to-the-purpose alternative. Based on this new assumption, we introduce a new simple but effectiv…

Citations: cited by 1 publication (1 citation statement)
References: 12 publications
"…For this task, we calculate similarities based on the Weighted Jaccard Index (WJI) [6], using the analogy with the task of finding weighted similarity between the text documents, as described in e.g. [4]. We encode recommendation lists as "bag-of-items" (BOI) vectors x = (x_i)_{i=1}^{N}, where x_i denotes the rank (position in the list) of item i and N is the size of entire item catalog.…"
Section: Estimating Recommendations Stability
Citation type: mentioning
Confidence: 99%
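
As a rough illustration of the computation the citation statement describes, the sketch below encodes two ranked lists as bag-of-items vectors and compares them with a weighted Jaccard similarity. It rests on assumptions the excerpt does not confirm: the common sum-of-minima over sum-of-maxima definition of the index (the citing paper's reference [6] may define it differently), an entry of 0 for items absent from a list, and the hypothetical helper names `encode_ranks` and `weighted_jaccard`.

```python
from typing import List, Sequence


def encode_ranks(ranked_items: Sequence[int], catalog_size: int) -> List[float]:
    """Hypothetical bag-of-items encoding: x[i] is the 1-based rank of item i.

    Assumption: items that do not appear in the list get 0.
    """
    x = [0.0] * catalog_size
    for rank, item in enumerate(ranked_items, start=1):
        x[item] = float(rank)
    return x


def weighted_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Weighted Jaccard similarity, assumed here to be sum(min) / sum(max)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    if any(v < 0 for v in x) or any(v < 0 for v in y):
        raise ValueError("entries must be non-negative")
    numerator = sum(min(a, b) for a, b in zip(x, y))
    denominator = sum(max(a, b) for a, b in zip(x, y))
    # Convention chosen for this sketch: two all-zero vectors count as identical.
    return numerator / denominator if denominator > 0 else 1.0


# Two recommendation lists over a catalog of 5 items (item ids 0..4).
list_a = encode_ranks([0, 1, 2], catalog_size=5)  # [1, 2, 3, 0, 0]
list_b = encode_ranks([1, 0, 3], catalog_size=5)  # [2, 1, 0, 3, 0]
similarity = weighted_jaccard(list_a, list_b)     # (1 + 1) / (2 + 2 + 3 + 3) = 0.2
```

Because raw ranks are used as weights, items further down a list contribute more to both sums; whether the citing paper transforms ranks before comparison is not stated in the excerpt.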