Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004
DOI: 10.1145/1014052.1014127

Improved robustness of signature-based near-replica detection via lexicon randomization

Abstract: Detection of near duplicate documents is an important problem in many data mining and information filtering applications. When faced with massive quantities of data, traditional duplicate detection techniques relying on direct inter-document similarity computation (e.g., using the cosine measure) are often not feasible given the time and memory performance constraints. On the other hand, fingerprint-based methods, such as I-Match, are very attractive computationally but may be brittle with respect to …

Cited by 56 publications
(40 citation statements)
References 12 publications
“…Specifically, every document is compared to all others in the dataset and the similarity between each pair is calculated [5].…”
Section: Related Work (mentioning)
confidence: 99%
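
The exhaustive comparison described in the statement above can be sketched as follows. This is a minimal illustration, assuming bag-of-words term counts and the cosine measure named in the abstract; the function names `cosine` and `all_pairs` are hypothetical and are not taken from the cited papers. It makes the quadratic cost visible: every one of the n(n-1)/2 document pairs is scored.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def all_pairs(docs):
    """Score every unordered pair of documents: O(n^2) comparisons."""
    vectors = [Counter(doc.lower().split()) for doc in docs]
    return {
        (i, j): cosine(vectors[i], vectors[j])
        for i, j in combinations(range(len(docs)), 2)
    }


if __name__ == "__main__":
    scores = all_pairs(["near duplicate detection",
                        "near duplicate detection",
                        "unrelated text here"])
    for pair, score in sorted(scores.items()):
        print(pair, round(score, 3))
```

For a collection of n documents the dictionary holds n(n-1)/2 entries, which is why the abstract describes direct similarity computation as often infeasible for massive collections.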
“…In one example in the fingerprinting context, some frequently occurring shingles are eliminated [5]. In this study, the shingles technique considers a document as a stream of tokens, which is broken into overlapping or nonoverlapping segments referred to as shingles.…”
Section: Related Work (mentioning)
confidence: 99%
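
The shingling idea quoted above (a token stream broken into overlapping or nonoverlapping segments) can be sketched as below. This is an illustrative sketch under stated assumptions: whitespace tokenization, a shingle length `k` of 3, and Jaccard overlap of shingle sets as the comparison. The names `shingles` and `jaccard` are hypothetical, and the elimination of frequently occurring shingles mentioned in the statement is not implemented here.

```python
def shingles(text, k=3, overlapping=True):
    """Break a token stream into length-k segments (shingles).

    Overlapping shingles advance one token at a time; nonoverlapping
    shingles advance k tokens at a time, so trailing tokens that do not
    fill a whole shingle are dropped.
    """
    tokens = text.lower().split()
    step = 1 if overlapping else k
    return [tuple(tokens[i:i + k])
            for i in range(0, len(tokens) - k + 1, step)]


def jaccard(a, b):
    """Jaccard overlap between two collections of shingles."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two documents with no shingles are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    doc1 = "the quick brown fox jumps over the lazy dog"
    doc2 = "the quick brown fox leaps over the lazy dog"
    print(shingles(doc1, k=3, overlapping=False))
    print(jaccard(shingles(doc1), shingles(doc2)))
```

A single substituted word changes up to `k` overlapping shingles, so the overlap score degrades gradually rather than dropping to zero as an exact whole-document hash would.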
“…The JRC sub-corpus amounts to 10,000 documents for each language, PAN sub-corpus contains 2920 en-es and 2222 en-de document pairs and Wiki sub-corpus contains 10,000 documents for each language. The partitions of the JRC-Acquis and Wikipedia sub-collections used in the experiments are publicly available. Our complete test collection includes 70,282 documents.…”
Section: Datasets (mentioning)
confidence: 99%
“…The former refers to the technology of duplicate identification for Web search indexing, also known as near-duplicate detection; whereas the latter corresponds to high similarity search for text classification, document clustering, plagiarism detection and retrieval by example. This problem is well studied for the monolingual variant and the most popular approaches are related to shingling [1], and the majority of research is based on the selection of a representative signature for the documents in question [2][3][4].…”
Section: Introduction (mentioning)
confidence: 99%
“…Detecting and eliminating replicated documents is recognized as one of the central problems for search engines [14], [9], [17], [47]. Notice that this is a typical situation of the separation effect: Almost all distances in the range [1/2, 1], since 50% similarity is considered to be threshold for duplicates and thus all "original" documents have over one half distances between each other.…”
Section: Combinatorial Algorithms (mentioning)
confidence: 99%