2004
DOI: 10.1007/978-3-540-30213-1_6

A Scalable System for Identifying Co-derivative Documents

Abstract: Documents are co-derivative if they share content: for two documents to be co-derived, some portion of one must be derived from the other or some portion of both must be derived from a third document. The current technique for concurrently detecting all co-derivatives in a collection is document fingerprinting, which matches documents based on the hash values of selected document subsequences, or chunks. Fingerprinting is currently hampered by an inability to accurately isolate information that is us…

Citations: cited by 44 publications (48 citation statements)
References: 14 publications
“…One of the most popular downsampling techniques is 0 mod p [12]. When indexing a document, this technique hashes each n-gram to an integer and then determines whether the hash value is divisible by some integer p. Since instances of the same n-gram in all documents hash to the same value, the algorithm does not need to maintain a dictionary of the downsampled features.…”
Section: Downsampling Document Features (mentioning)
confidence: 99%
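
The 0 mod p selection described in the statement above can be illustrated with a minimal sketch. The word-level tokenization, the SHA-1 based hash, and the names `ngrams`, `stable_hash`, and `zero_mod_p` are assumptions made for illustration; they are not taken from the cited papers.

```python
import hashlib


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def stable_hash(text):
    """Map a string to a non-negative integer that is identical across runs.

    Assumption: the first 8 bytes of a SHA-1 digest stand in for whatever
    hash function a real system would use. Python's built-in hash() is
    avoided because it is randomized per process for strings.
    """
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def zero_mod_p(text, n=3, p=4):
    """Keep only the n-grams whose hash value is divisible by p."""
    tokens = text.lower().split()
    return {g for g in ngrams(tokens, n) if stable_hash(g) % p == 0}


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "a quick brown fox jumps over a sleeping cat"

# The keep/drop decision depends only on the n-gram itself, so an n-gram
# shared by both documents is either kept in both or dropped in both.
# No shared dictionary of selected features is required.
print(zero_mod_p(doc_a) & zero_mod_p(doc_b))
```

With a well-mixed hash, roughly one n-gram in p is retained, so p trades index size against the chance that a shared passage leaves no retained n-gram.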
“…Similar work, in a mono-lingual environment, involves the identification of redundant [4] and co-derivative [3] documents, using fingerprinting techniques. Fingerprints are compact representations of text chunks.…”
Section: Related Work (mentioning)
confidence: 99%
“…To see the effect of this fact, we investigated two other ways to estimate sentence length and used them instead of the default method, which was number of tokens. One is sum of the term frequency in the document for each term in the sentence and the other one, the sum of their selectivity (inverse sentence frequency). Both methods produced different results for all the runs, however, they were most of the times slightly worse than the number of tokens, and in general the differences were negligible.…”
Section: Text Fragment Alignment Evaluation (mentioning)
confidence: 99%
“…rare chunks (Heintze (1996)): c occurs once in D; SPEX (Bernstein and Zobel (2004)); shingling (Broder (2000)): c ∈ {c1, …”
Section: Dimensionality Reduction By Embedding (mentioning)
confidence: 99%