2013
DOI: 10.1007/978-3-642-41230-1_24

Near Duplicate Text Detection Using Frequency-Biased Signatures

Cited by 8 publications (10 citation statements)
References 21 publications
“…The specific implementation we use considers only candidate windows of size w, and our overlap constraints are converted into corresponding equivalent Jaccard constraints. • FBW is a Winnowing-family algorithm [31] which returns approximate answers to the problem of finding documents that share w − q + 1 consecutive token q-grams while tolerating qτ errors, where q is the q-gram length. We use its fingerprinting scheme to generate candidates and they are verified against our similarity constraint.…”
Section: Experiments Setup
Citation type: mentioning
confidence: 99%
“…These replications are hardly detected by similarity search and join approaches since these methods measures the similarities of entire documents, which are relatively low when only a small part is replicated. Document fingerprinting approaches are also likely to miss these results because they are either susceptible to small modifications [25,6,8] or do not have any guarantee when detecting similar segments [30,29,18,31].…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…Second, we also hoped to show that our method is more robust to a large sized document database. We also compared our method against a winnowing-based near-duplicate document search method [15], and evaluated them both based on their document-level accuracy. Finally, we conducted experiments to demonstrate the superiority of the newly proposed technique, which improves the performance of a genomic read-mapping model based document search method.…”
Section: Experiments Setting
Citation type: mentioning
confidence: 99%
“…Instead, we counted the number of fragments located in each document, and chose the top document with the greatest number of matches. Similarly, we generated a document signature according to [15] using the parameters q = 4 and w = 146, counted the number of shared signatures between the query and documents in the database, and returned the top results. As shown in Table 2, our method outperforms the existing winnowing-based method.…”
Section: Searching In Large Document Set
Citation type: mentioning
confidence: 99%
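
The citation statement above describes a baseline that builds a winnowing signature per document (q = 4, w = 146), counts the signatures a query shares with each document, and returns the top-scoring documents. A minimal Python sketch of that procedure follows. It rests on several assumptions that are not confirmed by the statements quoted here: it uses plain winnowing (keep the minimum hash in each window), not the frequency-biased selection of the cited paper; it treats w as a window measured in tokens, so each window covers w − q + 1 q-grams; and the names `qgram_hashes`, `winnow`, and `rank` are hypothetical helpers introduced only for illustration.

```python
import hashlib
from collections import Counter


def qgram_hashes(tokens, q):
    """Hash every run of q consecutive tokens to a stable 64-bit integer."""
    hashes = []
    for i in range(len(tokens) - q + 1):
        gram = " ".join(tokens[i:i + q]).encode("utf-8")
        hashes.append(int.from_bytes(hashlib.md5(gram).digest()[:8], "big"))
    return hashes


def winnow(tokens, q=4, w=146):
    """Return the set of q-gram hashes kept by plain winnowing.

    Assumption: w counts tokens, so one window spans w - q + 1 q-grams.
    The minimum hash of each window is kept as a fingerprint.
    """
    hashes = qgram_hashes(tokens, q)
    if not hashes:
        return set()
    span = max(1, w - q + 1)
    if len(hashes) <= span:
        # Document shorter than one window: keep its single minimum.
        return {min(hashes)}
    return {min(hashes[i:i + span]) for i in range(len(hashes) - span + 1)}


def rank(query_tokens, corpus, q=4, w=146, top_k=1):
    """Score each document by the number of fingerprints shared with the query."""
    query_sig = winnow(query_tokens, q, w)
    scores = Counter({
        doc_id: len(query_sig & winnow(doc_tokens, q, w))
        for doc_id, doc_tokens in corpus.items()
    })
    return scores.most_common(top_k)


if __name__ == "__main__":
    corpus = {
        "a": "the quick brown fox jumps over the lazy dog near the river bank".split(),
        "b": "completely unrelated words about cooking pasta with tomato sauce".split(),
    }
    query = "quick brown fox jumps over the lazy dog".split()
    # Small q and w so the toy documents span several windows.
    print(rank(query, corpus, q=2, w=4))
```

With this selection rule, any passage of at least w tokens copied verbatim contributes at least one shared fingerprint, which is why counting shared fingerprints can serve as a ranking score. A real system would precompute document signatures and store them in an inverted index instead of recomputing them for every query.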