Near Duplicate Text Detection Using Frequency-Biased Signatures

Sun, Yue; Qin, Jianbin; Wang, Wei

doi:10.1007/978-3-642-41230-1_24

Cited by 8 publications

(10 citation statements)

References 21 publications

“…The specific implementation we use considers only candidate windows of size w, and our overlap constraints are converted into corresponding equivalent Jaccard constraints. • FBW is a Winnowing-family algorithm [31] which returns approximate answers to the problem of finding documents that share w − q + 1 consecutive token q-grams while tolerating qτ errors, where q is the q-gram length. We use its fingerprinting scheme to generate candidates and they are verified against our similarity constraint.…”

Section: Experiments Setup (mentioning)

confidence: 99%

“…These replications are hardly detected by similarity search and join approaches since these methods measures the similarities of entire documents, which are relatively low when only a small part is replicated. Document fingerprinting approaches are also likely to miss these results because they are either susceptible to small modifications [25,6,8] or do not have any guarantee when detecting similar segments [30,29,18,31].…”

Section: Introduction (mentioning)

confidence: 99%

“…page detection, identifying replications between documents has attracted remarkable attention from research community, and many approaches were proposed in the last two decades, e.g., by similarity search and join [27,10,3,4,35,33] or document fingerprinting [25,6,8,29,30,18,31]. For the body of work in similarity search and join, documents are regarded as (multi)sets of tokens or strings, and pairs of documents are identified if they satisfy a similarity constraint.…”

Section: Introduction (mentioning)

confidence: 99%

Local Similarity Search for Unstructured Text

Wang, Xiao, Qin, et al. 2016

Proceedings of the 2016 International Conference on Management of Data

Self Cite

With the growing popularity of electronic documents, replication can occur for many reasons. People may copy text segments from various sources and make modifications. In this paper, we study the problem of local similarity search to find partially replicated text. Unlike existing studies on similarity search which find entirely duplicated documents, our target is to identify documents that approximately share a pair of sliding windows which differ by no more than τ tokens. Our problem is technically challenging because for sliding windows the tokens to be indexed are less selective than entire documents, rendering set similarity join-based algorithms less efficient. Our proposed method is based on enumerating token combinations to obtain signatures with high selectivity. In order to strike a balance between signature and candidate generation, we partition the token universe and for different partitions we generate combinations composed of different numbers of tokens. A cost-aware algorithm is devised to find a good partitioning of the token universe. We also propose to leverage the overlap between adjacent windows to share computation and thus speed up query processing. In addition, we develop the techniques to support the large thresholds. Experiments on real datasets demonstrate the efficiency of our method against alternative solutions.
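The problem statement in this abstract can be illustrated with a brute-force baseline. The sketch below is not the signature-based method the abstract proposes; it simply compares every pair of fixed-size windows and assumes that "differ by no more than τ tokens" means the two windows, viewed as token multisets, have at most τ unmatched tokens. The function names and the parameters `w` and `tau` are hypothetical.

```python
from collections import Counter


def windows(tokens, w):
    """Return every run of w consecutive tokens."""
    return [tokens[i:i + w] for i in range(len(tokens) - w + 1)]


def unmatched(a, b):
    """Count tokens of window a with no partner in window b (multiset view).

    Both windows have the same length, so the count is symmetric.
    """
    overlap = sum((Counter(a) & Counter(b)).values())
    return len(a) - overlap


def similar_window_pairs(doc_a, doc_b, w, tau):
    """Brute force: list (i, j) where window i of doc_a and window j of doc_b
    differ by at most tau tokens. Cost is quadratic in the number of windows,
    which is the inefficiency that indexing schemes try to avoid.
    """
    wins_a = windows(doc_a.split(), w)
    wins_b = windows(doc_b.split(), w)
    return [
        (i, j)
        for i, wa in enumerate(wins_a)
        for j, wb in enumerate(wins_b)
        if unmatched(wa, wb) <= tau
    ]


pairs = similar_window_pairs("a b c d e f", "x a b c z e f y", w=4, tau=1)
```

With these toy inputs, window 0 of the first text ("a b c d") and window 1 of the second ("a b c z") share three of four tokens, so the pair qualifies at τ = 1 but not at τ = 0.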

“…Second, we also hoped to show that our method is more robust to a large sized document database. We also compared our method against a winnowing-based near-duplicate document search method [15], and evaluated them both based on their document-level accuracy. Finally, we conducted experiments to demonstrate the superiority of the newly proposed technique, which improves the performance of a genomic read-mapping model based document search method.…”

Section: Experiments Setting (mentioning)

confidence: 99%

“…Instead, we counted the number of fragments located in each document, and chose the top document with the greatest number of matches. Similarly, we generated a document signature according to [15] using the parameters q = 4 and w = 146, counted the number of shared signatures between the query and documents in the database, and returned the top results. As shown in Table 2, our method outperforms the existing winnowing-based method.…”

Section: Searching In Large Document Set (mentioning)

confidence: 99%
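The baseline described in the statement above (build a signature per document from q-grams and a window w, count signatures shared with the query, return the top documents) can be sketched as follows. This is a minimal illustration under stated assumptions: it uses plain winnowing, which keeps the minimum hash in each window, rather than the frequency-biased selection of the cited paper, and the function names, the MD5-based hash, and the whitespace tokenizer are all hypothetical choices.

```python
import hashlib


def qgram_hashes(tokens, q):
    """Hash every run of q consecutive tokens to a stable 64-bit integer."""
    hashes = []
    for i in range(len(tokens) - q + 1):
        gram = " ".join(tokens[i:i + q])
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        hashes.append(int.from_bytes(digest[:8], "big"))
    return hashes


def winnow(tokens, q, w):
    """Plain winnowing: keep the minimum hash of each window of w q-gram hashes.

    Assumption: texts shorter than one full window contribute their single
    smallest hash so that they still have a signature.
    """
    hashes = qgram_hashes(tokens, q)
    if not hashes:
        return set()
    if len(hashes) <= w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def rank_by_shared_signatures(query, docs, q=4, w=146):
    """Rank documents by how many signature hashes they share with the query."""
    query_sig = winnow(query.split(), q, w)
    scores = {
        name: len(query_sig & winnow(text.split(), q, w))
        for name, text in docs.items()
    }
    # Highest score first; ties broken by name for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "a": "the quick brown fox jumps over the lazy dog near the river bank",
    "b": "completely unrelated words appear in this other short sample text here",
}
query = "the quick brown fox jumps over the lazy dog near the river bank"
ranked = rank_by_shared_signatures(query, docs, q=2, w=3)
```

The defaults q = 4 and w = 146 mirror the parameters quoted in the statement; the toy call uses much smaller values so that the short sample texts produce more than one window.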

Fast and Flexible Text Search Using Genomic Short-Read Mapping Model

Kim, Cho 2016

ETRI J

The searching of an extensive document database for documents that are locally similar to a given query document, and the subsequent detection of similar regions between such documents, is considered as an essential task in the fields of information retrieval and data management. In this paper, we present a framework for such a task. The proposed framework employs the method of short-read mapping, which is used in bioinformatics to reveal similarities between genomic sequences. In this paper, documents are considered biological objects; consequently, edit operations between locally similar documents are viewed as an evolutionary process. Accordingly, we are able to apply the method of evolution tracing in the detection of similar regions between documents. In addition, we propose heuristic methods to address issues associated with the different stages of the proposed framework, for example, a frequency-based fragment ordering method and a locality-aware interval aggregation method. Extensive experiments covering various scenarios related to the search of an extensive document database for documents that are locally similar to a given query document are considered, and the results indicate that the proposed framework outperforms existing methods.
