A Scalable System for Identifying Co-derivative Documents

Bernstein, Yaniv; Zobel, Justin

doi:10.1007/978-3-540-30213-1_6

Cited by 44 publications

(48 citation statements)

References 14 publications

“…One of the most popular downsampling techniques is 0 mod p [12]. When indexing a document, this technique hashes each n-gram to an integer and then determines whether the hash value is divisible by some integer p. Since instances of the same n-gram in all documents hash to the same value, the algorithm does not need to maintain a dictionary of the downsampled features.…”

Section: Downsampling Document Features (mentioning)

confidence: 99%

Detecting and modeling local text reuse

Smith, Cordell, Dillon, et al. 2014

IEEE/ACM Joint Conference on Digital Libraries

Texts propagate through many social networks and provide evidence for their structure. We describe and evaluate efficient algorithms for detecting clusters of reused passages embedded within longer documents in large collections. We apply these techniques to two case studies: analyzing the culture of free reprinting in the nineteenth-century United States and the development of bills into legislation in the U.S. Congress. Using these divergent case studies, we evaluate both the efficiency of the approximate local text reuse detection methods and the accuracy of the results. These techniques allow us to explore how ideas spread, which ideas spread, and which subgroups shared ideas.
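
The 0 mod p downsampling described in the citation statement above can be sketched as follows. This is a minimal illustration, not the implementation from either paper: the function names (`ngrams`, `stable_hash`, `zero_mod_p`), the choice of SHA-1 as the hash, and the default values of `n` and `p` are all assumptions made for the example. A digest-based hash is used because Python's built-in `hash()` for strings is randomized per process, which would break the property that the same n-gram always maps to the same value.

```python
import hashlib


def ngrams(tokens, n):
    """Return all contiguous word n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def stable_hash(text):
    """Map a string to an integer that is identical across runs and machines.

    SHA-1 is an assumption for this sketch; any deterministic hash would do.
    """
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def zero_mod_p(tokens, n=3, p=4):
    """Keep only the n-grams whose hash value is divisible by p.

    No dictionary of selected features is stored: the keep/drop decision
    depends only on the n-gram itself, so it is the same in every document.
    """
    return {g for g in ngrams(tokens, n) if stable_hash(g) % p == 0}


doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "a quick brown fox jumps over a sleeping cat".split()

kept_a = zero_mod_p(doc_a)
kept_b = zero_mod_p(doc_b)

# N-grams that survive downsampling in both documents.
shared_kept = kept_a & kept_b
```

On average roughly 1/p of the n-grams are retained, so larger values of `p` give a smaller index at the cost of possibly missing short reused passages.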

“…Similar work, in a mono-lingual environment, involves the identification of redundant [4] and co-derivative [3] documents, using fingerprinting techniques. Fingerprints are compact representations of text chunks.…”

Section: Related Work (mentioning)

confidence: 99%

“…To see the effect of this fact, we investigated two other ways to estimate sentence length and used them instead of the default method, which was number of tokens. One is sum of the term frequency in the document for each term in the sentence and the other one, the sum of their selectivity (inverse sentence frequency). Both methods produced different results for all the runs, however, they were most of the times slightly worse than the number of tokens, and in general the differences were negligible.…”

Section: Text Fragment Alignment Evaluation (mentioning)

confidence: 99%

Cross-Lingual Text Fragment Alignment Using Divergence from Randomness

Yahyaei, Bonzanini, Roelleke 2011

String Processing and Information Retrieval

This paper describes an approach to automatically align fragments of texts of two documents in different languages. A text fragment is a list of continuous sentences and an aligned pair of fragments consists of two fragments in two documents, which are content-wise related. Cross-lingual similarity between fragments of texts is estimated based on models of divergence from randomness. A set of aligned fragments based on the similarity scores are selected to provide an alignment between sections of the two documents. Similarity measures based on divergence show strong performance in the context of cross-lingual fragment alignment in the performed experiments.

“…rare chunks (Heintze (1996)) c occurs once in D SPEX (Bernstein and Zobel (2004) shingling (Broder (2000)) c ∈ {c1, . .…”

Section: Dimensionality Reduction By Embedding (mentioning)

confidence: 99%

New Issues in Near-duplicate Detection

Potthast, Stein 2008

Data Analysis, Machine Learning and Applications

Near-duplicate detection is the task of identifying documents with almost identical content. The respective algorithms are based on fingerprinting; they have attracted considerable attention due to their practical significance for Web retrieval systems, plagiarism analysis, corporate storage maintenance, or social collaboration and interaction in the World Wide Web. Our paper presents both an integrative view as well as new aspects from the field of near-duplicate detection: (i) Principles and Taxonomy. Identification and discussion of the principles behind the known algorithms for near-duplicate detection. (ii) Corpus Linguistics. Presentation of a corpus that is specifically suited for the analysis and evaluation of near-duplicate detection algorithms. The corpus is public and may serve as a starting point for a standardized collection in this field. (iii) Analysis and Evaluation. Comparison of state-of-the-art algorithms for near-duplicate detection with respect to their retrieval properties. This analysis goes beyond existing surveys and includes recent developments from the field of hash-based search.
