Improved robustness of signature-based near-replica detection via lexicon randomization

Kołcz, Aleksander; Chowdhury, Abdur; Alspector, Joshua

doi:10.1145/1014052.1014127

Cited by 56 publications

(40 citation statements)

References 12 publications

“…Specifically, every document is compared to all others in the dataset and the similarity between each pair is calculated [5].…”

Section: Related Work (mentioning)

confidence: 99%

Fingerprint-Based Near-Duplicate Document Detection with Applications to SNS Spam Detection

Kim

2014

International Journal of Distributed Sensor Networks

Social networking has been used widely by millions of people over the world. It has become the most popular way for people who want to connect and interact online with their friends. Currently, there are many social networking sites, for instance, Facebook, My Space, and Twitter, with a huge number of active users. Therefore, they are also good places for spammers or cheaters who want to steal the personal information of users or advertise their products. Recently, many proposed methods are applied to detect spam comments on social networks with different techniques. In this paper, we propose a similarity-based method that combines fingerprinting technique with trie-tree data structure and meet-in-the-middle approach in order to achieve a higher accuracy in spam comments detection. Using our proposed approach, we are able to detect around 98% spam comments in our dataset.

“…In one example in the fingerprinting context, some frequently occurring shingles are eliminated [5]. In this study, the shingles technique considers a document as a stream of tokens, which is broken into overlapping or nonoverlapping segments referred to as shingles.…”

Section: Related Work (mentioning)

confidence: 99%
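
The shingling idea quoted above (a token stream broken into overlapping or nonoverlapping segments) can be sketched as follows. This is a minimal illustration, not the method of either paper: the function names, the whitespace tokenizer, the default shingle length `k=3`, and the use of Jaccard similarity to compare shingle sets are all assumptions made for the example.

```python
def shingles(text, k=3, overlapping=True):
    """Break a document's token stream into k-token shingles.

    Assumptions for illustration: lowercase whitespace tokenization,
    and a trailing partial segment is dropped in the nonoverlapping case.
    """
    tokens = text.lower().split()
    step = 1 if overlapping else k  # slide by one token, or jump a full shingle
    return {
        tuple(tokens[i:i + k])
        for i in range(0, len(tokens) - k + 1, step)
    }


def jaccard(a, b):
    """Jaccard similarity of two shingle sets (an assumed choice of measure)."""
    if not a and not b:
        return 1.0  # treat two empty documents as identical
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    doc1 = "cheap watches buy now at our online store"
    doc2 = "cheap watches buy now at our new online store"
    print(jaccard(shingles(doc1), shingles(doc2)))
```

Overlapping shingles tolerate small insertions better, because an inserted word only disturbs the shingles that span it; nonoverlapping segments produce fewer shingles per document but shift at every insertion.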

“…The JRC sub-corpus amounts to 10,000 documents for each language, PAN sub-corpus contains 2920 en-es and 2222 en-de document pairs and Wiki sub-corpus contains 10,000 documents for each language. The partitions of the JRC-Acquis and Wikipedia sub-collections used in the experiments are publicly available. Our complete test collection includes 70,282 documents.…”

Section: Datasets (mentioning)

confidence: 99%

“…The former refers to the technology of duplicate identification for Web search indexing, also known as near-duplicate detection; whereas the latter corresponds to high similarity search for text classification, document clustering, plagiarism detection and retrieval by example. This problem is well studied for the monolingual variant and the most popular approaches are related to shingling [1], and the majority of research is based on the selection of a representative signature for the documents in question [2][3][4].…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Language High Similarity Search Using a Conceptual Thesaurus

Gupta

Barrón-Cedeño

Rosso

2012

Lecture Notes in Computer Science

Abstract. This work addresses the issue of cross-language high similarity and near-duplicates search, where, for the given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs English-German and English-Spanish using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find, though the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.

“…Detecting and eliminating replicated documents is recognized as one of the central problems for search engines [14], [9], [17], [47]. Notice that this is a typical situation of the separation effect: Almost all distances in the range [ , 1], since 50% similarity is considered to be threshold for duplicates and thus all "original" documents have over one half distances between each other.…”

Section: Combinatorial Algorithms (mentioning)

confidence: 99%

Combinatorial Framework for Similarity Search

Lifshits

2009

2009 Second International Workshop on Similarity Search and Applications

Abstract: We present an overview of the combinatorial framework for similarity search. An algorithm is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Namely, the input dataset is represented by a comparison oracle that given any three points x, y, z answers whether y or z is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a'th most similar object to y and y is the b'th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. Ranwalk, the first known combinatorial solution for nearest neighbors, is a randomized, exact, zero-error algorithm with query time that is logarithmic in number of objects. But Ranwalk preprocessing time is quadratic. Later on, another solution, called combinatorial nets, was discovered. It is a deterministic and exact algorithm with near-linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Combinatorial nets also have a number of side applications. For near-duplicate detection they lead to the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of output. For any dataset with small disorder combinatorial nets can be used to construct a visibility graph: the one in which greedy routing deterministically converges to the nearest neighbor of a target in logarithmic number of steps. The latter result is the first known work-around for Navarro's impossibility of generalizing Delaunay graphs.
