2011
DOI: 10.1145/2000824.2000825
Efficient similarity joins for near-duplicate detection

Abstract: With the increasing amount of data and the need to integrate data from multiple sources, one challenging issue is to identify near-duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records whose similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token orde…
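The prefix filtering principle the abstract refers to can be illustrated with a small sketch (this is an illustrative Python implementation, not the authors' code): for a Jaccard threshold t, two records can only be similar enough if they share at least one token in their sorted prefixes, so indexing only the prefixes prunes most pairs.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_join(records, t):
    # Count token frequencies so tokens can be ordered rare-first;
    # rare prefix tokens make the filter more selective.
    freq = defaultdict(int)
    for r in records:
        for tok in set(r):
            freq[tok] += 1
    canon = [sorted(set(r), key=lambda tok: (freq[tok], tok)) for r in records]
    index = defaultdict(list)  # prefix token -> ids of records indexed under it
    results = []
    for i, toks in enumerate(canon):
        # Prefix length for Jaccard threshold t: if two records share no
        # token in these prefixes, their overlap cannot reach similarity t.
        plen = len(toks) - math.ceil(t * len(toks)) + 1
        cands = set()
        for tok in toks[:plen]:
            cands.update(index[tok])
            index[tok].append(i)
        for j in sorted(cands):
            if jaccard(canon[i], canon[j]) >= t:
                results.append((j, i))
    return results
```

For example, `prefix_join([["a","b","c","d"], ["a","b","c","e"], ["x","y"]], 0.5)` returns only the pair `(0, 1)`; record 2 is never compared against the others because its prefix tokens appear nowhere else.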

Cited by 355 publications (420 citation statements)
References 46 publications
“…In order to solve this problem, some duplicate detection systems used active learning techniques to automatically locate such ambiguous pairs. ALIAS [25] is a learning-based duplicate detection system that uses the idea of a "reject region" to significantly reduce the size of the training set. The work in [26] used a similar strategy, employing decision trees to learn rules for matching records with multiple fields.…”
Section: Active-Learning-Based Approaches
confidence: 99%
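The "reject region" idea this statement describes can be sketched schematically (this is not the ALIAS system itself; the band thresholds are hypothetical): pairs the classifier scores confidently are labeled automatically, and only the ambiguous middle band is sent to a human, which is what shrinks the training set.

```python
def select_ambiguous(pairs, score, low=0.3, high=0.7):
    # Reject-region selection (schematic): scores outside [low, high]
    # are treated as confident match/non-match decisions and auto-labeled;
    # scores inside the band are "rejected" to a human annotator.
    auto, ask_human = [], []
    for p in pairs:
        s = score(p)
        (ask_human if low <= s <= high else auto).append(p)
    return auto, ask_human
```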
“…Until now, there has been a great deal of research work [6][7][8][9][10][11] on duplicate detection. These works try to map duplicates between two sources, which results in C(n,2) = n(n-1)/2 instantiations of duplicate detectors across n total sources.…”
Section: Introduction
confidence: 99%
“…study the problem of how to efficiently extract the K pairs of records that are most similar to each other. In [3,28,34,36,37,39], the focus is on how to efficiently extract all records with record scores greater than a pre-specified threshold. -Method 2: Pre-specify a threshold for each individual attribute, such that each record whose attribute score over the corresponding attribute is not less than the pre-specified threshold is regarded as referring to the same entity as the search query.…”
Section: Related Work
confidence: 99%
“…In essence, the objective of these methods is to identify similar strings while scanning as few records as possible. In [1,5,13,21,23,24,34], the main approaches are based on inverted indices and a variety of effective filtering techniques. In [1,5,24], the focus is on how to skip as many strings as possible during the merging of inverted lists.…”
Section: Approximate String Search
confidence: 99%
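The inverted-list merging that these approximate string search works build on can be sketched as follows (a minimal illustration, assuming q-gram tokens and a simple count filter; the cited papers use tighter, derived bounds rather than a fixed `min_overlap`):

```python
from collections import defaultdict

def build_index(strings, q=2):
    # Inverted index: q-gram -> ids of strings containing that gram.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i in range(len(s) - q + 1):
            index[s[i:i+q]].add(sid)
    return index

def candidates(query, index, q=2, min_overlap=2):
    # Merge the query's q-gram lists, counting how many grams each
    # string shares with the query; strings below the count threshold
    # are skipped without any edit-distance computation.
    counts = defaultdict(int)
    for i in range(len(query) - q + 1):
        for sid in index[query[i:i+q]]:
            counts[sid] += 1
    return {sid for sid, c in counts.items() if c >= min_overlap}
```

For example, with the index built over `["hello", "help", "world"]`, the query `"hell"` yields candidates `{0, 1}`, and `"world"` is filtered out before any exact similarity check.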
“…Properly defining the similarity of each record is necessary. In [23,24,31,33-35], attribute values of the same record are concatenated into a single string, and the similarity of each record is defined using a given similarity function. …”
Section: Introduction
confidence: 99%
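The concatenate-then-compare scheme described above can be sketched in a few lines (Jaccard over whitespace tokens is used here as one common choice of similarity function; the cited works may use others, such as cosine or edit similarity):

```python
def record_similarity(rec_a, rec_b):
    # Concatenate each record's attribute values into a single string,
    # tokenize on whitespace, and compare the token sets with Jaccard.
    tokens_a = set(" ".join(rec_a).lower().split())
    tokens_b = set(" ".join(rec_b).lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

For example, `record_similarity(["John Smith", "New York"], ["John Smith", "Boston"])` gives 2 shared tokens out of 5 distinct ones, i.e. 0.4.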