A Primitive Operator for Similarity Joins in Data Cleaning

Chaudhuri, Surajit; Ganti, Venkatesh; Kaushik, Raghav

doi:10.1109/icde.2006.9

Cited by 478 publications

(547 citation statements)

References 9 publications

Supporting

Mentioning

532

Contrasting

Unclassified

“…It is too expensive for practical use because the value of S can be large; e.g., there are 35932 sentences in the 200 presentation files used in our experiment. Since the problem is exactly the set similarity join problem [4] and has been studied by the database research community, the ppjoin algorithm is employed [5] to efficiently find the pairs of sentences that satisfy the constraint. Its basic idea is to sort the words in each bag according to a global order and exploit the threshold t. If a pair of sentences satisfy the similarity constraint, they must share at least one word in their first p words, where p =⌊max(lx, ly)· (1-t)⌋+1, and lx and ly denote the numbers of words in x and y, respectively [5].…”

Section: A Detecting Reused Textual Elements (mentioning)

confidence: 99%

Managing Presentation Slides with Reused Elements

Zhang, Xiao, Hu et al. 2016

IJIET

Abstract: Slide presentations have become a ubiquitous tool for business and educational purposes. Instead of starting from scratch, slide composers tend to make new presentation slides by reusing materials from existing slides. Understanding how slide elements are copied from one presentation file to another and how presentation files are related to each other are difficult tasks. This paper investigates the management of multiple presentation files based on reused slide elements. Techniques are developed to detect textual and visual elements that have been reused across multiple presentation files. Interactive visualization methods are proposed to facilitate understanding the process by which these elements are reused and the relationship between the files that use them. A system with a user-friendly interface is designed, based on which experiments are performed to evaluate the effectiveness of the proposed methods.

Index Terms: Presentation slide management, slide element reuse, slide element visualization.
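The prefix-filtering condition quoted in the citation statement above can be illustrated with a small sketch. It is not the ppjoin implementation from [5]; it only shows the candidate-generation idea under stated assumptions: sentences are treated as sets of lowercased, whitespace-separated words, similarity is Jaccard with a threshold 0 < t <= 1, the global order puts rarer words first, and each sentence uses its own length l in p = ⌊l · (1 - t)⌋ + 1. The names `jaccard`, `prefix_length`, and `similar_pairs` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(x, y):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def prefix_length(length, t):
    # Equals floor(length * (1 - t)) + 1, written with ceil(t * length).
    # The small epsilon keeps floating-point round-up from shortening the
    # prefix, which could otherwise drop valid pairs.
    return length - math.ceil(t * length - 1e-9) + 1


def similar_pairs(sentences, t):
    """Return index pairs (i, j), i < j, whose Jaccard similarity is >= t.

    Assumes 0 < t <= 1.
    """
    token_sets = [set(s.lower().split()) for s in sentences]
    # Assumed global order: ascending word frequency, ties broken alphabetically.
    freq = Counter(w for s in token_sets for w in s)
    ordered = [sorted(s, key=lambda w: (freq[w], w)) for s in token_sets]

    index = defaultdict(list)  # prefix word -> ids of sentences seen so far
    result = []
    for i, words in enumerate(ordered):
        prefix = words[:prefix_length(len(words), t)]
        # Only sentences sharing a prefix word can reach the threshold.
        candidates = set()
        for w in prefix:
            candidates.update(index.get(w, ()))
        # Verify each candidate with the exact similarity.
        for j in sorted(candidates):
            if jaccard(token_sets[i], token_sets[j]) >= t:
                result.append((j, i))
        for w in prefix:
            index[w].append(i)
    return result
```

Only sentences that share a word in their prefixes are compared exactly, so most of the S² pairs are never examined. The real algorithm adds further filters that this sketch leaves out.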

“…To efficiently access these attribute values, further partitioning store techniques are studied. Chaudhuri et al. [10] study a similarity join operator (SSJoin [5,10]) on text attributes, which are also organized in a vertical style. Specifically, each value of a text attribute is converted to a set of tokens (words or q-grams [34]), which are stored separately in different tuples, similar to the attribute partitioning.…”

Section: Related Work (mentioning)

confidence: 99%

Indexing dataspaces with partitions

Song, Chen 2012

World Wide Web

Dataspaces are recently proposed to manage heterogeneous data, with features like partially unstructured, high dimension and extremely sparse. The inverted index has been previously extended to retrieve dataspaces. In order to achieve more efficient access to dataspaces, in this paper, we first introduce our survey of data features in the real dataspaces. Based on the features observed in our study, several partitioning based index approaches are proposed to accelerate the query processing in dataspaces. Specifically, the vertical partitioning index utilizes the partitions on tokens to merge and compress data. We can both reduce the number of I/O reads and avoid aggregation of data inside a compressed list. The horizontal partitioning index supports pruning partitions of tuples in the top-k query. Thus, we can reduce the computation overhead of irrelevant candidate tuples to the query. Finally, we also propose a hybrid index with both vertical and horizontal partitioning. The extensive experiment results in real data sets demonstrate that our approaches outperform the previous techniques and scale well with the large data size.
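The vertical token representation described in the citation statement above (each text value converted to a set of tokens, one token per tuple) can be sketched as follows. This is an illustration only, not the operator from the cited work: the helper names `qgrams`, `to_token_tuples`, and `overlap_counts` are hypothetical, q-grams are taken without padding, and the overlap is computed with a plain in-memory equi-join on the token column.

```python
from collections import Counter, defaultdict


def qgrams(text, q=3):
    """Distinct overlapping substrings of length q (lowercased, no padding)."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def to_token_tuples(values, q=3):
    """Flatten {record_id: text} into (record_id, token) tuples, one per token."""
    return [(rid, tok)
            for rid, text in values.items()
            for tok in sorted(qgrams(text, q))]


def overlap_counts(r_tuples, s_tuples):
    """Equi-join the two tuple lists on token and count matches per id pair."""
    by_token = defaultdict(list)
    for sid, tok in s_tuples:
        by_token[tok].append(sid)
    counts = Counter()
    for rid, tok in r_tuples:
        for sid in by_token.get(tok, ()):
            counts[(rid, sid)] += 1
    return counts


# Hypothetical example data.
R = to_token_tuples({1: "Microsoft Corp"})
S = to_token_tuples({10: "Microsoft Corporation", 11: "Oracle Corp"})
counts = overlap_counts(R, S)
```

A similarity predicate can then be applied to the per-pair counts, for example by keeping only pairs whose overlap reaches a chosen threshold.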

“…Data mining methods initially designed to efficiently search databases [26] or the Web [27] were later adapted to solve the APSS problem [28]. Most of the existing work addresses either binary vector object representations [29][30][31] or cosine similarity [32,33].…”

Section: Introduction (mentioning)

confidence: 99%

Efficient identification of Tanimoto nearest neighbors

Anastasiu, Karypis 2017

Int J Data Sci Anal

Tanimoto, or extended Jaccard, is an important similarity measure which has seen prominent use in fields such as data mining and chemoinformatics. Many of the existing state-of-the-art methods for market basket analysis, plagiarism and anomaly detection, compound database search, and ligand-based virtual screening rely heavily on identifying Tanimoto nearest neighbors. Given the rapidly increasing size of data that must be analyzed, new algorithms are needed that can speed up nearest neighbor search, while at the same time providing reliable results. While many search algorithms address the complexity of the task by retrieving only some of the nearest neighbors, we propose a method that finds all of the exact nearest neighbors efficiently by leveraging recent advances in similarity search filtering. We provide tighter filtering bounds for the Tanimoto coefficient and show that our method, TAPNN, greatly outperforms existing base-
