22nd International Conference on Data Engineering (ICDE'06) 2006
DOI: 10.1109/icde.2006.9

A Primitive Operator for Similarity Joins in Data Cleaning

Cited by 478 publications (547 citation statements)
References 9 publications
“…It is too expensive for practical use because the value of S can be large; e.g., there are 35932 sentences in the 200 presentation files used in our experiment. Since the problem is exactly the set similarity join problem [4] and has been studied by the database research community, the ppjoin algorithm is employed [5] to efficiently find the pairs of sentences that satisfy the constraint. Its basic idea is to sort the words in each bag according to a global order and exploit the threshold t. If a pair of sentences satisfy the similarity constraint, they must share at least one word in their first p words, where p =⌊max(lx, ly)· (1-t)⌋+1, and lx and ly denote the numbers of words in x and y, respectively [5].…”
Section: A Detecting Reused Textual Elements
Citation type: mentioning
confidence: 99%
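
The prefix-filtering idea in the statement above can be illustrated with a short sketch. It is not the ppjoin implementation from [5]. It assumes Jaccard similarity over word sets (the statement does not name the measure), orders words by ascending frequency as the global order, and uses a per-sentence prefix of ⌊l·(1-t)⌋+1 words, which is never longer than the p quoted above. The names `similarity_join`, `prefix_length`, and `jaccard` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets (assumed measure)."""
    return len(a & b) / len(a | b)


def prefix_length(size, t):
    # floor(size * (1 - t)) + 1, written as size - ceil(t * size) + 1.
    # The small epsilon guards against floating-point round-up of t * size.
    return size - math.ceil(t * size - 1e-9) + 1


def similarity_join(records, t):
    """Return index pairs (j, i), j < i, whose Jaccard similarity is >= t."""
    sets = [set(r) for r in records]
    # Global order: rare words first, ties broken alphabetically.
    freq = Counter(tok for s in sets for tok in s)
    ordered = [sorted(s, key=lambda tok: (freq[tok], tok)) for s in sets]

    index = defaultdict(list)  # prefix word -> earlier records containing it
    results = []
    for i, toks in enumerate(ordered):
        candidates = set()
        for tok in toks[:prefix_length(len(toks), t)]:
            candidates.update(index[tok])  # earlier records sharing this word
            index[tok].append(i)
        # Verify each candidate with the exact similarity.
        for j in sorted(candidates):
            if jaccard(sets[i], sets[j]) >= t:
                results.append((j, i))
    return results


sentences = [
    "the quick brown fox jumps".split(),
    "the quick brown fox leaps".split(),
    "a totally different sentence here".split(),
]
print(similarity_join(sentences, 0.6))  # [(0, 1)]
```

Only sentences that share a word inside their prefixes are compared exactly, so most of the S·(S-1)/2 pairs are never examined.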
“…To efficiently access these attribute values, further partitioning store techniques are studied. Chaudhuri et al [10] study a similarity join operator (SSJoin [5,10]) on text attributes, which are also organized in a vertical style. Specifically, each value of text attributes is converted to a set of tokens (words or q-grams [34]), which are store separately in different tuples respectively, similar to the attribute partitioning.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
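
The vertical token layout described in the statement above can be sketched as follows. This is a rough illustration only, not the operator's actual implementation: it assumes lowercase character 3-grams as tokens, stores one `(record_id, token)` tuple per token, and counts shared tokens per record pair by matching tuples on the token. The function names `qgrams`, `to_token_rows`, and `overlap_counts` are hypothetical.

```python
from collections import Counter


def qgrams(text, q=3):
    """Set of character q-grams of a lowercased string (assumed tokenizer)."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def to_token_rows(rid, text, q=3):
    """One (record_id, token) tuple per token: the vertical layout."""
    return [(rid, tok) for tok in sorted(qgrams(text, q))]


def overlap_counts(rows_r, rows_s):
    """Match tuples on token and count shared tokens per (r_id, s_id) pair."""
    by_token = {}
    for sid, tok in rows_s:
        by_token.setdefault(tok, []).append(sid)
    counts = Counter()
    for rid, tok in rows_r:
        for sid in by_token.get(tok, []):
            counts[(rid, sid)] += 1
    return counts


r_rows = to_token_rows(1, "microsoft")
s_rows = to_token_rows(7, "microsft")
print(overlap_counts(r_rows, s_rows)[(1, 7)])  # 4
```

A similarity predicate can then be phrased as a condition on the per-pair count, for example keeping only pairs whose overlap reaches a chosen threshold.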
“…Data mining methods initially designed to efficiently search databases [26] or the Web [27] were later adapted to solve the APSS problem [28]. Most of the existing work addresses either binary vector object representations [29][30][31] or cosine similarity [32,33].…”
Section: Introduction
Citation type: mentioning
confidence: 99%