Hashing tree-structured data: Methods and applications

Tatikonda, Shirish; Parthasarathy, S.

doi:10.1109/icde.2010.5447882

Cited by 23 publications

(24 citation statements)

References 30 publications


“…Tatikonda and Parthasarathy [34] introduce embedded pivots for computing the distance between unordered trees. An embedded pivot consists of two nodes and their least common ancestor, unless the least common ancestor is one of the two nodes.…”

Section: Related Work (mentioning)

confidence: 99%
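The snippet definition quoted above can be sketched in a few lines of Python. This is only an illustration of the quoted definition, not the algorithm of Tatikonda and Parthasarathy: the `(label, children)` tuple encoding and the function name `embedded_pivots` are assumptions made for this sketch, and any hashing step the original method applies is left out.

```python
from itertools import combinations

# Assumed encoding for this sketch: a tree node is (label, [child nodes]).
def embedded_pivots(tree):
    """List (lca_label, label_u, label_v) for every pair of nodes in which
    neither node is an ancestor of the other (so the LCA is a third node)."""
    pivots = []

    def visit(node):
        # Returns the labels of all nodes in the subtree rooted at `node`.
        label, children = node
        child_labels = [visit(child) for child in children]
        # Two nodes taken from different child subtrees have `node` as LCA.
        for left, right in combinations(child_labels, 2):
            for a in left:
                for b in right:
                    # Sort the pair so that sibling order does not matter.
                    pivots.append((label, *sorted((a, b))))
        return [label] + [x for labels in child_labels for x in labels]

    visit(tree)
    return pivots


example = ("a", [("b", [("d", []), ("e", [])]), ("c", [])])
star = ("r", [("w", []), ("x", []), ("y", []), ("z", [])])
```

In the `star` tree, every pivot has the root as its least common ancestor, so the number of pivots grows quadratically with the number of children.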

“…Next, we show the scalability of windowed pq-grams (w = 3, p = 1, q = 2) and compare them to embedded pivots [34]. Embedded pivots are snippets that consist of two nodes and their least common ancestor (cf.…”

Section: Scalability Of Profile and Index Computation (mentioning)

confidence: 99%

“…Two trees T_i ∈ F_1 and T_j ∈ F_2 are paired, i.e., (T_i, T_j) ∈ M_x, iff T_i has only one nearest neighbor in F_2, namely T_j, and vice versa. We sort the trees and compute a mapping for our windowed pq-gram distance, the ordered tree edit distance [43] (see Section 4), the pq-gram distance [4,6], the tree embedding distance [18], the binary branch distance [40], single path shingles [8], embedded pivots [34], and the node intersection distance. The node intersection distance is a simple algorithm that completely ignores the structure of the tree.…”

Section: Matching Address Data (mentioning)

confidence: 99%
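The pairing rule in this statement (two trees are paired only if each is the other's single nearest neighbor) can be sketched as below. The function names are hypothetical, the distance is assumed to be symmetric, and the toy numeric "trees" in the usage lines stand in for real trees and a real tree distance.

```python
def unique_nearest(item, candidates, dist):
    """Index of the single nearest candidate, or None on a tie or empty input."""
    distances = [dist(item, c) for c in candidates]
    if not distances:
        return None
    best = min(distances)
    winners = [i for i, d in enumerate(distances) if d == best]
    return winners[0] if len(winners) == 1 else None


def mutual_pairs(f1, f2, dist):
    """Pairs (i, j) where f2[j] is the only nearest neighbor of f1[i] in f2
    and f1[i] is the only nearest neighbor of f2[j] in f1."""
    pairs = []
    for i, tree in enumerate(f1):
        j = unique_nearest(tree, f2, dist)
        if j is not None and unique_nearest(f2[j], f1, dist) == i:
            pairs.append((i, j))
    return pairs


# Toy usage: integers play the role of trees, absolute difference the distance.
toy_distance = lambda a, b: abs(a - b)
pairs = mutual_pairs([1, 10], [11, 2], toy_distance)
```

Any of the distances listed in the statement could be passed as `dist`; a tie for the nearest neighbor leaves the tree unpaired.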

“…Similarly, the binary branch snippets do not store the edges between a parent and its children (except the edge to the first child), leading to poor performance when many nodes in the trees have identical labels. Embedded pivots give a disproportionate weight to the root label: a quadratic number of snippets is produced from the root, while only a linear number is produced for each leaf [34]. The root label is the street name, thus address trees with different street names are unlikely to be paired.…”

Section: Matching Address Data (mentioning)

confidence: 99%


Windowed pq-grams for approximate joins of data-centric XML

et al. 2011


In data integration applications, a join matches elements that are common to two data sources. Since elements are represented slightly different in each source, an approximate join must be used to do the matching. For XML data, most existing approximate join strategies are based on some ordered tree matching technique, such as the tree edit distance. In data-centric XML, however, the sibling order is irrelevant, and two elements should match even if their subelement order varies. Thus, approximate joins for data-centric XML must leverage unordered tree matching techniques. This is computationally hard since the algorithms cannot rely on a predefined sibling order. In this paper, we give a solution for approximate joins based on unordered tree matching. The core of our solution are windowed pq-grams which are small subtrees of a specific shape. We develop an efficient technique to generate windowed pq-grams in a three-step process: sort the tree, extend the sorted tree with dummy nodes, and decompose the extended tree into windowed pq-grams. The windowed pq-gram distance between two trees is the number of pq-grams that are in one tree decomposition only. We show that our distance is a pseudometric and empirically demonstrate that it effectively approximates the unordered tree edit distance. The approximate join using windowed pq-grams can be efficiently implemented as an equality join on strings, which avoids the costly computation of the distance between every pair of input trees. Experiments with synthetic and real world data confirm the analytic results and show the effectiveness and efficiency of our technique.
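The distance described in this abstract (the number of pq-grams that occur in only one of the two decompositions) can be illustrated with a small sketch. It assumes each decomposition is already available as a list of hashable snippets and treats it as a bag (multiset); both the bag treatment and the name `profile_distance` are assumptions of this sketch, and the sort, extend, and decompose steps that produce real windowed pq-grams are not reproduced here.

```python
from collections import Counter

def profile_distance(profile_a, profile_b):
    """Size of the bag symmetric difference of two snippet collections."""
    bag_a, bag_b = Counter(profile_a), Counter(profile_b)
    # Counter subtraction keeps only positive counts, so each term holds the
    # snippets that one bag has in excess of the other.
    only_a = bag_a - bag_b
    only_b = bag_b - bag_a
    return sum(only_a.values()) + sum(only_b.values())


# Placeholder strings stand in for serialized pq-grams.
d = profile_distance(["x", "y", "y"], ["y", "z"])
```

Because each snippet can be serialized to a string, comparing profiles reduces to comparing bags of strings, which is in the spirit of the equality join on strings mentioned in the abstract.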


“…Recently in [18], each tree is transformed into a set of pivots and the Jaccard Coefficient between two sets of pivots are used to approximate the tree edit distance. As is shown in [18], for unordered trees, their method approximates tree edit distance more accurately than pq-gram. In the case of ordered trees, their matching quality is lower than that using pq-gram [11].…”

Section: Related Work (mentioning)

confidence: 99%
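The similarity measure named in this statement, the Jaccard coefficient between two sets of pivots, can be sketched as follows. The pivot tuples are made-up placeholders, and returning 1.0 for two empty sets is a convention chosen for this sketch rather than something the statement specifies.

```python
def jaccard(pivots_a, pivots_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two pivot collections."""
    a, b = set(pivots_a), set(pivots_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(a | b)


# Placeholder pivots in the form (lca_label, label_u, label_v).
tree_1 = {("a", "b", "c"), ("a", "c", "d"), ("b", "d", "e")}
tree_2 = {("a", "b", "c"), ("a", "c", "d"), ("a", "c", "f")}
similarity = jaccard(tree_1, tree_2)
```

A higher coefficient means the trees share more pivots; the statement reports that this is used to approximate the tree edit distance.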

A survey on tree edit distance lower bound estimation techniques for similarity join on XML data

Wang et al. 2014, SIGMOD Rec.


When integrating tree-structured data from autonomous and heterogeneous sources, exact joins often fail for the same object may be represented differently. Approximate join techniques are often used, in which similar trees are considered describing the same real-world object. A commonly accepted metric to evaluate tree similarity is the tree edit distance. While yielding good results, this metric is computationally complex, thus has limited benefit for large databases. To make the join process efficient, many previous works take filtering and refinement mechanisms. They provide lower bounds for the tree edit distance in order to reduce unnecessary calculations. This work explores some widely accepted filtering and refinement based methods, and combines them to form multi-level filters. Experimental results indicate that string-based lower bounds are tighter yet more computationally complex than set-based lower bounds, and multi-level filters provide the tightest lower bound efficiently.


Similarity Join on XML Based on k-Generation Set Distance

Wang et al. 2012, Web-Age Information Management
