A Scalable Index for Top-k Subtree Similarity Queries

Kocher, Daniel; Augsten, Nikolaus

doi:10.1145/3299869.3319892

Cited by 9 publications

(4 citation statements)

References 35 publications


“…The kNN-Join is not commutative, i.e., the order of the join partners matters. An efficient technique that leverages an inverted list on tokens that are partitioned into size stripes is the Cone algorithm [42], which is crafted for label sets in the context of top-k subtree similarity queries. To increase the limited scope of the original algorithm, we adapted it to leverage ScanCount.…”

Section: Sparse Vector-based NN Methods (mentioning)

confidence: 99%

Benchmarking Filtering Techniques for Entity Resolution

Papadakis

Fisichella

Schoger

et al. 2023

2023 IEEE 39th International Conference on Data Engineering (ICDE)

Self Cite


Entity Resolution is the task of identifying pairs of entity profiles that represent the same real-world object. To avoid checking a quadratic number of entity pairs, various filtering techniques have been proposed that fall into two main categories: (i) blocking workflows group together entity profiles with identical or similar signatures, and (ii) nearest-neighbor methods convert all entity profiles into vectors and identify the closest ones to every query entity. Unfortunately, the main techniques from these two categories have rarely been compared in the literature and, thus, their relative performance is unknown. We perform the first systematic experimental study that investigates the relative performance of the main representatives per category over numerous established datasets. Comparing techniques from different categories turns out to be a non-trivial task due to the various configuration parameters that are hard to fine-tune, but have a significant impact on performance. We consider a plethora of parameter configurations, optimizing each technique with respect to recall and precision targets. Both schema-agnostic and schema-based settings are evaluated. The experimental results provide novel insights into the effectiveness, the time efficiency and the scalability of the considered techniques.
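The citation statement above refers to ScanCount over an inverted list and notes that the kNN-Join is not commutative. The following is a minimal sketch of both ideas, assuming Jaccard similarity on token sets. It is not the Cone algorithm and does not partition the lists into size stripes; the function names (`build_inverted_index`, `scan_count_top_k`, `knn_join`) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict
import heapq


def build_inverted_index(sets):
    """Map each token to the ids of the indexed sets that contain it."""
    index = defaultdict(list)
    for set_id, tokens in enumerate(sets):
        for token in set(tokens):
            index[token].append(set_id)
    return index


def scan_count_top_k(query, sets, index, k):
    """Return up to k (similarity, set_id) pairs for `query`.

    ScanCount idea: scan the inverted list of every query token and
    count how often each set id occurs; the count is the overlap size.
    Sets that share no token with the query are never touched.
    """
    query = set(query)
    overlap = defaultdict(int)
    for token in query:
        for set_id in index.get(token, ()):
            overlap[set_id] += 1

    scored = []
    for set_id, common in overlap.items():
        union = len(query) + len(set(sets[set_id])) - common
        scored.append((common / union, set_id))

    # Highest Jaccard similarity first; ties go to the smaller set id.
    return heapq.nsmallest(k, scored, key=lambda p: (-p[0], p[1]))


def knn_join(left, right, k):
    """For every set in `left`, find its k nearest sets in `right`."""
    index = build_inverted_index(right)
    return {
        i: [set_id for _, set_id in scan_count_top_k(q, right, index, k)]
        for i, q in enumerate(left)
    }


# Illustrative data (assumption, not taken from the cited papers).
left = [{"a", "b"}, {"c"}]
right = [{"a", "b", "c"}, {"a"}, {"c", "d"}]

forward = knn_join(left, right, k=1)   # neighbors of left sets in right
backward = knn_join(right, left, k=1)  # neighbors of right sets in left
```

With this data, `forward` pairs each of the two left sets with one right set, while `backward` pairs each of the three right sets with one left set. The two joins therefore produce different pair sets, which is what "not commutative" means here.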



“…The kNN-Join is not commutative, i.e., the order of the join partners matters. An efficient technique that leverages an inverted list on tokens that are partitioned into size stripes is the Cone algorithm [35], which is crafted for label sets in the context of top-k subtree similarity queries. To increase the limited scope of the original algorithm, we adapted it to leverage ScanCount.…”

Section: String Similarity Joins (mentioning)

confidence: 99%

How to reduce the search space of Entity Resolution: with Blocking or Nearest Neighbor search?

Papadakis

Fisichella

Schoger

et al. 2022

Preprint

Self Cite


Entity Resolution suffers from quadratic time complexity. To increase its time efficiency, three kinds of filtering techniques are typically used for restricting its search space: (i) blocking workflows, which group together entity profiles with identical or similar signatures, (ii) string similarity join algorithms, which quickly detect entities more similar than a threshold, and (iii) nearest-neighbor methods, which convert every entity profile into a vector and quickly detect the closest entities according to the specified distance function. Numerous methods have been proposed for each type, but the literature lacks a comparative analysis of their relative performance. As we show in this work, this is a non-trivial task, due to the significant impact of configuration parameters on the performance of each filtering technique. We perform the first systematic experimental study that investigates the relative performance of the main methods per type over 10 real-world datasets. For each method, we consider a plethora of parameter configurations, optimizing it with respect to recall and precision. For each dataset, we consider both schema-agnostic and schema-based settings. The experimental results provide novel insights into the effectiveness and time efficiency of the considered techniques, demonstrating the superiority of blocking workflows and string similarity joins.
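As a rough illustration of category (ii) in the abstract above, the sketch below joins two collections of token sets and keeps the pairs whose Jaccard similarity reaches a threshold. It assumes Jaccard as the similarity function and uses a simple size filter to skip hopeless pairs; real string similarity join algorithms add further filters and indexes. The names `jaccard` and `threshold_join` are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def threshold_join(left, right, threshold):
    """Return all (i, j) with jaccard(left[i], right[j]) >= threshold."""
    pairs = []
    for i, r in enumerate(left):
        for j, s in enumerate(right):
            # Size filter: Jaccard is at most min(|r|, |s|) / max(|r|, |s|),
            # so a pair whose sizes differ too much cannot qualify.
            if len(s) < threshold * len(r) or len(r) < threshold * len(s):
                continue
            if jaccard(r, s) >= threshold:
                pairs.append((i, j))
    return pairs


# Illustrative entity profiles as token sets (assumption, not from the paper).
left = [{"john", "smith", "nyc"}]
right = [{"john", "smith", "ny"}, {"smith"}, {"john", "smith", "nyc"}]

matches = threshold_join(left, right, threshold=0.5)
```

Here the single-token profile `{"smith"}` is discarded by the size filter alone, without computing its similarity.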


“…DBLP [6] stores bibliographic data in XML format and includes, among others, authors, titles, and venues of computer science publications. Due to its availability and intuitiveness, the DBLP dataset has been used in many works for experimental purposes, e.g., as a collection of sets [44, 45], as a collection of trees [37, 38, 46], as a large hierarchical document [34, 40], and as a coauthor network graph [42, 49]. In this section, we show the impact of differences in the data preparation process that converts raw DBLP XML data into the desired input format.…”

Section: A Link Is Not Enough (mentioning)

confidence: 99%

A Link is not Enough – Reproducibility of Data

et al. 2019

Self Cite


Although many works in the database community use open data in their experimental evaluation, repeating the empirical results of previous works remains a challenge. This holds true even if the source code or binaries of the tested algorithms are available. In this paper, we argue that providing access to the raw, original datasets is not enough. Real-world datasets are rarely processed without modification. Instead, the data is adapted to the needs of the experimental evaluation in the data preparation process. We showcase that the details of the data preparation process matter and subtle differences during data conversion can have a large impact on the outcome of runtime results. We introduce a data reproducibility model, identify three levels of data reproducibility, report about our own experience, and exemplify our best practices.

