Selectivity estimation on streaming spatio-textual data using local correlations

2019 IEEE 35th International Conference on Data Engineering (ICDE)

et al. 2019

Self Cite

In this paper, we study the problem of approximate containment similarity search. Given two records Q and X, the containment similarity between Q and X with respect to Q is |Q∩X| / |Q|. Given a query record Q and a set of records S, containment similarity search finds the records in S whose containment similarity with respect to Q is not less than a given threshold. This problem has many important applications in commercial and scientific fields, such as record matching and domain search. The existing solution relies on the asymmetric LSH method, transforming containment similarity into the well-studied Jaccard similarity. In this paper, we use an inherently different framework that transforms containment similarity into set intersection. We propose a novel augmented KMV sketch technique, namely GB-KMV, which is data-dependent and can achieve a much better trade-off between sketch size and accuracy. We provide a theoretical analysis to underpin the proposed augmented KMV sketch technique and show that it outperforms the state-of-the-art technique LSH-E in terms of estimation accuracy under practical assumptions. Our comprehensive experiments on real-life datasets verify that GB-KMV is superior to LSH-E in terms of the space-accuracy trade-off, the time-accuracy trade-off, and the sketch construction time. For instance, with similar estimation accuracy (F-1 score), GB-KMV is over 100 times faster than LSH-E on several real-life datasets.

Section: Related Work (mentioning)

confidence: 99%
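The containment similarity defined in the abstract above can be illustrated with a minimal exact (brute-force) sketch. It is not the GB-KMV method, which estimates the same quantity from sketches. The function names and the sample records are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Set


def containment_similarity(q: Set[str], x: Set[str]) -> float:
    """Return |Q ∩ X| / |Q|, the containment similarity with respect to Q."""
    if not q:
        raise ValueError("query record Q must be non-empty")
    return len(q & x) / len(q)


def containment_search(
    q: Set[str], records: Iterable[Set[str]], threshold: float
) -> List[Set[str]]:
    """Exact search: keep records whose similarity is not less than threshold."""
    return [x for x in records if containment_similarity(q, x) >= threshold]


# Hypothetical toy data, not taken from the paper.
records = [{"a", "b", "c", "d"}, {"a", "x"}, {"y", "z"}]
query = {"a", "b"}
hits = containment_search(query, records, threshold=0.5)
```

Note that the measure is asymmetric: swapping Q and X changes the denominator, which is why it cannot be treated as plain Jaccard similarity.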

GB-KMV: An Augmented KMV Sketch for Approximate Containment Similarity Search

2019 IEEE 35th International Conference on Data Engineering (ICDE)

et al. 2019

Self Cite

“…As shown in [9], Equation 3 can be modified to compound set operation where $L = L_{A_1} \oplus \cdots \oplus L_{A_n}$ and $k = \min(k_{A_1}, \ldots, k_{A_n})$. An improved KMV sketch, named G-KMV, is proposed to estimate the multi-union size in [28]. G-KMV imposes a global threshold and ensures that all hash values smaller than the threshold will be kept.…”

Section: KMV Synopses (mentioning)

confidence: 99%
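The global-threshold idea described in the snippet above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from [28]: `unit_hash`, `gkmv_sketch`, `estimate_union_size`, and the value of `tau` are hypothetical, and the estimator is the standard KMV form (k - 1) / U_(k) applied to the merged sketch.

```python
import hashlib
from typing import Iterable, Set


def unit_hash(element: str) -> float:
    """Map an element to a pseudo-uniform value in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha256(element.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def gkmv_sketch(elements: Iterable[str], tau: float) -> Set[float]:
    """Keep every hash value smaller than the global threshold tau."""
    return {h for h in map(unit_hash, elements) if h < tau}


def estimate_union_size(sketches: Iterable[Set[float]]) -> float:
    """Estimate the number of distinct elements in the union of the sketched sets.

    Assumption: the standard KMV estimator (k - 1) / U_(k), where k is the
    size of the merged sketch and U_(k) is its largest hash value.
    """
    merged: Set[float] = set().union(*sketches)
    k = len(merged)
    if k < 2:
        return float(k)
    return (k - 1) / max(merged)


# Hypothetical data: two overlapping sets whose union has 10,000 elements.
set_a = {f"item{i}" for i in range(0, 6000)}
set_b = {f"item{i}" for i in range(4000, 10000)}
tau = 0.05
sketch_a = gkmv_sketch(set_a, tau)
sketch_b = gkmv_sketch(set_b, tau)
estimate = estimate_union_size([sketch_a, sketch_b])
```

Because every set uses the same threshold, the union of the individual sketches is exactly the sketch of the union, so no hash values are discarded when sketches are merged.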

“…Although the set containment search query can be naturally modeled as range counting problem as discussed in Section 1, existing range counting techniques are exponentially dependent on the dimensionality (i.e., number of distinct elements in our problem) and not applicable to solving the containment selectivity estimation problem in our problem ([13], [23]). Distinct value estimators (e.g., KMV [9], bottom-k, min-hash [13]) are adopted in [28] to solve subset containment search (i.e., query record is a subset of data record). We also extend the distinct value estimator KMV and develop the IL-GKMV approach in Section 3 and demonstrate theoretically and through extensive experiments that distinct value estimators cannot efficiently and accurately support the superset containment semantics studied in this paper.…”

Section: Related Work (mentioning)

confidence: 99%

“…We also analyse that the performance of distinct value estimators based approach degrades when the vocabulary size is large due to the inherent superset containment semantics of the problem studied in this paper. [28] studies selectivity estimation on streaming spatio-textual data where the textual data is a set of keywords/terms (i.e., elements). However, the query semantic is different as it specifies a subset containment search on the textual data, i.e., the keywords (elements) in the query should be contained by the keywords from spatial objects.…”

Section: Introduction (mentioning)

confidence: 99%
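The difference between the two query semantics contrasted in the snippets above can be made concrete with a small sketch. It assumes that "subset containment" means the query is a subset of the record (Q ⊆ X) and reads "superset containment" as the query being a superset of the record (X ⊆ Q); the latter reading, the function names, and the toy records are assumptions made for illustration.

```python
from typing import List, Set


def subset_containment_matches(query: Set[str], records: List[Set[str]]) -> List[Set[str]]:
    """Records that contain every query element (Q ⊆ X)."""
    return [x for x in records if query <= x]


def superset_containment_matches(query: Set[str], records: List[Set[str]]) -> List[Set[str]]:
    """Records whose elements all appear in the query (X ⊆ Q)."""
    return [x for x in records if x <= query]


def selectivity(matches: List[Set[str]], records: List[Set[str]]) -> float:
    """Fraction of the dataset returned by a query."""
    return len(matches) / len(records)


# Hypothetical keyword sets attached to five objects.
records = [
    {"coffee"},
    {"wifi"},
    {"coffee", "wifi"},
    {"coffee", "wifi", "parking"},
    {"tea"},
]
query = {"coffee", "wifi"}

subset_hits = subset_containment_matches(query, records)
superset_hits = superset_containment_matches(query, records)
```

On this toy data the same query returns different result sets under the two semantics, so a selectivity estimator designed for one does not directly answer the other.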


Selectivity Estimation on Set Containment Search

Database Systems for Advanced Applications

et al. 2019

Self Cite

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over S. The problem has many important applications in commercial fields and scientific studies. To the best of our knowledge, this is the first work to study this important problem. We first extend existing distinct value estimating techniques to solve this problem and develop an inverted list and G-KMV sketch based approach, IL-GKMV. We show analytically that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we resort to developing effective and efficient sampling approaches and propose an ordered trie structure based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than the simple random sampling method and IL-GKMV. To further enhance performance, a divide-and-conquer based sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to explore the pruning opportunities. We theoretically analyse the proposed techniques regarding various accuracy estimators. Our comprehensive experiments on 6 real datasets verify the effectiveness and efficiency of our proposed techniques.
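The simple random sampling baseline that the abstract compares against can be sketched as below. This is not OT-Sampling or DC-Sampling, whose details are not given here. The containment test X ⊆ Q, the function name, the `seed` parameter, and the toy dataset are assumptions made for illustration.

```python
import random
from typing import List, Set


def estimate_selectivity_srs(
    query: Set[str], records: List[Set[str]], sample_size: int, seed: int = 0
) -> float:
    """Estimate the fraction of records contained in the query (X ⊆ Q)
    from a uniform random sample drawn without replacement."""
    if not records or sample_size <= 0:
        raise ValueError("need a non-empty dataset and a positive sample size")
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    hits = sum(1 for x in sample if x <= query)
    return hits / len(sample)


# Hypothetical dataset: 1,000 records, 300 of which are contained in the query.
records = [{"a", "b"}] * 300 + [{"a", "z"}] * 700
query = {"a", "b", "c"}
estimate = estimate_selectivity_srs(query, records, sample_size=100)
exact = estimate_selectivity_srs(query, records, sample_size=len(records))
```

When the sample covers the whole dataset the estimate equals the true selectivity; smaller samples trade accuracy for speed.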

“…We also analyze that the performance of distinct value estimator-based approach degrades when the vocabulary size is large due to the inherent superset containment semantics of the problem studied in this paper. Wang et al. [32] study selectivity estimation on streaming spatio-textual data where the textual data are a set of keywords/terms (i.e., elements). However, the query semantic is different as it specifies a subset containment search on the textual data, i.e., the keywords (elements) in the query should be contained by the keywords from spatial objects.…”

Section: Challenges (mentioning)

confidence: 99%

Selectivity Estimation on Set Containment Search

et al. 2019

Data Sci. Eng.

Self Cite

In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of set containment search of query Q over S. We first extend existing distinct value estimating techniques to solve this problem and develop an inverted list and G-KMV sketch-based approach, IL-GKMV. We show analytically that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we resort to developing effective and efficient sampling approaches and propose an ordered trie structure-based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than the simple random sampling method and IL-GKMV. To further enhance the performance, a divide-and-conquer-based sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to explore the pruning opportunities. Meanwhile, we consider weighted set containment selectivity estimation and devise a stratified random sampling approach named StrRS. We theoretically analyze the proposed techniques regarding various accuracy estimators. Our comprehensive experiments on nine real datasets verify the effectiveness and efficiency of our proposed techniques.