Yunjun Gao scite author profile

The goal in similarity search is to find objects similar to a specified query object given a certain similarity criterion. Although useful in many areas, such as multimedia retrieval, pattern recognition, and computational biology, to name but a few, similarity search is not yet supported well by commercial DBMS. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. We propose an efficient disk-based metric access method, the Space-filling curve and Pivot-based B + -tree (SPB-tree), to support a wide range of data types and similarity metrics. The SPB-tree uses a small set of so-called pivots to reduce significantly the number of distance computations, uses a spacefilling curve to cluster the data into compact regions, thus improving storage efficiency, and utilizes a B + -tree with minimum bounding box information as the underlying index. The SPB-tree also employs a separate random access file to efficiently manage a large and complex data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient similarity search algorithms and corresponding cost models based on the SPB-tree. Extensive experiments using real and synthetic data show that the SPB-tree has much lower construction cost, smaller storage size, and can support more efficient similarity queries with high accuracy cost models than is the case for competing techniques. Moreover, the SPB-tree scales sublinearly with growing dataset size.

show abstract

Pivot-based metric indexing

Gao

Zheng

et al. 2017

Proc. VLDB Endow.

View full text Add to dashboard Cite

The general notion of a metric space encompasses a diverse range of data types and accompanying similarity measures. Hence, metric search plays an important role in a wide range of settings, including multimedia retrieval, data mining, and data integration. With the aim of accelerating metric search, a collection of pivot-based indexing techniques for metric data has been proposed, which reduces the number of potentially expensive similarity comparisons by exploiting the triangle inequality for pruning and validation. However, no comprehensive empirical study of those techniques exists. Existing studies each offers only a narrower coverage, and they use different pivot selection strategies that affect performance substantially and thus render cross-study comparisons difficult or impossible. We offer a survey of existing pivot-based indexing techniques, and report a comprehensive empirical comparison of their construction costs, update efficiency, storage sizes, and similarity search performance. As part of the study, we provide modifications for two existing indexing techniques to make them more competitive. The findings and insights obtained from the study reveal different strengths and weaknesses of different indexing techniques, and offer guidance on selecting an appropriate indexing technique for a given setting.

show abstract

Continuous visible nearest neighbor queries

Gao

Zheng

Lee

et al. 2009

View full text Add to dashboard Cite

In this paper, we identify and solve a new type of spatial queries, called continuous visible nearest neighbor (CVNN) search. Given a data set P , an obstacle set O, and a query line segment q, a CVNN query returns a set of p, R tuples such that p ∈ P is the nearest neighbor (NN) to every point r along the interval R ⊆ q as well as p is visible to r. Note that p may be NULL, meaning that all points in P are invisible to all points in R, due to the obstruction of some obstacles in O. In this paper, we formulate the problem and propose efficient algorithms for CVNN query processing, assuming that both P and O are indexed by R-trees. In addition, we extend our techniques to several variations of the CVNN query. Extensive experiments verify the efficiency and effectiveness of our proposed algorithms using both real and synthetic datasets.

show abstract

Efficient Metric Indexing for Similarity Search and Similarity Joins

Chen

Gao

et al. 2017

IEEE Trans. Knowl. Data Eng.

View full text Add to dashboard Cite

Spatial queries including similarity search and similarity joins are useful in many areas, such as multimedia retrieval, data integration, and so on. However, they are not supported well by commercial DBMSs. This may be due to the complex data types involved and the needs for flexible similarity criteria seen in real applications. In this paper, we propose a versatile and efficient diskbased index for metric data, the Space-filling curve and Pivot-based B +-tree (SPB-tree). This index leverages the B +-tree, and uses space-filling curve to cluster data into compact regions, thus achieving storage efficiency. It utilizes a small set of so-called pivots to reduce significantly the number of distance computations when using the index. Further, it makes use of a separate random access file to support a broad range of data. By design, it is easy to integrate the SPB-tree into an existing DBMS. We present efficient algorithms for processing similarity search and similarity joins, as well as corresponding cost models based on SPB-trees. Extensive experiments using both real and synthetic data show that, compared with state-of-the-art competitors, the SPB-tree has much lower construction cost, smaller storage size, and supports more efficient similarity search and similarity joins with high accuracy cost models.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.

Yunjun Gao

Manipulation in digital word-of-mouth: A reality check for book reviews

Efficient metric indexing for similarity search

Pivot-based metric indexing

Continuous visible nearest neighbor queries

Efficient Metric Indexing for Similarity Search and Similarity Joins

Contact Info

Product

Resources

About