Index-based, High-dimensional, Cosine Threshold Querying with Optimality Guarantees

Li, Yuliang; Wang, Jianguo; Pullman, Benjamin; Bandeira, Nuno; Papakonstantinou, Yannis

doi:10.1007/s00224-020-10009-6

Cited by 3 publications

(1 citation statement)

References 59 publications

“…In search algorithms for a text data set, an inverted-file data structure is often adopted for invariant database that contains a set of object feature vectors [35,20,26,42,4,15,28] as in Fig. 1(b).…”

Section: Inverted-file Based Algorithms (mentioning)

confidence: 99%

Structured Inverted-File k-Means Clustering for High-Dimensional Sparse Data

Aoyama, Saito

2022

Preprint

This paper presents an architecture-friendly k-means clustering algorithm referred to as SIVF for a large-scale and high-dimensional sparse data set. Algorithm efficiency on time is often measured by the number of costly operations such as similarity calculations. In practice, it depends greatly on how an algorithm adapts to an architecture of the computer system which the algorithm is executed on. Our proposed SIVF is carefully designed so as to operate at high speed and suppress memory usage on modern CPU architectures. It exploits a structured inverted-file for a mean set with an invariant centroid-pair based filter (ICP) to decrease the number of similarity calculations in an architecture-friendly manner. The structured inverted-file with the ICP effectively reduce instructions as well as similarity calculations, suppressing pipeline hazards that may cause pipeline stalls. We demonstrate in our experiments on real large-scale document data sets that SIVF operates at higher speed and with lower memory consumption than existing algorithms. Our performance analysis reveals that SIVF works at high speed by suppressing performance degradation factors of the number of instructions, cache misses, and branch mispredictions rather than less similarity calculations.
