Efficient K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data Sets

Warashina, Tomohiro; Aoyama, Kazuo; Sawada, Hiroshi; Hirose, Takashi

doi:10.1587/transinf.2014edp7108

Cited by 10 publications

(10 citation statements)

References 31 publications


“…Performance is improved sublinearly (∼ 1.6× for m = 20, ∼ 1.7× for m = 40). For comparison, the largest k-NN graph construction we are aware of used a dataset comprising 36.5 million 384d vectors, which took a cluster of 128 CPU servers 108.7 hours of compute [45], using NN-Descent [15]. Note that NN-Descent could also build or refine the k-NN graph for the datasets we consider, but it has a large memory overhead over the graph storage, which is already 80 GB for Deep1B.…”

Section: The K-nn Graph (mentioning)

confidence: 99%

Billion-Scale Similarity Search with GPUs

Johnson

Douze

Jégou

2021

IEEE Trans. Big Data

1,981

1,110


Similarity search finds application in specialized database systems handling complex data such as images or videos, which are typically represented by high-dimensional features and require specific indexing structures. This paper tackles the problem of better utilizing GPUs for this task. While GPUs excel at data-parallel tasks, prior approaches are bottlenecked by algorithms that expose less parallelism, such as k-min selection, or make poor use of the memory hierarchy. We propose a design for k-selection that operates at up to 55% of theoretical peak performance, enabling a nearest neighbor implementation that is 8.5× faster than prior GPU state of the art. We apply it in different similarity search scenarios, by proposing optimized design for brute-force, approximate and compressed-domain search based on product quantization. In all these setups, we outperform the state of the art by large margins. Our implementation enables the construction of a high accuracy k-NN graph on 95 million images from the Yfcc100M dataset in 35 minutes, and of a graph connecting 1 billion vectors in less than 12 hours on 4 Maxwell Titan X GPUs. We have open-sourced our approach for the sake of comparison and reproducibility.
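To make the k-NN graph and k-selection terminology in the abstract above concrete, here is a minimal CPU sketch of exact brute-force k-NN graph construction. It is not the GPU method the paper describes: the function names `squared_l2` and `brute_force_knn_graph` are hypothetical, and the standard-library `heapq.nsmallest` stands in for the k-selection step.

```python
import heapq
from typing import List, Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_knn_graph(points: Sequence[Sequence[float]], k: int) -> List[List[int]]:
    """Return, for every point, the indices of its k nearest other points.

    Exact O(n^2) baseline for illustration only; ties are broken by the
    smaller index because candidates are (distance, index) tuples.
    """
    n = len(points)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < number of points")
    graph = []
    for i, p in enumerate(points):
        # Candidate (distance, index) pairs for every point except p itself.
        candidates = ((squared_l2(p, q), j) for j, q in enumerate(points) if j != i)
        # k-selection: keep only the k smallest distances.
        nearest = heapq.nsmallest(k, candidates)
        graph.append([j for _, j in nearest])
    return graph


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
    print(brute_force_knn_graph(pts, k=2))
```

The quadratic cost of this baseline is what makes large data sets impractical without the kind of parallel or approximate techniques the cited works discuss.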


“…When the caching data is used at the current execution stage, the caching file and the index file stored in the local file of the node are loaded (lines 8-10). Because no cached data exists for the first iteration, the data being configured as invariant data among map outcomes are stored in the caching file (lines 11-14). At this time, the byte offset of the caching file is recorded into the index file for combining the caching data and map outcomes.…”

Section: Invariant Data Caching Mechanism (mentioning)

confidence: 99%
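The byte-offset bookkeeping described in the statement above can be sketched as follows. This is an illustrative guess at the general idea, not the cited framework's code: the class `OffsetIndexedCache`, its methods, and the use of `pickle` and an in-memory buffer in place of a node-local file are all assumptions.

```python
import io
import pickle
from typing import Any, BinaryIO, Dict, Optional, Tuple


class OffsetIndexedCache:
    """Append-only cache whose index maps each key to a byte offset and length."""

    def __init__(self, cache_file: Optional[BinaryIO] = None) -> None:
        # An in-memory buffer stands in for the node-local caching file.
        self._file = cache_file if cache_file is not None else io.BytesIO()
        # Stands in for the index file: key -> (byte offset, payload length).
        self._index: Dict[Any, Tuple[int, int]] = {}

    def __contains__(self, key: Any) -> bool:
        return key in self._index

    def put(self, key: Any, value: Any) -> None:
        """Append the serialized value and record where it starts."""
        payload = pickle.dumps(value)
        self._file.seek(0, io.SEEK_END)
        offset = self._file.tell()
        self._file.write(payload)
        self._index[key] = (offset, len(payload))

    def get(self, key: Any) -> Any:
        """Seek straight to the recorded offset and deserialize the value."""
        offset, length = self._index[key]
        self._file.seek(offset)
        return pickle.loads(self._file.read(length))


if __name__ == "__main__":
    cache = OffsetIndexedCache()
    cache.put("point-1", (1.0, 2.0))   # first iteration: store invariant data
    print(cache.get("point-1"))        # later iterations: read it back by offset
```

Recording offsets lets later iterations read a single cached record without rescanning the whole caching file.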

“…The maximum number of containers that can be allocated to each node is calculated by each node (lines 5-10). Depending on the type of each task, the containers are assigned to each node by using the usage status of node resources and the resource scheduling information (lines 11-18). At this time, the existing policy of Hadoop is maintained as the container allocation policy.…”

Section: Iterative Resource Scheduler (mentioning)

confidence: 99%

“…Figure 14 shows a pseudo-code of the k-Means application using our add-on iterative framework APIs. By just adding our add-on iterative framework APIs into the main function, a user can reuse the existing iterative applications without the modification of user-defined map/reduce classes (lines 8-15). Therefore, our framework does not require the re-creation of the existing iterative applications.…”

Section: Add-on Iterative Framework APIs (mentioning)

confidence: 99%

“…Thirdly, we devise an iterative resource management technique which can allocate resource uniformly to every node in a Hadoop cluster. For this, we store iteration information into a meta-data table in HDFS (Hadoop Distributed File System) [13]. Fourthly, we devise a stop condition check mechanism for preventing the unnecessary computation by comparing the current iteration output with the previous iteration output.…”

Section: Introduction (mentioning)

confidence: 99%


Cancelled: A New Efficient Resource Management Framework for Iterative MapReduce Processing in Large-Scale Data Analysis

Hong

Park

Lim

et al. 2017

IEICE Trans. Inf. & Syst.


Seungtae HONG, Kyongseok PARK, Chae-Deok LIM, Nonmembers, and Jae-Woo CHANG, Member

SUMMARY: To analyze large-scale data efficiently, studies on Hadoop, one of the most popular MapReduce frameworks, have been actively done. Meanwhile, most of the large-scale data analysis applications, e.g., data clustering, are required to do the same map and reduce functions repeatedly. However, Hadoop cannot provide an optimal performance for iterative MapReduce jobs because it derives a result by doing one phase of map and reduce functions. To solve the problems, in this paper, we propose a new efficient resource management framework for iterative MapReduce processing in large-scale data analysis. For this, we first design an iterative job state-machine for managing the iterative MapReduce jobs. Secondly, we propose an invariant data caching mechanism for reducing the I/O costs of data accesses. Thirdly, we propose an iterative resource management technique for efficiently managing the resources of a Hadoop cluster. Fourthly, we devise a stop condition check mechanism for preventing unnecessary computation. Finally, we show the performance superiority of the proposed framework by comparing it with the existing frameworks.

Key words: large-scale data analysis, iterative data processing framework, MapReduce, Hadoop
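The stop condition check mentioned in this abstract compares each iteration's output with the previous one and halts once nothing changes. A minimal single-process sketch of that idea, using a toy 1-D k-means step, might look like the following. All names here (`has_converged`, `iterate_until_stable`, `make_kmeans_step_1d`) and the tolerance-based comparison are assumptions for illustration, not the framework's actual API.

```python
from typing import Callable, List, Sequence, Tuple


def has_converged(previous: Sequence[float], current: Sequence[float],
                  tolerance: float = 1e-9) -> bool:
    """True when both outputs have the same size and differ by at most tolerance."""
    return len(previous) == len(current) and all(
        abs(p - c) <= tolerance for p, c in zip(previous, current)
    )


def iterate_until_stable(step: Callable[[List[float]], List[float]],
                         initial: Sequence[float],
                         max_iterations: int = 100,
                         tolerance: float = 1e-9) -> Tuple[List[float], int]:
    """Run step repeatedly; stop as soon as the output stops changing."""
    previous = list(initial)
    for iteration in range(1, max_iterations + 1):
        current = step(previous)
        if has_converged(previous, current, tolerance):
            return current, iteration  # further iterations would be wasted work
        previous = current
    return previous, max_iterations


def make_kmeans_step_1d(data: Sequence[float]) -> Callable[[List[float]], List[float]]:
    """Build one k-means iteration (assign, then average) over 1-D data."""
    def step(centroids: List[float]) -> List[float]:
        clusters: List[List[float]] = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # An empty cluster keeps its old centroid.
        return [sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return step


if __name__ == "__main__":
    step = make_kmeans_step_1d([1.0, 2.0, 9.0, 10.0])
    print(iterate_until_stable(step, [0.0, 5.0]))
```

In an iterative MapReduce setting, each call to `step` would correspond to one full map and reduce pass, so stopping early saves an entire job launch per skipped iteration.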


A True O(n log n) Algorithm for the All-k-Nearest-Neighbors Problem

2019

Lecture Notes in Computer Science


In this paper we examined an algorithm for the All-k-Nearest-Neighbor problem proposed in 1980s, which was claimed to have an O(n log n) upper bound on the running time. We find the algorithm actually exceeds the so claimed upper bound, and prove that it has an Ω(n²) lower bound on the time complexity. Besides, we propose a new algorithm that truly achieves the O(n log n) bound. Detailed and rigorous theoretical proofs are provided to show the proposed algorithm runs exactly in O(n log n) time.
