2014
DOI: 10.1587/transinf.2014edp7108
Efficient K-Nearest Neighbor Graph Construction Using MapReduce for Large-Scale Data Sets

Abstract: SUMMARY: This paper presents an efficient method using Hadoop MapReduce for constructing a K-nearest neighbor graph (K-NNG) from a large-scale data set. The K-NNG has been utilized as a data structure for data-analysis techniques in various applications. To apply these techniques to a large-scale data set, an efficient K-NNG construction method is desirable. We focus on NN-Descent, a recently proposed method that efficiently constructs an approximate K-NNG. NN-Descent is implemente…

Cited by 10 publications (10 citation statements) · References 31 publications
“…Performance is improved sublinearly (∼ 1.6× for m = 20, ∼ 1.7× for m = 40). For comparison, the largest k-NN graph construction we are aware of used a dataset comprising 36.5 million 384d vectors, which took a cluster of 128 CPU servers 108.7 hours of compute [45], using NN-Descent [15]. Note that NN-Descent could also build or refine the k-NN graph for the datasets we consider, but it has a large memory overhead over the graph storage, which is already 80 GB for Deep1B.…”
Section: The k-NN Graph
confidence: 99%
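The 80 GB figure quoted in the snippet can be sanity-checked with simple arithmetic: a k-NN graph over Deep1B's 10^9 vectors, with k = 20 neighbors each stored as a 4-byte integer ID, occupies 10^9 × 20 × 4 bytes = 80 GB. A minimal sketch (the helper function is illustrative, not from either paper; k = 20 and int32 IDs are assumptions consistent with the quoted size):

```python
def knn_graph_bytes(n_vectors: int, k: int, id_bytes: int = 4) -> int:
    """Storage for a k-NN graph: k neighbor IDs per vector,
    id_bytes bytes per ID (4 for int32)."""
    return n_vectors * k * id_bytes

# Deep1B: 10^9 vectors, k = 20 neighbors, int32 IDs
size = knn_graph_bytes(1_000_000_000, 20)
print(size / 1e9)  # 80.0 (GB), matching the figure quoted above
```

Note this counts only the graph edges; NN-Descent's working memory (candidate lists, distances) comes on top of this, which is the overhead the snippet refers to.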
“…When the caching data is used at the current execution stage, the caching file and the index file stored in the local file system of the node are loaded (lines 8-10). Because no cached data exists for the first iteration, the data identified as invariant among the map outputs are stored in the caching file (lines 11-14). At this time, the byte offset of each record in the caching file is recorded in the index file so that the caching data can later be combined with the map outputs.…”
Section: Invariant Data Caching Mechanism
confidence: 99%
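The caching step described above can be sketched outside Hadoop as plain file I/O: invariant records are appended to a local cache file while each record's byte offset is written to a companion index file, so a later iteration can seek directly to any record. All names here (cache.bin, cache.idx, the record layout) are illustrative assumptions, not the cited paper's actual implementation:

```python
import os

def write_cache(records, cache_path, index_path):
    """First iteration: append invariant records to the cache file,
    recording each record's byte offset in the index file."""
    with open(cache_path, "wb") as cache, open(index_path, "w") as index:
        for rec in records:
            index.write(f"{cache.tell()}\n")  # offset before writing this record
            cache.write(rec)

def read_cached(cache_path, index_path, i):
    """Later iterations: seek straight to record i via the index."""
    with open(index_path) as f:
        offsets = [int(line) for line in f]
    with open(cache_path, "rb") as cache:
        cache.seek(offsets[i])
        end = offsets[i + 1] if i + 1 < len(offsets) else os.path.getsize(cache_path)
        return cache.read(end - offsets[i])

# invariant map outputs from a first iteration (illustrative data)
write_cache([b"alpha", b"beta!!", b"gamma"], "cache.bin", "cache.idx")
print(read_cached("cache.bin", "cache.idx", 1))  # b'beta!!'
```

Storing offsets rather than record lengths keeps appends cheap (one `tell()` per record) while still allowing random access on reload.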
“…The maximum number of containers that can be allocated to each node is calculated per node (lines 5-10). Depending on the type of each task, containers are assigned to each node using the node's resource-usage status and the resource-scheduling information (lines 11-18). At this time, the existing container-allocation policy of Hadoop is maintained.…”
Section: Iterative Resource Scheduler
confidence: 99%
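The per-node calculation in the snippet can be sketched as follows. The resource model (memory and vcores per container) mirrors YARN's two-dimensional resources, but the function and parameter names are assumptions for illustration, not the paper's code:

```python
def max_containers(node_mem_mb, node_vcores, container_mem_mb, container_vcores):
    """A node can host only as many containers as its scarcer
    resource (memory or vcores) allows."""
    return min(node_mem_mb // container_mem_mb,
               node_vcores // container_vcores)

# a node with 32 GB RAM and 16 vcores, containers of 2 GB / 1 vcore
print(max_containers(32768, 16, 2048, 1))  # 16: both resources allow exactly 16
```

With only 8 vcores on the same node, vcores become the binding constraint and the count drops to 8 even though memory would allow 16.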