Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies Short Pa 2008
DOI: 10.3115/1557690.1557767

Pairwise document similarity in large collections with MapReduce

Abstract: This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and …
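
The decomposition described in the abstract can be illustrated with a small single-process sketch. This is not the authors' implementation: it assumes raw term frequencies as term weights, runs in memory rather than on a cluster, and the function names (`build_postings`, `map_products`, `reduce_sums`, `pairwise_similarity`) are hypothetical. It only shows how an inner product between two documents splits into per-term multiplications followed by a per-pair summation.

```python
from collections import defaultdict
from itertools import combinations


def build_postings(docs):
    """Indexing step: map each term to a list of (doc_id, weight) postings.

    Assumption: the weight is the raw term frequency.
    """
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))
    return postings


def map_products(postings):
    """Multiplication stage: one partial product per document pair per shared term."""
    for plist in postings.values():
        # Sorting by doc_id keeps the pair key in a canonical order.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2


def reduce_sums(pairs):
    """Summation stage: add the partial products emitted for each document pair."""
    sims = defaultdict(int)
    for key, value in pairs:
        sims[key] += value
    return dict(sims)


def pairwise_similarity(docs):
    return reduce_sums(map_products(build_postings(docs)))


# Toy collection (made up for illustration).
docs = {
    "d1": "a b b c",
    "d2": "b c c",
    "d3": "a d",
}
similarities = pairwise_similarity(docs)
```

Only document pairs that share at least one term ever produce a key, so pairs with zero similarity are never materialized in this sketch.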

Cited by 160 publications (142 citation statements)
References 7 publications
“…This is a classic algorithm that reads in two sets of data, a training set and an experimental set, and finds the k values in the training set closest to each value in the experimental set. It was first presented in [14] and is often used in statistical analysis applications, such as finding pairwise similarity [12].…”
Section: Selection (mentioning)
confidence: 99%
“…Current research looks to push MapReduce by using it to solve harder problems. These include machine learning [7], statistical machine translation [6,11], optimization [20], finance [5], and similarity scoring [12]. MapReduce is a logical choice because it allows the problems to be solved on a loosely coupled set of machines, with less effort than producing custom parallel processing code.…”
Section: Related Work (mentioning)
confidence: 99%
“…Previous works on automatic image annotation using graph-based SSL can be classified into two major technical components: (1) graph construction [4,5,6,7,8,9], and (2) label propagation [10,11,12,13,14,15,16,17,18,19].…”
Section: Previous Work (mentioning)
confidence: 99%
“…However, the use of b-matching may result in a high complexity which may not be practical to large-scale datasets. To remedy the deficiency, Elsayed et al. in [4] proposed to compute pairwise document similarity by MapReduce [20], which is a programming model famous in large-scale distributed computing. Then Lin investigated three algorithms for pairwise similarity comparisons with MapReduce, and showed empirically that the brute force algorithm is the most efficient when exact similarity is desired [6].…”
Section: Previous Work (mentioning)
confidence: 99%