Abstract: This paper presents a MapReduce algorithm for computing pairwise document similarity in large document collections. MapReduce is an attractive framework because it allows us to decompose the inner products involved in computing document similarity into separate multiplication and summation stages in a way that is well matched to efficient disk access patterns across several machines. On a collection consisting of approximately 900,000 newswire articles, our algorithm exhibits linear growth in running time and …
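The decomposition described in the abstract can be illustrated with a minimal single-machine sketch. It is not the authors' implementation: the function names, the toy term weights, and the in-memory dictionaries standing in for distributed map and reduce tasks are all assumptions made for illustration. The idea shown is that the inner product of two document vectors is a sum of per-term products, so one stage can emit the products term by term and a later stage can add them up per document pair.

```python
from collections import defaultdict
from itertools import combinations

def build_postings(docs):
    """Indexing step: map each term to a list of (doc_id, weight) postings."""
    postings = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            postings[term].append((doc_id, weight))
    return postings

def multiply_stage(postings):
    """Multiplication stage: for each term, emit one partial product
    for every pair of documents that share that term."""
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            yield (d1, d2), w1 * w2

def sum_stage(partials):
    """Summation stage: add up the partial products of each document pair."""
    sims = defaultdict(float)
    for pair, product in partials:
        sims[pair] += product
    return dict(sims)

def pairwise_similarity(docs):
    return sum_stage(multiply_stage(build_postings(docs)))

# Toy collection with hypothetical term weights (not data from the paper).
docs = {
    "d1": {"map": 2, "reduce": 1},
    "d2": {"map": 1, "index": 3},
    "d3": {"reduce": 4, "index": 1},
}
sims = pairwise_similarity(docs)
```

Only document pairs that share at least one term ever produce a partial product, so pairs with zero similarity cost nothing in this formulation.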
“…This is a classic algorithm that reads in two sets of data, a training set and an experimental set, and finds the k values in the training set closest to each value in the experimental set. It was first presented in [14] and is often used in statistical analysis applications, such as finding pairwise similarity [12].…”
Section: Selection (mentioning)
confidence: 99%
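The k-nearest-neighbor procedure described in the statement above can be sketched as follows. This is a hedged illustration, not code from the citing paper: it assumes scalar values and absolute difference as the distance, and the names `k_nearest`, `training`, and `experimental` are hypothetical.

```python
import heapq

def k_nearest(training, experimental, k):
    """For each experimental value, return the k training values closest to it.

    Assumption: values are numbers and distance is the absolute difference.
    """
    result = {}
    for x in experimental:
        result[x] = heapq.nsmallest(k, training, key=lambda t: abs(t - x))
    return result

training = [1.0, 4.0, 6.0, 9.0, 12.0]
experimental = [5.0, 10.0]
neighbors = k_nearest(training, experimental, k=2)
```

In a MapReduce setting the distance computations would typically be spread across map tasks and the selection of the k smallest done in the reduce step; the sketch above only shows the selection logic itself.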
“…Current research looks to push MapReduce by using it to solve harder problems. These include machine learning [7], statistical machine translation [6,11], optimization [20], finance [5], and similarity scoring [12]. MapReduce is a logical choice because it allows the problems to be solved on a loosely coupled set of machines, with less effort than producing custom parallel processing code.…”
Section: Related Work (mentioning)
confidence: 99%
“…Hence, we performed a case study of a wide variety of published MapReduce applications and investigated how to break the barrier for each of them. The applications we studied were the following: MapReduce example benchmarks [9]; machine learning benchmarks [7]; statistical machine translation [6,11]; optimization algorithms [20]; finance algorithms [5]; and similarity scoring [12].…”
The MapReduce model uses a barrier between the Map and Reduce stages. This provides simplicity in both programming and implementation. However, in many situations, this barrier hurts performance because it is overly restrictive. Thus, we develop a method to break the barrier in MapReduce in a way that improves efficiency. Careful design of our barrier-less MapReduce framework results in equivalent generality and retains ease of programming. We motivate our case with, and experimentally study our barrier-less techniques in, a wide variety of MapReduce applications divided into seven classes. Our experiments show that our approach can achieve better performance times than a traditional MapReduce framework. We achieve a reduction in job completion times that is 25% on average and 87% in the best case.
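The difference between a barrier and its absence can be illustrated with a small word-count sketch. It is only a single-process illustration under stated assumptions, not the framework described in the abstract: the function names are hypothetical, and the barrier-less variant relies on the reduce operation (addition) being commutative and associative so that records can be folded in as they arrive.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit (word, 1) for every word in a line."""
    for word in line.split():
        yield word, 1

def reduce_with_barrier(lines):
    """Barrier: collect and group all map output before any reduction runs."""
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_words(line):
            grouped[key].append(value)
    # Reduction starts only after every map record has been grouped.
    return {key: sum(values) for key, values in grouped.items()}

def reduce_without_barrier(lines):
    """No barrier: fold each map record into a running partial result on arrival."""
    partial = defaultdict(int)
    for line in lines:
        for key, value in map_words(line):
            partial[key] += value
    return dict(partial)

lines = ["map reduce map", "reduce barrier"]
with_barrier = reduce_with_barrier(lines)
without_barrier = reduce_without_barrier(lines)
```

Both variants produce the same counts; the second never holds the full list of values per key, which is the kind of saving a barrier-less design aims for when the reduce logic permits incremental updates.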
“…Previous works on automatic image annotation using graph-based SSL can be classified into two major technical components: (1) graph construction [4,5,6,7,8,9], and (2) label propagation [10,11,12,13,14,15,16,17,18,19].…”
Section: Previous Work (mentioning)
confidence: 99%
“…However, the use of b-matching may result in high complexity, which may not be practical for large-scale datasets. To remedy this deficiency, Elsayed et al. [4] proposed computing pairwise document similarity with MapReduce [20], a programming model well known in large-scale distributed computing. Lin then investigated three algorithms for pairwise similarity comparisons with MapReduce, and showed empirically that the brute-force algorithm is the most efficient when exact similarity is desired [6].…”
Semi-supervised learning (SSL) is widely used to exploit the vast amount of unlabeled data in the world. Over the past decade, graph-based SSL has become popular in automatic image annotation because of its ability to learn globally from local similarity. However, recent studies have shown that the emergence of large-scale datasets challenges the traditional methods. In addition, most previous works have concentrated on single-label annotation, which may not describe image contents well. To remedy these deficiencies, this paper proposes a new graph-based SSL technique with multi-label propagation, leveraging the distributed computing power of the MapReduce programming model. To improve learning performance, the paper further presents a multi-layer learning structure and a tag refinement approach: the former unifies the visual and textual information of image data during learning, while the latter suppresses noisy tags and emphasizes the remaining tags after learning. Experimental results on a medium-scale and a large-scale image dataset show the effectiveness of the proposed methods.