Pairwise document similarity in large collections with MapReduce

Elsayed, Tamer; Lin, Jimmy; Oard, Douglas W.

doi:10.3115/1557690.1557767

Cited by 160 publications

(142 citation statements)

References 7 publications

Supporting

Mentioning: 141

Contrasting

“…This is a classic algorithm that reads in two sets of data, a training set and an experimental set, and finds the k values in the training set closest to each value in the experimental set. It was first presented in [14] and is often used in statistical analysis applications, such as finding pairwise similarity [12].…”

Section: Selection (mentioning)

confidence: 99%
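
The nearest-neighbor procedure described in the statement above (for each value in the experimental set, find the k closest values in the training set) can be sketched as follows. This is a minimal single-machine illustration, assuming one-dimensional numeric values and absolute difference as the distance; the function name `k_nearest` and the sample data are hypothetical and are not taken from the citing paper.

```python
import heapq


def k_nearest(training, experimental, k):
    """Map each experimental value to its k closest training values.

    Assumption: values are plain numbers and distance is |t - x|.
    """
    result = {}
    for x in experimental:
        # nsmallest behaves like sorted(training, key=...)[:k]
        result[x] = heapq.nsmallest(k, training, key=lambda t: abs(t - x))
    return result


# Hypothetical sample data, for illustration only.
training = [1.0, 4.0, 6.0, 9.0]
experimental = [5.0, 8.5]
neighbors = k_nearest(training, experimental, k=2)
print(neighbors)
```

In a MapReduce setting, the loop over experimental values would typically be distributed across mappers, but that partitioning is not shown here.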

“…Current research looks to push MapReduce by using it to solve harder problems. These include machine learning [7], statistical machine translation [6,11], optimization [20], finance [5], and similarity scoring [12]. MapReduce is a logical choice because it allows the problems to be solved on a loosely coupled set of machines, with less effort than producing custom parallel processing code.…”

Section: Related Work (mentioning)

confidence: 99%

“…Hence, we performed a case study of a wide variety of published MapReduce applications and investigated how to break the barrier for each of them. The applications we studied were the following: MapReduce example benchmarks [9]; machine learning benchmarks [7]; statistical machine translation [6,11]; optimization algorithms [20]; finance algorithms [5]; and similarity scoring [12].…”

Section: Classifying Reduce Operations (mentioning)

confidence: 99%

Breaking the MapReduce stage barrier

et al. 2011

The MapReduce model uses a barrier between the Map and Reduce stages. This provides simplicity in both programming and implementation. However, in many situations, this barrier hurts performance because it is overly restrictive. Thus, we develop a method to break the barrier in MapReduce in a way that improves efficiency. Careful design of our barrier-less MapReduce framework results in equivalent generality and retains ease of programming. We motivate our case with, and experimentally study our barrier-less techniques in, a wide variety of MapReduce applications divided into seven classes. Our experiments show that our approach can achieve better performance times than a traditional MapReduce framework. We achieve a reduction in job completion times that is 25% on average and 87% in the best case.

“…Previous works on automatic image annotation using graph-based SSL can be classified into two major technical components: (1) graph construction [4,5,6,7,8,9], and (2) label propagation [10,11,12,13,14,15,16,17,18,19].…”

Section: Previous Work (mentioning)

confidence: 99%

“…However, the use of b-matching may result in a high complexity which may not be practical to large-scale datasets. To remedy the deficiency, Elsayed et al. in [4] proposed to compute pairwise document similarity by MapReduce [20], which is a programming model famous in large-scale distributed computing. Then Lin investigated three algorithms for pairwise similarity comparisons with MapReduce, and showed empirically that the brute force algorithm is the most efficient when exact similarity is desired [6].…”

Section: Previous Work (mentioning)

confidence: 99%
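
The statement above mentions computing pairwise document similarity with MapReduce but gives no detail. One common way to structure such a computation is sketched below, simulated on a single machine: an indexing phase groups (document, weight) postings by term, and a second phase sums per-term weight products for every pair of documents sharing that term. The dot-product similarity, the function names, and the sample documents are all assumptions made for illustration; this is not presented as the exact method of any paper listed here.

```python
from collections import defaultdict
from itertools import combinations


def map_index(doc_id, term_weights):
    """Map step (assumed): emit one (term, (doc_id, weight)) pair per term."""
    for term, weight in term_weights.items():
        yield term, (doc_id, weight)


def pairwise_similarity(docs):
    """Return {(doc_a, doc_b): dot product of their term weights}."""
    # Phase 1: build postings lists, standing in for the shuffle/group-by-key.
    postings = defaultdict(list)
    for doc_id, term_weights in docs.items():
        for term, posting in map_index(doc_id, term_weights):
            postings[term].append(posting)

    # Phase 2: each term contributes w1 * w2 to every document pair that
    # contains it; summing the contributions plays the role of the reducer.
    scores = defaultdict(float)
    for plist in postings.values():
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            scores[(d1, d2)] += w1 * w2
    return dict(scores)


# Hypothetical documents with hypothetical term weights.
docs = {
    "a": {"x": 1, "y": 2},
    "b": {"x": 3, "z": 1},
    "c": {"y": 4, "z": 5},
}
scores = pairwise_similarity(docs)
print(scores)
```

Pairs of documents that share no term never appear in the output, which is the main saving of this layout over comparing every pair directly.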

Graph-based semi-supervised learning with multi-modality propagation for large-scale image datasets

Lee

Hsieh

et al. 2013

Journal of Visual Communication and Image Representation

Semi-supervised learning (SSL) is widely used to explore the vast amount of unlabeled data in the world. Over the past decade, graph-based SSL has become popular in automatic image annotation due to its power of learning globally based on local similarity. However, recent studies have shown that the emergence of large-scale datasets challenges the traditional methods. On the other hand, most previous works have concentrated on single-label annotation, which may not describe image contents well. To remedy the deficiencies, this paper proposes a new graph-based SSL technique with multi-label propagation, leveraging the distributed computing power of the MapReduce programming model. For high learning performance, the paper further presents both a multi-layer learning structure and a tag refinement approach, where the former unifies both visual and textual information of image data during learning, while the latter simultaneously suppresses noisy tags and emphasizes the other tags after learning. Experimental results based on a medium-scale and a large-scale image dataset show the effectiveness of the proposed methods.

Relevant Filtering in a Distributed Content‐based Publish/Subscribe System

Mouza

Travers

2018

NoSQL Data Models
