MapReduce Parallel Programming Model: A State-of-the-Art Survey

Li, Ren; Hu, Haibo; Li, Heng; Wu, Yingqin; Yang, Jianxi

doi:10.1007/s10766-015-0395-0

Cited by 61 publications

(24 citation statements)

References 77 publications

“…It does not seem possible to compute inundation heights using MrGeo directly, but conceptually the computations are similar. The primary difference between MrGeo and our method is that MrGeo is based on the MapReduce framework and is locked into discrete iterations (Li, Hu, Li, Wu, & Yang, ). This means that all active tiles are started at once, and no new computations can begin until all existing computations are finished.…”

Section: Discussion (mentioning)

confidence: 99%

A distributed approach for calculating inundation height based on Dijkstra's algorithm

Grady

2018

Transactions in GIS

This research proposed a parallelized approach to scaling up the calculation of inundation height, the minimum sea-level rise required to inundate a cell on a digital elevation model, which is based on Dijkstra's algorithm for shortest-path calculations on a graph. Our approach is based on the concepts of spatial decomposition, calculate-and-correct, and a master/worker parallelization paradigm. The approach was tested using the U.S. Coastal Relief Model (CRM) dataset from the National Geophysical Data Center on a multicore desktop computer and various supercomputing resources through the U.S. Extreme Science and Engineering Discovery Environment (XSEDE) program. Our parallel implementation not only enables computations that were larger than previously possible, but also significantly outperforms serial implementations with respect to running time and memory footprint as the number of processing cores increases. The efficiency of the scalability seemed to be tied to tile size and flattened out at a certain number of workers.

During the 20th century, world sea levels rose by 0.17 ± 0.05 m (IPCC, 2007). The Intergovernmental Panel on Climate Change (IPCC) estimates that the rate of sea-level rise will roughly double over the next century due to increasing global temperatures, with a conservative projection of global sea-level rise of 0.18-0.59 m by 2100 (IPCC, 2007). Coastal inundation could have significant impacts, as nearly a quarter of the world's population lives at lower than 100 m and within 100 km of the coast (Nicholls et al., 2011). It is critical to know at what sea-level height and which coastal areas might be inundated in order to predict and mitigate economic and environmental impacts. Using Dijkstra's algorithm, Li et al. (2014) calculated inundation height (the minimum sea-level rise required to inundate a cell) on a raster that had approximately 46 million cells and took almost two hours. The calculations were for one tile from the National Geophysical Data Center (NGDC) Coastal Relief Model (CRM) dataset that has 537 one-degree by one-degree tiles (NOAA National Geophysical Data Center, 2014). Extrapolating for the entire dataset suggests that the amount of time and memory needed for the existing approach would not be feasible without specialized, high-memory hardware. Even if a machine were able to handle the large data size, the running time required to perform the ...

... at 1.4 GHz in a single socket. Each node has 96 GB of DDR4 memory plus 16 GB high-speed MCDRAM and includes approximately 100 GB of SSD storage locally. There are three shared Lustre file systems available for each node, two with quotas of 10 GB and 1 TB per user and the third with approximately 30 PB of aggregate storage (TACC, 2018).

Wrangler-TACC: Wrangler-TACC nodes are Dell R730 servers with two Intel Haswell E5-2680-v3 CPUs with 12 cores each running at 2.5 GHz, 128 GB of DDR4 memory, and 146 GB of local storage for the operating system. Each node has a 10 PB Lustre file system and 0.5 PB of shared flash storage high-performance para...

“…Hadoop is a parallel and distributed processing platform that uses the MapReduce computing paradigm [31,32] to uniformly distribute the computing tasks across data nodes to rapidly process large amounts of data on the Hadoop distributed file system (HDFS) [33]. MapReduce simplifies data processing using two functions, i.e., map and reduce.…”

Section: Related Work (mentioning)

confidence: 99%

“…The map function separates the data input into key-value pairs. It subsequently uses the computational power of the data nodes to process the key-value pairs and returns a set of intermediate key-value pairs to the reduce function for obtaining the results [32].…”

Section: Related Work (mentioning)

confidence: 99%
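The map and reduce roles described in the two statements above can be illustrated with a minimal single-process sketch. This is not Hadoop code and is not taken from the cited papers: the word-count task and the names `map_words`, `reduce_counts`, and `run_mapreduce` are assumptions chosen only for illustration, and the in-memory grouping stands in for the shuffle step that a real cluster performs across data nodes.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: turn one input record into intermediate (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values that share one key into a single result."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run map, group by key (a stand-in for the shuffle), then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_mapreduce(["map reduce map", "reduce shuffle"], map_words, reduce_counts)
print(counts)
```

In a distributed setting each call to the mapper could run on a different node over a different slice of the input, which is why the mapper here depends only on its own record.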

Recognizing Indonesian Acronym and Expansion Pairs with Supervised Learning and MapReduce

et al. 2020

During the previous decades, intelligent identification of acronym and expansion pairs from a large corpus has garnered considerable research attention, particularly in the fields of text mining, entity extraction, and information retrieval. Herein, we present an improved approach to recognize the accurate acronym and expansion pairs from a large Indonesian corpus. Generally, an acronym can be either a combination of uppercase letters or a sequence of speech sounds (syllables). Our proposed approach can be computationally divided into four steps: (1) acronym candidate identification; (2) acronym and expansion pair collection; (3) feature generation; and (4) acronym and expansion pair recognition using supervised learning techniques. Further, we introduce eight numerical features and evaluate their effectiveness in representing the acronym and expansion pairs based on the precision, recall, and F-measure. Furthermore, we compare the k-nearest neighbors (K-NN), support vector machine (SVM), and bidirectional encoder representations from transformers (BERT) algorithms in terms of accurate acronym and expansion pair classification. The experimental results indicate that the SVM polynomial model that considers eight features exhibits the highest accuracy (97.93%), surpassing those of the SVM polynomial model that considers five features (90.45%), the K-NN algorithm with k = 3 that considers eight features (96.82%), the K-NN algorithm with k = 3 that considers five features (95.66%), BERT-Base model (81.64%), and BERT-Base Multilingual Cased model (88.10%). Moreover, we analyze the performance of the Hadoop technology using various numbers of data nodes to identify the acronym and expansion pairs and obtain their feature vectors. The results reveal that the Hadoop cluster containing a large number of data nodes is faster than that with fewer data nodes when processing from ten million to one hundred million pairs of acronyms and expansions.

Information 2020, 11, 210

... predicted to considerably increase with the increasing number of smartphone users, which is predicted to reach 6.1 billion users in 2020 [9]. Furthermore, new online trading trends have contributed to the rapid accumulation of records in databases by ecommerce companies, such as Alibaba and Amazon, which generate and store several terabytes of data every day [7]. The analysis of a large amount of data requires machine learning techniques to automate the creation of analytical models based on historical data and then use the model for learning from the data [10], discovering useful patterns [11], and performing automated decisions with little human intervention [12]. Many queries are posed by millions of users from across the globe each day on Google's search engine, which has attracted considerable attention from researchers who have analyzed the query logs using machine learning techniques to track and predict phenomena, including the spread patterns of flu symptoms in the United States [5,6]. The web search queries are considered to be less biased a...

“…The distribution of the large amount of data implies parallel computing since the same computations are performed on each CPU, but with a different dataset. (Li et al, 2015).…”

Section: Mapreduce Parallel Programming Model (mentioning)

confidence: 99%

Using Mapreduce for Efficient Parallel Processing of Continuous K nearest Neighbors in Road Networks

Ferchichi, Akaichi

2016

JSSD

The problem of searching the continuous k Nearest Neighbor (CkNN) objects in road networks is a major challenge due to the highly dynamic nature of the road network environment. Also, the fast-increasing number of moving objects poses a big challenge to the CkNN search of moving objects. In addition, it is important to deliver a valid response to the user in an optimal time while taking into account the large volume of data and the amount of changes in the characteristics of moving objects. To effectively explore the search space as well as reduce the time spent to deliver a response to the user, we propose to combine the strengths of Formal Concept Analysis (FCA), as a powerful means of clustering the moving objects-related information, and the processing capabilities of MapReduce, as a well-known parallel programming model. The mathematical foundation of FCA allows offering an abstraction of the network based on the neighborhoods. We build the concept lattice based on the binary relations between the target points as well as their properties. The latter are collected from various sensors on the road network. We also propose a density-based road network partitioning approach and MapReduce function to distribute the search tasks. Finally, an implementation based on the Storm parallel programming model is discussed to show the effectiveness of our FCA-based solution.
