2015
DOI: 10.1007/s11227-015-1447-3
Optimizing the Hadoop MapReduce Framework with high-performance storage devices

Cited by 27 publications (8 citation statements) | References 7 publications
“…Hadoop is a multi-tasking system that can process multiple data sets for multiple jobs in a multi-user environment across multiple machines at the same time [42] [43]. Each MapReduce job consists of multiple processes submitting I/Os concurrently for the Map, Shuffle, and Reduce stages, each of which has skewed I/O requirements [44] [45] [46]. The Hadoop Distributed File System (HDFS) uses a block-structured file system to deliver reliable storage [13] [43]. YARN (Yet Another Resource Negotiator) acts as a per-application resource-negotiating agent and provides a centralized platform for ensuring consistency and data manageability.…”
Section: Hadoop Ecosystem and MapReduce
confidence: 99%
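The Map, Shuffle, and Reduce stages quoted above can be sketched in miniature as an in-memory word count. This is a hypothetical illustration of the dataflow only, not Hadoop's actual distributed implementation (which streams over HDFS blocks and runs tasks under YARN):

```python
from collections import defaultdict

def map_phase(documents):
    """Map stage: emit (word, 1) pairs for every word in every input record."""
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle stage: group intermediate (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce stage: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical input records standing in for HDFS blocks.
docs = ["hadoop stores data in blocks",
        "hadoop schedules map and reduce tasks"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["hadoop"])  # → 2
```

The skewed I/O the excerpt mentions arises because each stage has a different access pattern: maps read large sequential blocks, the shuffle produces many small intermediate writes, and reduces merge and write final output.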
“…Due to the physical limitations of HDDs, there have been recent efforts [1,2,11,18-21] to incorporate flash-based storage such as SSDs in data centers. High-speed, non-volatile storage devices like SSDs, typically referred to as SCMs (storage-class memories), access data via electrical signals, as opposed to the physical disk-arm movement of HDDs [3,9].…”
Section: Secondary Storage (Block Device) Characteristics
confidence: 99%
“…Since deletion (erase) happens at the granularity of blocks, a single page update requires a complete block erase and an out-of-place write. These result in unwanted phenomena such as write amplification (wear-leveling) and garbage collection (faulty-block management) [9,19,24]. These activities consume considerable CPU time, and the SSD controller and the file system take on additional work, such as bookkeeping, beyond simple data access.…”
Section: Secondary Storage (Block Device) Characteristics
confidence: 99%
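The out-of-place writes and block-granularity erases described in this excerpt can be made concrete with a toy flash-translation-layer model. All names and parameters below are hypothetical (a real FTL erases one victim block at a time and tracks wear per block); the sketch only shows why physical writes exceed logical writes, i.e. write amplification:

```python
PAGES_PER_BLOCK = 4

class TinyFTL:
    """Toy flash translation layer over a small SSD (hypothetical model).

    Flash pages are write-once, so every logical update goes out-of-place
    to a fresh page; when no free page remains, garbage collection copies
    the live pages out, erases the blocks, and copies them back. Those
    extra copies are what inflate physical writes over logical writes."""

    def __init__(self, num_blocks=2):
        self.pages = [None] * (num_blocks * PAGES_PER_BLOCK)  # physical pages
        self.map = {}            # logical page number -> physical page index
        self.logical_writes = 0
        self.physical_writes = 0
        self.erases = 0

    def write(self, logical, data):
        self.logical_writes += 1
        if None not in self.pages:        # device full of live + stale pages
            self._garbage_collect()
        free = self.pages.index(None)     # next free physical page
        self.pages[free] = data
        self.map[logical] = free          # the previous copy is now stale
        self.physical_writes += 1

    def _garbage_collect(self):
        live = {lp: self.pages[pp] for lp, pp in self.map.items()}
        self.pages = [None] * len(self.pages)      # erase (simplified: all blocks)
        self.erases += len(self.pages) // PAGES_PER_BLOCK
        for lp, data in live.items():              # copy live pages back
            free = self.pages.index(None)
            self.pages[free] = data
            self.map[lp] = free
            self.physical_writes += 1              # GC traffic: amplification

ssd = TinyFTL()                  # 2 blocks x 4 pages = 8 physical pages
for version in range(10):        # update the same logical page 10 times
    ssd.write(0, f"v{version}")

amplification = ssd.physical_writes / ssd.logical_writes
print(ssd.physical_writes, ssd.erases)   # 11 physical writes, 2 erases
```

Ten logical updates of a single page cost eleven physical writes and two block erases here, which is the bookkeeping and CPU overhead the excerpt attributes to the SSD controller and file system.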
“…In this paper, a big-data platform is built to collect and store real-time dynamic steelmaking production data, and the Hadoop Distributed File System (HDFS) is used to realize virtual resource storage of the big data in the steelmaking process. A convolutional neural network (CNN) algorithm [3] is used to predict the composition of steel slag, and the prediction provides a basic input for the steel-slag resource application recommendation system. On this basis, a steel-slag resource utilization system grounded in big data is established, which will ultimately provide new ideas for steel-slag treatment and application in steel companies.…”
Section: Introduction
confidence: 99%