A Micro-benchmark Suite for Evaluating HDFS Operations on Modern Clusters

Islam, Nusrat Sharmin; Lu, Xiaoyi; Wasi-ur-Rahman, Md.; Jose, Jithin; Panda, Dhabaleswar K.

doi:10.1007/978-3-642-53974-9_12

Cited by 13 publications

(10 citation statements)

References 19 publications


“…For all tested data sizes the writing times are at least 2.2 times, both in time and throughput, slower than the reading times (column CDH Read/Write Δ) and the difference grows further with the increase of the data size. Similar behavior is observed in the results presented by Nicholas Wakou [25] (around 2.5 times slower writing times) and Islam et al [26].…”

Section: B Enhanced DFSIO (supporting)

confidence: 91%


Performance Evaluation of Enterprise Big Data Platforms with HiBench

Ivanov, Niemann, Izberovic, et al. 2015

2015 IEEE Trustcom/BigDataSE/Ispa


In this paper, we evaluate the performance of DataStax Enterprise (DSE) using the HiBench benchmark suite and compare it with the corresponding results for Cloudera's Distribution of Hadoop (CDH). Both systems, DSE and CDH, were stress tested using CPU-bound (WordCount), I/O-bound (Enhanced DFSIO) and mixed (HiveBench) workloads. The experimental results showed that DSE is better than CDH at writing files, whereas CDH is better than DSE at reading files. Additionally, for DSE the difference between read and write throughput is very minor, whereas for CDH the read throughput is much higher than the write throughput. The results we obtained show that the HiBench benchmark suite, developed specifically for Hadoop, can be successfully executed on top of DataStax Enterprise (DSE).


“…To process data sizes of 240 GB, 340 GB and 440 GB, the parameters for the file sizes were fixed to 400 MB and the parameters for the number of files to read and write were set to 615, 871 and 1127. The file size and number of files were chosen based on a results presented in related work [25], [26].…”

Section: B Enhanced DFSIO (mentioning)

confidence: 99%

Performance Evaluation of Enterprise Big Data Platforms with HiBench

Ivanov, Niemann, Izberovic, et al. 2015

2015 IEEE Trustcom/BigDataSE/Ispa
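
The parameter choices in the statement above can be checked with a little arithmetic. The following is a minimal sketch, assuming binary units (1 GB = 1024 MB), which the quoted statement does not specify; the helper name `total_size_gb` is hypothetical and not taken from HiBench or the cited papers.

```python
# Assumption: binary units, 1 GB = 1024 MB (not stated in the quoted text).
MB_PER_GB = 1024


def total_size_gb(num_files: int, file_size_mb: int = 400) -> float:
    """Total data volume in GB for num_files files of file_size_mb MB each."""
    return num_files * file_size_mb / MB_PER_GB


# File counts quoted in the citation statement, with the fixed 400 MB file size.
for num_files in (615, 871, 1127):
    print(f"{num_files} files x 400 MB = {total_size_gb(num_files):.1f} GB")
```

Under that assumption the three file counts give totals slightly above the nominal 240 GB, 340 GB and 440 GB targets, which is consistent with the counts having been rounded up to whole files.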


“…The skew in this computational load mainly originates from the characteristics of the map/reduce function and the input dataset, which in turn could affect the number of key/value pairs or records generated by both the map and the reduce tasks. In this paper, we are mainly concerned with the communication characteristics of the Hadoop MapReduce workloads. Therefore, for simplicity, we assume that all processes incur a similar computational load.…”

Section: Characterization Methodology (mentioning)

confidence: 99%

“…3. In addition to the above benchmarks that address the Hadoop framework as a whole, several microbenchmark suites have been designed to study individual components of the Apache Hadoop framework, such as Hadoop RPC [34], and Hadoop Distributed File Systems (HDFS) [24], and particularly Hadoop MapReduce [49], an extended version of which is presented in this paper.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Characterizing and benchmarking stand-alone Hadoop MapReduce on modern HPC clusters

Shankar, Wasi-ur-Rahman, et al. 2016

J Supercomput

Self Cite


With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, which are equipped with high-speed interconnects such as InfiniBand and 10/40 GigE, and storage systems such as SSDs and Lustre, it is essential to study the MapReduce component in an isolated manner. In this paper, we study popular MapReduce workloads, obtained from well-accepted, comprehensive benchmark suites, to identify common shuffle data distribution patterns. We determine different environmental and workload-specific factors that affect the performance of the MapReduce job. Based on these characterization studies, we propose a microbenchmark suite that can be used to evaluate the performance of stand-alone Hadoop MapReduce, and demonstrate its ease of use with different networks/protocols, Hadoop distributions, and storage architectures. Performance evaluations with our proposed micro-benchmarks show that stand-alone Hadoop MapReduce over IPoIB performs better than 10 GigE by about 13-15%, and the RDMA-enhanced hybrid MapReduce design can achieve up to 43% performance improvement over default Hadoop MapReduce over IPoIB, in both shared-nothing and shared storage architectures.


“…In this section, we have evaluated our design using the HDFS microbenchmark of Sequential Write Latency (SWL) [2]. Figure 4(a) shows the performance of our design on Cluster A.…”

Section: Evaluation Using HDFS Microbenchmark (mentioning)

confidence: 99%

SOR-HDFS

Islam, Rahman, et al. 2014

Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing

Self Cite


In this paper, we propose SOR-HDFS, a SEDA (Staged Event-Driven Architecture)-based approach to improve the performance of the HDFS Write operation. This design not only incorporates RDMA-based communication over InfiniBand but also maximizes overlapping among different stages of data transfer and I/O. Performance evaluations show that the new design improves the aggregated write throughput of the Enhanced DFSIO benchmark in Intel HiBench by up to 64% and reduces the job execution time by 37% compared to IPoIB (IP over InfiniBand). Compared to the previous best RDMA-enhanced design [4], the improvements in throughput and execution time are 30% and 20%, respectively. Our design can also improve the performance of the HBase Put operation by up to 53% over IPoIB and 29% compared to the previous best RDMA-enhanced HDFS. To the best of our knowledge, this is the first design of SEDA-based HDFS in the literature.
