The performance gap between compute and storage is considerable, resulting in a mismatch between what applications need from storage and what storage can deliver. The full potential of storage devices cannot be harnessed until all layers of the I/O hierarchy function efficiently. Despite advanced optimizations applied across the various layers along the data-access path, the I/O stack remains a source of unpredictability, and the problems caused by inefficient data management are amplified in shared Big Data environments. The Linux (host) block layer is the most critical part of the I/O hierarchy, as it orchestrates I/O requests from different applications to the underlying storage. Unfortunately, despite its significance, the block layer, and in particular the block I/O scheduler, has not evolved to meet the needs of Big Data. We have designed and developed two contention-avoidance storage solutions in the Linux block layer, collectively known as "BID: Bulk I/O Dispatch", specifically for multi-tenant, multi-tasking shared Big Data environments. Hard disk drives (HDDs) form the backbone of data center storage. Data access time in HDDs is largely governed by disk-arm movements, which typically occur when data is not accessed sequentially. Big Data applications exhibit clear sequentiality, but contention among concurrently submitting applications multiplexes their I/O accesses, leading to more disk-arm movements. The BID schemes exploit the inherent I/O sequentiality of Big Data applications to improve overall I/O completion time by eliminating avoidable disk-arm movements. In the first part, we propose BID-HDD, a dynamically adaptable block I/O scheduling scheme for disk-based storage. BID-HDD recreates sequentiality in I/O access in order to provide performance isolation to each I/O-submitting process. Through trace-driven, simulation-based experiments with cloud-emulating MapReduce benchmarks, we show that BID-HDD completes all I/O requests in 28-52% less time than the best-performing Linux disk schedulers. In the second part, we propose BID-Hybrid, a hybrid scheme that exploits the superior random-access performance of SCMs (SSDs) to further reduce contention at disk-based storage. BID-Hybrid offloads non-bulky interruptions from the HDD request queue to the SSD queue, using BID-HDD for disk request processing and a multi-queue FIFO architecture for the SSD. This yields performance gains of 6-23% for MapReduce workloads over BID-HDD and 33-54% over the best-performing Linux scheduling scheme. Taken together, the BID schemes avoid contention for disk-based storage I/O within system constraints and without compromising SLAs.
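To make the bulk-dispatch idea concrete, the following is a minimal, illustrative Python sketch of per-process request staging: incoming block requests are held in per-process queues and released as LBA-ordered bulks, so that one process's sequential stream is not interleaved with another's. The class, names, and dispatch policy (picking the largest backlog) are assumptions for illustration only and do not reproduce the actual BID-HDD scheduler, its dynamic adaptation, or its deadline handling.

```python
# Illustrative sketch of the bulk-dispatch idea behind BID-HDD: instead of
# interleaving requests from different processes (which multiplexes otherwise
# sequential accesses and forces extra disk-arm movement), pending requests
# are staged per process and released as contiguous "bulks". All names and
# the selection policy here are hypothetical, not the authors' implementation.
from collections import defaultdict, namedtuple

Request = namedtuple("Request", ["pid", "lba", "size"])

class BulkDispatcher:
    def __init__(self):
        self.queues = defaultdict(list)  # per-process staging queues

    def submit(self, req):
        """Stage an incoming block request in its owner's queue."""
        self.queues[req.pid].append(req)

    def dispatch(self):
        """Release one process's requests as a single LBA-ordered bulk."""
        if not self.queues:
            return []
        # Pick the process with the largest backlog (most likely a bulky,
        # sequential Big Data stream); the real scheme adapts this dynamically.
        pid = max(self.queues, key=lambda p: len(self.queues[p]))
        return sorted(self.queues.pop(pid), key=lambda r: r.lba)

# Example: two processes whose sequential streams would otherwise interleave.
sched = BulkDispatcher()
for i in range(4):
    sched.submit(Request(pid=1, lba=1000 + i, size=8))
    sched.submit(Request(pid=2, lba=9000 + i, size=8))
print([r.lba for r in sched.dispatch()])  # one process's run, kept contiguous
print([r.lba for r in sched.dispatch()])  # then the other's
```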
We design and develop LDM, a novel data management solution that caters to the needs of applications exhibiting the lineage property, i.e., applications in which current writes are future reads. In this class of applications, slow writes significantly hurt overall job performance, since current writes determine the fate of subsequent reads. We argue that in a large-scale shared production cluster, the issues arising from data management can be mitigated much higher in the I/O path, even before data access requests are made. In contrast to current data management solutions, which are mostly reactive and/or heuristic, LDM is both deterministic and proactive. We develop block-graphs, which enable LDM to capture complete time-based data-task dependency associations and to use them for life-cycle management through tiering of data blocks. LDM amalgamates information from across the data center ecosystem, from the application code to file system mappings and the compute and storage device topology, to make oracle-like deterministic data management decisions. In trace-driven experiments, LDM achieves a 29-52% reduction in overall data center workload execution time. Moreover, deploying LDM with extensive pre-processing creates efficient data consumption pipelines, which also reduces write and read delays significantly.
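As an illustration of the block-graph idea, the sketch below shows a hypothetical, time-annotated mapping from written blocks to the tasks expected to read them, queried to select blocks whose next read is imminent as candidates for the fast tier. The data structure, names, and selection rule are assumptions made for this sketch; they are not LDM's actual block-graph representation or tiering policy.

```python
# Hypothetical sketch of an LDM-style "block-graph": a time-annotated mapping
# from data blocks to the tasks that will read them, used to decide proactively
# which blocks to place (or promote) on the fast tier before the reads arrive.
# All names and the horizon-based rule are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class BlockNode:
    block_id: str
    writer_task: str
    # (reader_task, expected_read_time) pairs derived from the job's lineage
    future_readers: list = field(default_factory=list)

class BlockGraph:
    def __init__(self):
        self.nodes = {}

    def record_write(self, block_id, writer_task):
        self.nodes[block_id] = BlockNode(block_id, writer_task)

    def record_dependency(self, block_id, reader_task, expected_read_time):
        self.nodes[block_id].future_readers.append((reader_task, expected_read_time))

    def tiering_plan(self, now, horizon):
        """Return blocks whose next read falls within `horizon`: fast-tier candidates."""
        plan = []
        for node in self.nodes.values():
            upcoming = [t for _, t in node.future_readers if t >= now]
            if upcoming and min(upcoming) - now <= horizon:
                plan.append(node.block_id)
        return plan

# Example: a map-output block that a reduce task will read shortly is promoted.
g = BlockGraph()
g.record_write("blk_042", writer_task="map_7")
g.record_dependency("blk_042", reader_task="reduce_2", expected_read_time=120.0)
print(g.tiering_plan(now=100.0, horizon=60.0))  # -> ['blk_042']
```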