Intermediate Data Caching Optimization for Multi-Stage and Parallel Big Data Frameworks

Yang, Zhengyu; Jia, Danlin; Ioannidis, Stratis; Mi, Ningfang; Sheng, Bo

doi:10.1109/cloud.2018.00042

Cited by 32 publications

(14 citation statements)

References 36 publications

(51 reference statements)


“…MR-SPS [21] designs a scalable parallel scheduling algorithm which improves scalability and performance of a cluster by managing workload and data locality. Studies [22]- [24] further investigate storage-related resource management problems, in order to improve the system performance bottlenecked by I/Os. BGMRS [25] is a MapReduce Scheduler based on the Bipartite Graph model.…”

Section: Related Work (mentioning)

confidence: 99%

New YARN Non-Exclusive Resource Management Scheme through Opportunistic Idle Resource Assignment

Yang, Yao, Gao, et al. 2021

IEEE Trans. Cloud Comput.

Self Cite


Managing resources and improving throughput in a large-scale cluster has become a crucial problem with the explosion of data processing applications in recent years. Hadoop YARN and Mesos, as two universal resource management platforms, have been widely adopted in commodity clusters for co-deploying multiple data processing frameworks, such as Hadoop MapReduce and Apache Spark. However, in existing resource management, a certain amount of resources is exclusively allocated to a running task and can only be reassigned after that task is completed. This exclusive mode can under-utilize the cluster resources and degrade system performance. To address this issue, we propose a novel opportunistic and efficient resource allocation scheme, named OPERA, which breaks the barriers among the encapsulated resource containers by leveraging the knowledge of actual runtime resource utilization to reassign opportunistically available resources to pending tasks. OPERA avoids incurring severe performance interference to active tasks by further using two approaches to efficiently balance the starvation of reserved tasks and normal queued tasks. We implement and evaluate OPERA in Hadoop YARN v2.5. Our experimental results show that OPERA significantly reduces the average job execution time and increases resource (CPU and memory) utilization.


“…They found that in-memory data analytics has some constraints with respect to limitations and performance. Yang et al [19] studied Apache Spark for data caching optimization with respect to big data analytics. They found that its RDD feature is very useful in this regard.…”

Section: Related Work (mentioning)

confidence: 99%

Distributed Computing Engines for Big Data Analytics

Prashanthi, Sowjanya, Madhuri

2019

IJRTE


Technologies like cloud computing paved the way for dealing with massive amounts of data. Prior to the cloud, this was not possible without investing large amounts in computing resources. Now there is an ecosystem that is conducive to storing and processing voluminous data that cannot be handled by local computing resources. With such an ecosystem, big data technology came into existence. Big data is data characterized by volume, velocity, veracity and variety. This has enabled enterprises to give more value to every piece of data, which in turn has led to increased usage of the cloud for both storage and processing. Processing big data requires efficient technologies. A programming paradigm like MapReduce, with the Hadoop distributed programming framework, is widely used. However, there are other emerging frameworks, like Apache Spark and Apache Flink, that handle big data more efficiently. In this paper, an empirical study is made of three frameworks (Hadoop, Apache Spark and Apache Flink) with different parameters, such as type of network, HDFS block size, input data size and other configuration changes. The experimental results, evaluated with different benchmark big data workloads, revealed that Apache Spark and Apache Flink outperform Hadoop.


“…80% reduction in the distance, on average, was achieved compared to the distance obtained by direct transmission. Several studies discussed QoS routing in WSNs including [19], [20], [21]. The study in [19] presented a multi-objective genetic algorithm for efficient QoS routing in two tiered WSNs.…”

Section: Related Work (mentioning)

confidence: 99%

Adaptive Simulated Evolution based Approach for Cluster Optimization in Wireless Sensor Networks

Alsayyari

2018

IJACSA


Energy consumption minimization is crucial for the constrained sensors in wireless sensor networks (WSNs). Partitioning WSNs into an optimal set of clusters is a promising technique used to minimize energy consumption and to increase the lifetime of the network. However, optimizing the network into an optimal set of clusters is an NP-hard problem, and the time needed to solve such a problem increases exponentially as the number of sensors increases. In this paper, the simulated evolution (SimE) algorithm is engineered to tackle the problem of cluster optimization in WSNs. A goodness measure is developed to measure the accuracy of assigning nodes to clusters and to evaluate the clustering quality of the overall network. SimE was developed such that the number of clusters and cluster heads adapts to the number of alive nodes in the network. Extensive simulation results demonstrate that SimE provides near-optimal clustering and improves the lifetime of the network by about 21% compared to the traditional LEACH-C protocol.

