Abstract. Leveraging data in distributed caches for large-scale query processing applications is becoming more important, given the current trend toward building large, scalable distributed systems by connecting many heterogeneous, less powerful machines rather than purchasing expensive, very powerful homogeneous machines. As more servers are added to such clusters, more memory becomes available for caching data objects across the distributed machines. However, the cached objects are dispersed, and traditional query scheduling policies that consider only load balancing do not effectively utilize the increased cache space. We propose a new multi-dimensional range query scheduling policy for distributed query processing frameworks, called DEMB, that employs a probability distribution estimated from recent queries. DEMB accounts for both load balancing and the availability of distributed cached objects to improve the cache hit rate for queries, thereby decreasing query turnaround time and increasing throughput. We experimentally demonstrate that DEMB produces better query plans and lower query response times than other query scheduling policies.
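To make the idea concrete, the sketch below illustrates one way a scheduler could estimate a query distribution from recent range queries and partition the data space so that expected query load is even across servers. It is a minimal illustration under simplifying assumptions (1-D ranges, an equal-width histogram); the function names and the partitioning step are hypothetical and are not taken from the DEMB paper.

```python
# Illustrative sketch only: a histogram-based estimate of the query
# distribution from recent 1-D range queries, in the spirit of the
# probability-distribution estimation the abstract describes. The names
# (estimate_query_distribution, partition_by_probability) are hypothetical,
# not the paper's actual DEMB implementation.

from bisect import bisect_right

def estimate_query_distribution(recent_queries, domain, num_bins):
    """Build a normalized histogram of how often each bin of the domain
    is touched by recent range queries [(lo, hi), ...]."""
    lo_d, hi_d = domain
    width = (hi_d - lo_d) / num_bins
    counts = [0.0] * num_bins
    for q_lo, q_hi in recent_queries:
        first = max(0, int((q_lo - lo_d) // width))
        last = min(num_bins - 1, int((q_hi - lo_d) // width))
        for b in range(first, last + 1):
            counts[b] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts]

def partition_by_probability(hist, num_servers):
    """Split the histogram into contiguous slices of roughly equal
    probability mass, one slice per server, so expected query load is even."""
    cumulative, acc = [], 0.0
    for p in hist:
        acc += p
        cumulative.append(acc)
    boundaries = []
    for s in range(1, num_servers):
        target = s / num_servers
        boundaries.append(bisect_right(cumulative, target))
    return boundaries  # bin indices where each server's slice ends

# Example: queries concentrated near the low end of the domain [0, 100)
recent = [(0, 10), (5, 15), (2, 8), (60, 70)]
hist = estimate_query_distribution(recent, (0, 100), num_bins=20)
print(partition_by_probability(hist, num_servers=4))
```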
As modern large-scale systems are built from a large number of independent small servers, it is becoming more important to orchestrate and seamlessly leverage the large amount of buffer cache memory distributed across them. Several previous studies have shown that, with large-scale distributed caching facilities, traditional resource scheduling policies often fail to achieve both a high cache hit ratio and good system load balance. A scheduling policy that considers only system load results in a low cache hit ratio, while a policy that emphasizes cache hit ratio over load balance suffers from load imbalance. To maximize overall system throughput, distributed caching facilities must balance workloads and leverage cached data at the same time. In this work, we present a distributed job processing framework that yields a high cache hit ratio while maintaining good system load balance, the two most critical factors for improving overall system throughput and job response time. Our framework is a component-based distributed data analysis framework that supports multiple geographically distributed job schedulers. The job scheduler in our framework employs a distributed job scheduling policy, DEMA, that considers both cache hit ratio and system load. In this paper, we show that collaborative task scheduling can further improve performance by increasing the overall cache hit ratio while maintaining load balance. Our experiments show that the proposed job scheduling policies outperform a legacy load-based job scheduling policy in terms of job response time, load balancing, and cache hit ratio.
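The sketch below illustrates the general trade-off this abstract describes: scoring each server by a weighted combination of estimated cache affinity and current load before assigning a job. The class name, the linear weighting with a parameter alpha, and the data layout are assumptions for illustration only and do not reproduce the actual DEMA algorithm.

```python
# Illustrative sketch only: a scheduler that weighs estimated cache affinity
# against current server load when assigning a job, capturing the trade-off
# between cache hit ratio and load balance described in the abstract.
# CacheAwareScheduler and the alpha weighting are assumptions, not DEMA itself.

class CacheAwareScheduler:
    def __init__(self, servers, alpha=0.7):
        # servers: {server_id: {"cache": set_of_object_ids, "load": queued_jobs}}
        self.servers = servers
        self.alpha = alpha  # weight of cache affinity vs. load balance

    def score(self, server, job_objects):
        # Fraction of the job's objects already cached on this server.
        cached = len(job_objects & server["cache"]) / max(len(job_objects), 1)
        # Penalize servers that are heavily loaded relative to the busiest one.
        max_load = max(s["load"] for s in self.servers.values()) or 1
        load_penalty = server["load"] / max_load
        return self.alpha * cached - (1 - self.alpha) * load_penalty

    def assign(self, job_objects):
        """Pick the server with the best combined score and update its load."""
        best_id = max(self.servers,
                      key=lambda sid: self.score(self.servers[sid], job_objects))
        self.servers[best_id]["load"] += 1
        return best_id

# Example: a job needing objects {"a", "b"} favors server s1, which caches
# both, unless s1 becomes far more loaded than the alternatives.
servers = {
    "s1": {"cache": {"a", "b"}, "load": 3},
    "s2": {"cache": {"c"}, "load": 1},
}
sched = CacheAwareScheduler(servers)
print(sched.assign({"a", "b"}))
```

Lowering alpha in this sketch shifts the policy toward pure load balancing, while raising it shifts toward cache affinity, which is the spectrum the two abstracts argue must be balanced.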