E-science projects of various disciplines face a fundamental challenge: thousands of users want to obtain new scientific results by application-specific and dynamic correlation of data from globally distributed sources. Given the enormous and exponentially growing data volumes involved, centralized data management reaches its limits. Since scientific data are often highly skewed and exploration tasks exhibit a large degree of spatial locality, we propose the locality-aware allocation of data objects onto a distributed network of interoperating databases. HiSbase is an approach to data management in scientific federated Data Grids that addresses the scalability issue by combining established techniques of database research in the field of spatial data structures (quadtrees), histograms, and parallel databases with the scalable resource sharing and load balancing capabilities of decentralized Peer-to-Peer (P2P) networks. The proposed combination constitutes a complementary e-science infrastructure enabling load balancing and increased query throughput.
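To illustrate the idea of locality-aware partitioning over skewed spatial data, the following minimal sketch splits a two-dimensional region quadtree-style until no leaf holds more than a given number of points. It is not the HiSbase implementation: the names `Region` and `split_quadtree`, the `capacity` threshold, and the `max_depth` limit are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Region:
    # Axis-aligned rectangle [x0, x1) x [y0, y1) plus the points it contains.
    x0: float
    y0: float
    x1: float
    y1: float
    points: list = field(default_factory=list)


def split_quadtree(region, capacity, max_depth=16):
    """Return leaf regions holding at most `capacity` points each.

    Splitting stops early when `max_depth` is exhausted, so a leaf can
    exceed `capacity` if many points share (almost) the same position.
    """
    if len(region.points) <= capacity or max_depth == 0:
        return [region]

    mx = (region.x0 + region.x1) / 2
    my = (region.y0 + region.y1) / 2
    quadrants = [
        Region(region.x0, region.y0, mx, my),  # lower left
        Region(mx, region.y0, region.x1, my),  # lower right
        Region(region.x0, my, mx, region.y1),  # upper left
        Region(mx, my, region.x1, region.y1),  # upper right
    ]
    for x, y in region.points:
        index = (1 if x >= mx else 0) + (2 if y >= my else 0)
        quadrants[index].points.append((x, y))

    leaves = []
    for quadrant in quadrants:
        leaves.extend(split_quadtree(quadrant, capacity, max_depth - 1))
    return leaves


# Skewed input: a dense cluster near the origin plus two isolated points.
cluster = [(i / 1000, j / 1000) for i in range(10) for j in range(10)]
points = cluster + [(0.9, 0.9), (0.6, 0.2)]
leaves = split_quadtree(Region(0.0, 0.0, 1.0, 1.0, points), capacity=10)
```

Dense areas end up covered by many small leaves and sparse areas by a few large ones, so each leaf carries a comparable amount of data. In a distributed setting, such leaves could then be mapped to nodes so that spatially close data stays together.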
Collaborative research in various scientific disciplines requires support for scalable data management enabling the efficient correlation of globally distributed data sources. Motivated by the expected data rates of upcoming projects and a growing number of users, communities explore new data management techniques for achieving high throughput. Community-driven data grids deliver such high-throughput data distribution for scientific federations by partitioning data according to application-specific data and query characteristics. Query hot spots are an important and challenging problem in this environment. Existing approaches to load balancing from Peer-to-Peer (P2P) data management and sensor networks do not directly meet the requirements of a data-intensive e-science environment. In this paper, our contributions are partitioning schemes based on multi-dimensional index structures enabling communities to trade off data load balancing and handling query hot spots via splitting and replication. We evaluate the partitioning schemes with two typical kinds of data sets from the astrophysics domain and workloads extracted from Sloan Digital Sky Survey (SDSS) query traces and perform throughput measurements in real and simulated networks. The experiments demonstrate the improved workload distribution capabilities and give promising directions for the development of future community grids.
MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is often skewed. This leads to highly varying execution times for the reducers. Varying execution times result in low resource utilisation and high overall execution time since the next MapReduce cycle can only start after all reducers are done. In this paper we address the problem of efficiently processing MapReduce jobs with complex reducer tasks over skewed data. We define a new cost model that takes into account non-linear reducer tasks and we provide an algorithm to estimate the cost in a distributed environment. We propose two load balancing approaches, fine partitioning and dynamic fragmentation, that are based on our cost model and can deal with both skewed data and complex reduce tasks. Fine partitioning produces a fixed number of data partitions, whereas dynamic fragmentation splits large partitions into smaller portions and replicates data if necessary. Our approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We empirically evaluate our solution on both synthetic data and real data from an e-science application.
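The interplay of a non-linear cost model and fine partitioning can be sketched as follows. This is a hedged illustration, not the algorithm from the paper: it assumes a reducer cost that grows as the tuple count raised to a hypothetical `exponent`, creates more partitions than reducers, and uses a standard greedy heuristic (largest cost first, onto the least loaded reducer). The names `estimated_cost` and `assign_partitions` are invented for this example.

```python
import heapq


def estimated_cost(tuple_count, exponent=2.0):
    # Assumed cost model: reducer work grows polynomially with input size.
    return float(tuple_count) ** exponent


def assign_partitions(partition_sizes, num_reducers, exponent=2.0):
    """Greedily assign partitions to reducers by estimated cost.

    Returns (assignment, loads): partition indices per reducer and the
    summed estimated cost per reducer.
    """
    # Min-heap of (accumulated cost, reducer id).
    heap = [(0.0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}

    # Place the most expensive partitions first.
    order = sorted(
        range(len(partition_sizes)),
        key=lambda p: partition_sizes[p],
        reverse=True,
    )
    for p in order:
        load, reducer = heapq.heappop(heap)
        assignment[reducer].append(p)
        new_load = load + estimated_cost(partition_sizes[p], exponent)
        heapq.heappush(heap, (new_load, reducer))

    loads = {
        r: sum(estimated_cost(partition_sizes[p], exponent) for p in parts)
        for r, parts in assignment.items()
    }
    return assignment, loads


# One heavily skewed partition and eight small ones, two reducers.
sizes = [100, 10, 10, 10, 10, 10, 10, 10, 10]
assignment, loads = assign_partitions(sizes, num_reducers=2)
```

With a quadratic cost, the single large partition dominates the total work, so the heuristic leaves it alone on one reducer and places all small partitions on the other. A partition that large cannot be balanced by assignment alone, which is the situation where splitting it further becomes attractive.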