In this work we address the management of very large data sets, which need to be stored and processed across many computing sites. The motivation for our work is the ATLAS experiment at the Large Hadron Collider (LHC), where the authors have been involved in the development of the data management middleware. This middleware, called DQ2, has been used for the last several years by the ATLAS experiment to ship petabytes of data to research centres and universities worldwide. We describe our experience in developing and deploying DQ2 on the Worldwide LHC Computing Grid, a production Grid infrastructure comprising hundreds of computing sites. From this operational experience, we have identified a significant degree of uncertainty underlying the behaviour of large Grid infrastructures. We analyse this uncertainty in detail, and the analysis leads us to present novel modelling and simulation techniques for Data Grids. In addition, we discuss what we perceive as practical limits to the development of data distribution algorithms for Data Grids given the underlying infrastructure uncertainty, and propose future research directions.

Figure 1. Schematic overview of the LHC accelerator.

The reasons for using multiple computing sites to store and process data include cost and the availability of resources. A single computing site requires the concentration of resources in one location, which is not compatible with large multinational consortia funded by various national agencies. By contrast, the use of distributed computing resources enables data-intensive applications to make opportunistic use of remote computing resources that would otherwise not be available. This distributed computing paradigm is referred to as a Data Grid [3].

Other reasons for storing and processing data across multiple sites include geo-locality and fault tolerance. Geo-locality is the placement of data closer to its users, reducing the network round-trip time required for data access. Fault tolerance in this context refers to the existence of multiple copies of the data, which avoids permanent or temporary loss of access in the event of a catastrophic failure at a site.

Over the past years, the authors have been involved in the development and operation of a distributed data management system for a data-intensive application. This system, called DQ2, is used by the ATLAS experiment [4], which is part of the Large Hadron Collider (LHC) project.

The LHC is a high-energy physics particle accelerator experiment expected to start operation during the summer of 2009 and to remain in production for about 20 years. The LHC particle accelerator extends for 27 km in a ring buried 100 m underground, as illustrated in Figure 1. Along this ring, various detectors observe and record the outcome of high-energy proton collisions. The raw data produced by just one of these detectors, the ATLAS experiment, amounts to tens of petabytes per year. These data...