In this work we address the management of very large data sets, which need to be stored and processed across many computing sites. The motivation for our work is the ATLAS experiment at the Large Hadron Collider (LHC), where the authors have been involved in the development of the data management middleware. This middleware, called DQ2, has been used for the last several years by the ATLAS experiment to ship petabytes of data to research centres and universities worldwide. We describe our experience in developing and deploying DQ2 on the Worldwide LHC Computing Grid, a production Grid infrastructure comprising hundreds of computing sites. From this operational experience, we have identified a significant degree of uncertainty underlying the behaviour of large Grid infrastructures. We analyse this uncertainty in detail, and the analysis leads us to present novel modelling and simulation techniques for Data Grids. In addition, we discuss what we perceive as practical limits to the development of data distribution algorithms for Data Grids given the underlying infrastructure uncertainty, and propose future research directions.

Figure 1. Schematic overview of the LHC accelerator.

The reasons for using multiple computing sites to store and process data include cost and the availability of resources. A single computing site requires the concentration of resources in one location, which is not compatible with large multinational consortia funded by various national agencies. By contrast, the use of distributed computing resources enables data-intensive applications to make opportunistic use of remote computing resources that would otherwise not be available. This distributed computing paradigm is referred to as a Data Grid [3].

Other reasons for storing and processing data across multiple sites include geo-locality and fault tolerance. Geo-locality is the placement of data closer to its users, reducing the network round-trip time required for data access. Fault tolerance in this context refers to the existence of multiple copies of the data, which avoids permanent or temporary loss of access in the event of a catastrophic failure at a site.

Over the past years, the authors have been involved in the development and operation of a distributed data management system for a data-intensive application. This system, called DQ2, is used by the ATLAS experiment [4], which is part of the Large Hadron Collider (LHC) project.

The LHC is a high-energy physics particle accelerator experiment expected to start operation during the summer of 2009 and to remain in production for about 20 years. The LHC particle accelerator extends for 27 km in a ring buried 100 m underground, as illustrated in Figure 1. Along this ring, various detectors observe and record the outcome of high-energy proton collisions. The raw data produced by just one of these detectors, the ATLAS experiment, amounts to tens of petabytes per year. These data...