Production grids are complex and highly variable systems whose behavior is poorly understood and difficult to anticipate. The goal of this study is to estimate the impact of the variability of these infrastructures on the performance of workflow-based applications. A probabilistic model of workflow execution time is proposed and evaluated. Results show that the variability of the EGEE grid infrastructure affects the execution time of a particular medical image analysis application by a factor of 2. The model also provides insights into grid behavior under different application parallelization modes.
I. PERFORMANCE ANALYSIS ON PRODUCTION GRIDS

In many scientific areas, applications with stringent requirements for high-performance computing, large-dataset analysis and complex computation flows have emerged. Driven by these new computational challenges, very large-scale production grid infrastructures have been deployed worldwide. Such widely distributed systems have been operating 24/7 for several years now, providing a sustained high-end computing facility that many applications exploit routinely. The experience gained from exploiting these systems shows that they can hardly be compared to traditional clusters operating on local area networks. For instance, we showed in previous work that setting a timeout on jobs is mandatory on production grids, whereas it is useless on most clusters [1]. Such differences may come from various factors. First, the reliability and homogeneity of clusters and local networks cannot be assumed on grids. Second, grids face highly variable load patterns and race conditions arising from their shared exploitation by large user communities. Finally, the heterogeneity and volatility of grid resources further increase the variability.

Consequently, production grids exhibit hard-to-predict behavior that, from the users' point of view, imposes variable overheads on computations. For instance, over thousands of computation tasks submitted to the EGEE production grid² under the same experimental conditions over a period of months, we observed an average delay of approximately 5 minutes with a standard deviation of the same order of magnitude (5 minutes). For grid applications that submit a very large number of short (less than 1 hour long) jobs in parallel, such overheads are far from negligible.
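To illustrate why a delay whose standard deviation is as large as its mean matters for many parallel jobs, the following is a minimal sketch under assumptions of our own, not the authors' model. It assumes that each job's grid overhead is an independent exponential random variable with a mean of 5 minutes (the exponential is chosen only because its standard deviation equals its mean, roughly matching the figures above), that all jobs have the same computation time, and that enough resources are available for all of them to start together. Under these assumptions, the overhead added to the makespan of a batch of n parallel jobs is the largest of the n delays, whose expectation for exponential variables is the mean multiplied by the n-th harmonic number H_n. The names `MEAN_DELAY_MIN`, `expected_max_delay` and `simulated_max_delay` are hypothetical.

```python
import random

# Hypothetical parameter: mean per-job grid overhead in minutes
# (an average of about 5 minutes was reported for EGEE).
MEAN_DELAY_MIN = 5.0


def expected_max_delay(n_jobs: int, mean: float = MEAN_DELAY_MIN) -> float:
    """Expected largest overhead among n_jobs i.i.d. exponential delays.

    Assumption: delays are independent and exponentially distributed.
    For such variables, E[max] = mean * H_n with H_n = 1 + 1/2 + ... + 1/n.
    """
    harmonic = sum(1.0 / k for k in range(1, n_jobs + 1))
    return mean * harmonic


def simulated_max_delay(n_jobs: int, mean: float = MEAN_DELAY_MIN,
                        trials: int = 500, seed: int = 0) -> float:
    """Monte Carlo estimate of the same expectation, as a sanity check."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # expovariate takes the rate (1 / mean), not the mean itself.
        total += max(rng.expovariate(1.0 / mean) for _ in range(n_jobs))
    return total / trials


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(f"{n:5d} jobs: analytic {expected_max_delay(n):5.1f} min, "
              f"simulated {simulated_max_delay(n):5.1f} min")
```

Under these assumptions the analytic values are about 5.0, 14.6, 25.9 and 37.4 minutes for 1, 10, 100 and 1000 jobs. The overhead of the slowest job grows with the batch size even though the mean delay per job stays at 5 minutes, which is why such overheads weigh heavily on applications made of many short jobs. Real grid latencies need not be exponential or independent, so these figures are only indicative.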
As a result, application computation times (makespans) are hard to forecast, which makes performance analysis on production grids very difficult. In particular, the impact of platform variability on applications should be quantified, as some works have already suggested that it may have a strong negative impact on them [2].

¹ Now affiliated with the University of Amsterdam
² Enabling Grids for E-sciencE, http://www.eu-egee.org

The objective of this paper is to propose a grid application makespan model that (i) aims at explaining the performance of applications on production grids, (ii) makes it possible to study the impact of grid va...