The cloud has emerged as an attractive platform for resource-intensive scientific applications, such as epidemic simulators. However, building and executing such applications in the cloud presents multiple challenges, including exploiting elasticity, handling failures, and simplifying multi-cloud deployment. To address these challenges, this paper proposes a novel service-based framework called DiFFuSE that enables simulation applications with a bag-of-tasks structure to fully exploit cloud platforms. This paper describes how the framework is applied to restructure two legacy applications, simulating the spread of bovine viral diarrhea virus and Mycobacterium avium subspecies paratuberculosis, into elastic cloud-native applications. Experimental results show that the framework enhances application performance and allows exploring different cost-performance trade-offs while supporting automatic failure handling and elastic resource acquisition from multiple clouds.
KEYWORDS
cloud computing, epidemic simulation, high performance computing, simulation models
INTRODUCTION

As global transport networks continue to expand, the risk of infectious disease transmission grows, making research on pathogen spread in humans and animals increasingly important. An essential tool for studying pathogen spread is computer simulation based on mechanistic modeling. However, running epidemic simulations at large scale is time-consuming for multiple reasons. First, the simulation models can capture fine-grained dynamics, considering different population scales (individual, population, and meta-population) and accounting for multiple layouts (eg, combining population and infection dynamics with farm management decisions).1 Second, some processes can be data-driven and exploit large amounts of real-life observed data, mostly concerning the population dynamics of hosts. Third, understanding the underlying biological system requires performing and comparing many simulation scenarios, eg, through a model sensitivity analysis. Finally, producing reliable results demands large numbers of simulation runs when models are stochastic.

Speeding up the execution of large-scale simulations requires parallelism. Parallelism can be exploited at the single-computer level, using multiple cores and processors, as well as at the distributed level, using multiple computers interconnected through a network. Because stochastic simulation runs are independent of one another, they naturally form a bag of tasks that can be executed in any order on the available resources, as illustrated in the sketch at the end of this section. There are currently three main options for obtaining the computing resources needed to run simulations in parallel. The first option is using dedicated high-end multicore systems or clusters. The limitation is that such infrastructures are typically shared among many local users, resulting in long waiting times. At the same time, extending such infrastructures to increase user satisfaction is time-consuming and costly. The second option is grid computing, which allows sharing computers from multiple clusters in multiple sites. The main limitation of grids is the heterogeneity of the software stacks in each cluster, which makes application deployment difficult.
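To make the bag-of-tasks idea concrete, the following minimal Python sketch runs many independent stochastic replicates of a toy SIR model in parallel on a single multicore machine. It is illustrative only: the model, the function run_replicate, and all parameter names are hypothetical and are not part of DiFFuSE, which distributes such tasks over cloud resources rather than local worker processes.

    """Bag-of-tasks sketch: independent stochastic replicates run in parallel.
    Toy SIR model; all names here are illustrative, not taken from DiFFuSE."""
    import random
    from concurrent.futures import ProcessPoolExecutor
    from statistics import mean

    def run_replicate(seed: int) -> int:
        """One stochastic SIR run on a closed population of 1000 hosts.
        Returns the final epidemic size (total number ever infected)."""
        rng = random.Random(seed)          # per-task RNG keeps tasks reproducible
        s, i, r = 999, 1, 0                # susceptible, infected, recovered
        beta, gamma, n = 0.3, 0.1, 1000    # transmission and recovery rates
        while i > 0:
            new_inf = sum(1 for _ in range(s) if rng.random() < beta * i / n)
            new_rec = sum(1 for _ in range(i) if rng.random() < gamma)
            s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        return r                           # all ever-infected hosts end up recovered

    if __name__ == "__main__":
        N_REPLICATES = 200
        # Each replicate is an independent task: no ordering or communication
        # constraints, so a pool of workers can execute them in any order.
        with ProcessPoolExecutor() as pool:
            sizes = list(pool.map(run_replicate, range(N_REPLICATES)))
        print(f"mean final size over {N_REPLICATES} runs: {mean(sizes):.1f}")

In a cloud setting, the local process pool would be replaced by distributed workers pulling tasks from a shared queue, which is the kind of restructuring the framework described in this paper automates.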