Zhiquan Sui scite author profile

Malensek

ACM Trans. Auton. Adapt. Syst.

et al. 2015

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of events and conditions provides a more nuanced model, but also increases its computational footprint. To manage these processing requirements in a scalable manner, discrete event simulations can be distributed across multiple computing resources. Orchestrating the simulations in a distributed setting involves coping with resource uncertainty. We consider three key aspects of resource uncertainty: resource failures, heterogeneity, and slowdowns. Each of these aspects is managed autonomously, which involves making accurate predictions of future execution times and latencies while also accounting for differences in hardware capabilities and dynamic resource consumption profiles. Further complicating matters, individual tasks within the simulation are stateful and stochastic, requiring inter-task communication and synchronization to produce accurate outcomes. We deal with these challenges through intelligent state collection and migration, active resource monitoring, and empirical evaluation of resource capabilities under changing conditions. To underscore the viability of our solution, we provide benchmarks using a production discrete event simulation that can simultaneously sustain failures, manage resource heterogeneity, and handle slowdowns while being orchestrated by our framework.

Autonomous, failure-resilient orchestration of distributed discrete event simulations

Malensek

et al. 2013

Discrete event simulations model the behavior of complex, real-world systems. Simulating a wide range of relevant events and conditions naturally provides a more accurate model, but also increases the computational workload associated with the simulation. To manage these processing requirements in a scalable manner, a discrete event simulation can be distributed across a number of computing resources. However, individual tasks in the simulation are stateful, and therefore require inter-task communication and synchronization to produce an accurate model. This property not only complicates the orchestration of the discrete event simulation in a distributed setting, but also makes providing reliable, fault-tolerant execution a challenge, especially when compared to conventional distributed fault tolerance schemes. In this paper, we propose an autonomous agent that provides fault tolerance functionality for discrete event simulations by predicting state changes in the simulation and adjusting its fault tolerance policy accordingly. This allows the system to avoid negatively impacting overall execution times while preserving reliability guarantees. To underscore the viability of our solution, we provide benchmarks of a production discrete event simulation that can sustain failures while running under the supervision of our fault tolerance framework.

On the distributed orchestration of stochastic discrete event simulations

Concurrency and Computation

Pallickara

2013

Discrete event simulations are a powerful technique for modeling stochastic systems with multiple components where interactions between these components are governed by the probability distribution functions associated with them. Complex discrete event simulations are often computationally intensive with long completion times. This paper describes our solution to the problem of orchestrating the execution of a stochastic, discrete event simulation where computational hot spots evolve spatially over time. Our performance benchmarks report on our ability to balance computational loads in these settings.

The Unix version of NAADSM was written for a batch computing environment in which the modeler submits a job, and later receives completed outputs from the job, with no interaction in between. In contrast, Granules is a highly asynchronous framework in which computations may receive messages over their communication streams at any time. This mismatch was resolved by having the worker component, which is coded in Java, handle the asynchronous communications and act as a buffer for incoming messages. Communication between the worker component and the NAADSM executable is then accomplished synchronously, at specific points in NAADSM's run loop, via the stdin and stdout streams of the NAADSM processes. Consistent with our goal of loose coupling, this avoids the need to make deep changes in the NAADSM code (such as transforming it into a multithreaded program) just to support interaction with Granules.

Partitioning the simulation by geography presented a challenge. Some interactions between farms do not have distance constraints associated with them, owing to the model's original scope of

Figure 3. Interactions between the controller and the workers.
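
The worker arrangement described above (asynchronous messages buffered by the worker, then handed to the simulation process synchronously over stdin and stdout) can be illustrated with a minimal sketch. This is not the authors' Java worker and uses neither Granules nor NAADSM: it is written in Python for brevity, the class name `BufferingWorker`, the `;`-joined line format, and the `ack:` reply are all hypothetical, and a small echo script stands in for the simulation executable.

```python
import queue
import subprocess
import sys

# Hypothetical stand-in for the simulation executable: it reads one line
# per synchronization point from stdin and answers with one line on stdout.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    sys.stdout.write("ack:" + line)
    sys.stdout.flush()
"""


class BufferingWorker:
    """Buffers asynchronous messages and relays them at sync points."""

    def __init__(self):
        # Thread-safe buffer for messages that may arrive at any time.
        self.inbox = queue.Queue()
        # The child process is reached only through its stdin and stdout.
        self.proc = subprocess.Popen(
            [sys.executable, "-c", CHILD_SOURCE],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def on_message(self, msg):
        """Asynchronous side: store the message, never touch the child."""
        self.inbox.put(msg)

    def sync_point(self):
        """Synchronous side: drain the buffer, send one line, await one reply."""
        batch = []
        while True:
            try:
                batch.append(self.inbox.get_nowait())
            except queue.Empty:
                break
        self.proc.stdin.write(";".join(batch) + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Signal end of input and wait for the child to exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()
```

In this sketch the child process stays single-threaded and unaware of the messaging layer, which mirrors the loose coupling the passage describes: only the worker deals with asynchronous delivery.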

A Survey of Load Balancing Techniques for Data Intensive Computing

Pallickara

2011

Learning Based Distributed Orchestration of Stochastic Discrete Event Simulations

Pallickara

2014