With the recent switch in the design of general-purpose processors from frequency scaling of a single core towards increasing the number of cores, parallel programming has become important not only for scientific computing but also for general-purpose programming. This shift also stressed the importance of programmability in existing parallel programming models, which were primarily designed for performance. It was soon recognized that new programming models were needed to make parallel programming accessible not only to experts, but to the general programming community. Transactional Memory (TM) is an example that follows this premise. It improves dramatically over previous synchronization mechanisms in terms of programmability and composability, at the price of possibly reduced performance. The main source of performance degradation in Transactional Memory is the overhead of transactional execution. Our work on parallelizing the Quake game engine is a clear example of this problem. We show that Software Transactional Memory is superior in terms of programmability compared to lock-based programming, but that performance is hindered by the extreme amount of overhead introduced by transactional execution. In the meantime, a significant research effort has been invested in overcoming this problem. Our approach aims to improve the performance of transactional code by reducing transactional data conflicts. The idea is based on an organization of the code in which highly conflicting data is promoted to dataflow tokens that coordinate the execution of transactions. The main contribution of this thesis is the Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ that integrates dataflow abstractions into the shared-memory programming model. The ADF model provides language constructs that allow a programmer to delineate a program into a set of tasks and to explicitly define data dependencies for each task.
The task dependency information is conveyed to the ADF runtime system, which constructs a dataflow task graph that governs the execution of the program. Additionally, the ADF model allows tasks to share data. The key idea is that computation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory, which guarantees the atomicity of shared-memory updates. The second contribution of this thesis is DaSH, the first comprehensive benchmark suite for hybrid dataflow and shared-memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs, which capture patterns of communication and computation common to a wide range of emerging applications. DaSH includes sequential and shared-memory implementations based on OpenMP and TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared-memory implementations. We use DaSH not only to evaluate the ADF model, but also to compare it with two other hybrid dataflow models in order to identify the advantages and shortcomings of such models and to motivate further research on their characteristics. Finally, we study the applicability of hybrid dataflow models for parallelization of the game engine. We show that hybrid dataflow models decrease the complexity of the parallel game engine implementation by eliminating or restructuring the explicit synchronization that is necessary in shared-memory implementations. The corresponding implementations also exhibit good scalability and better speedup than the shared-memory parallel implementations, especially in the case of a highly congested game world that contains a large number of game objects. Ultimately, on an eight-core machine we were able to obtain a 4.72x speedup over the sequential baseline and to improve by 49% over the lock-based parallel implementation based on work-sharing.
Abstract. The inherent difficulty of thread-based shared-memory programming has recently motivated research in high-level, task-parallel programming models. Recent advances in task-parallel models add implicit synchronization, where the system automatically detects and satisfies data dependencies among spawned tasks. However, dynamic dependence analysis incurs significant runtime overheads, because the runtime must track task resources and use this information to schedule tasks while avoiding conflicts and races. We present SCOOP, a compiler that effectively integrates static and dynamic analysis in code generation. SCOOP combines context-sensitive points-to, control-flow, escape, and effect analyses to remove redundant dependence checks at runtime. Our static analysis can work in combination with existing dynamic analyses and task-parallel runtimes that use annotations to specify tasks and their memory footprints. We use our static dependence analysis to detect non-conflicting tasks and an existing dynamic analysis to handle the remaining dependencies. We evaluate the resulting hybrid dependence analysis on a set of task-parallel programs.
By introducing asynchronous lambdas, many programming languages have leaped ahead in the race for programmable manycore systems, leaving the operating system and its scheduler behind. Instead of hiding application-inherent parallelism behind pools of threads with opaque behavior, asynchronous lambdas allow programmers to explicitly state which parts of a program can be executed in parallel and when this form of parallelism is available. Introducing stretch as a universal performance metric and externalizing part of the lambda-provided knowledge not only to the runtime but also to the operating system scheduler, this paper tries to lay the foundation for OS scheduling to catch up on the road towards heterogeneous elastic manycore systems.