Abstract. We present BDDT, a task-parallel runtime system that dynamically discovers and resolves dependencies among parallel tasks. BDDT allows the programmer to specify detailed task footprints on any memory address range, multidimensional array tile, or dynamic region. BDDT uses a block-based dependence analysis with arbitrary granularity. The analysis is applicable to existing C programs without restructuring object or array allocation, and provides flexibility in array layouts and tile dimensions. We evaluate BDDT using a representative set of benchmarks and compare it to SMPSs (the equivalent runtime system in StarSs) and OpenMP. BDDT performs comparably to or better than SMPSs and is able to cope with task granularity as much as one order of magnitude finer than SMPSs. Compared to OpenMP, BDDT performs up to 3.9× better for benchmarks that benefit from dynamic dependence analysis. BDDT provides additional data annotations to bypass dependence analysis. Using these annotations, BDDT also outperforms OpenMP in benchmarks where dependence analysis does not discover additional parallelism, thanks to a more efficient implementation of the runtime system.
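To make the annotation style concrete, the sketch below shows a blocked matrix multiply written with StarSs/SMPSs-style task pragmas, where each task declares the tiles it reads and writes and the runtime derives dependences from overlapping footprints. The pragma spelling, footprint notation, and tile sizes are illustrative assumptions only; they do not reproduce BDDT's actual interface. Unknown pragmas are ignored by standard C compilers, so the code compiles and runs serially as plain C.

```c
/* Sketch only: the pragma spelling and clause names are hypothetical,
 * written in the general style of SMPSs/StarSs task annotations; they
 * do not reproduce BDDT's actual syntax. */
#define N 512
#define B 64   /* tile size, i.e., the block size of dependence analysis */

/* The task footprint names the tiles that are read (input) and
 * read-written (inout); the runtime infers dependences from overlaps. */
#pragma task input(a[i;B][k;B], b[k;B][j;B]) inout(c[i;B][j;B])
static void gemm_tile(double (*a)[N], double (*b)[N], double (*c)[N],
                      int i, int j, int k)
{
    for (int ii = i; ii < i + B; ii++)
        for (int jj = j; jj < j + B; jj++)
            for (int kk = k; kk < k + B; kk++)
                c[ii][jj] += a[ii][kk] * b[kk][jj];
}

void blocked_matmul(double (*a)[N], double (*b)[N], double (*c)[N])
{
    for (int i = 0; i < N; i += B)
        for (int j = 0; j < N; j += B)
            for (int k = 0; k < N; k += B)
                /* Each call would spawn a task; tasks updating the same
                 * C tile are serialized, disjoint tiles run in parallel. */
                gemm_tile(a, b, c, i, j, k);
    #pragma taskwait   /* hypothetical barrier for all outstanding tasks */
}
```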
Abstract. Future multi-core processors will necessitate exploitation of fine-grain, architecture-independent parallelism from applications to utilize many cores with relatively small local memories. We use c264, an end-to-end H.264 video encoder for the Cell processor based on x264, to show that exploiting fine-grain parallelism remains challenging and requires significant advancement in runtime support. Our implementation of c264 achieves speedups between 4.7× and 8.6× on six synergistic processing elements (SPEs), compared to the serial version running on the power processing element (PPE). We find that the programming effort associated with efficient parallelization of c264 at fine granularity is highly non-trivial. Hand optimizations may improve performance significantly but are eventually limited by the code restructuring they require. We assess the complexity of exploiting fine-grain parallelism in realistic applications by identifying optimizations of c264 and the effort they require.
Processor architecture has taken a turn toward many-core processors, which integrate multiple processing cores on a single chip to increase overall performance, and there are no signs that this trend will stop in the near future. Many-core processors are harder to program than multi-core and single-core processors because they require writing parallel or concurrent programs with high degrees of parallelism. Moreover, many-cores have to operate in a mode of strong scaling because of memory bandwidth constraints. In strong scaling, increasingly finer-grain parallelism must be extracted in order to keep all processing cores busy.

Task dataflow programming models have a high potential to simplify parallel programming because they relieve the programmer of identifying precisely all inter-task dependences when writing programs. Instead, the task dataflow runtime system detects and enforces inter-task dependences during execution based on the description of the memory accessed by each task. The runtime constructs a task dataflow graph that captures all tasks and their dependences. Tasks are scheduled to execute in parallel, taking into account the dependences specified in the task graph.

Several papers report significant overheads for task dataflow systems, which severely limit the scalability and usability of such systems. In this article, we study efficient schemes to manage task graphs and analyze their scalability. We assume a programming model that supports input, output, and in/out annotations on task arguments, as well as commutative in/out and reductions. We analyze the structure of task graphs and identify versions and generations as key concepts for their efficient management. We then present three schemes to manage task graphs, building on graph representations, hypergraphs, and lists. We also consider a fourth, edgeless scheme that synchronizes tasks using integers. Analysis using microbenchmarks shows that the graph representation is not always scalable and that the edgeless scheme introduces the least overhead in nearly all situations.
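As a rough illustration of integer-based synchronization, the sketch below shows one conventional ticket-counter approach in C11: each shared object counts writer tasks issued and completed, and a task becomes ready once every argument's earlier writers have finished, without storing any dependence edges. All names and the exact bookkeeping are assumptions for illustration; this is not the edgeless scheme evaluated in the article, and reader tracking (needed for write-after-read ordering) is omitted for brevity.

```c
/* Minimal sketch of ticket-style, edge-free task synchronization.
 * Hypothetical data structures; not the scheme from the article. */
#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_long writers_issued;   /* writer tickets handed out so far */
    atomic_long writers_done;     /* writer tasks that have completed */
} obj_sync_t;

typedef struct {
    obj_sync_t *obj;              /* shared object accessed by the task */
    long        wait_ticket;      /* earlier writers that must finish first */
    bool        is_writer;
} access_t;

/* At task-creation time (tasks are created in program order): record how
 * many earlier writers this access must wait for; writers also take a
 * ticket so later tasks will wait for them in turn. */
static void register_access(access_t *a, obj_sync_t *o, bool writer)
{
    a->obj = o;
    a->is_writer = writer;
    a->wait_ticket = writer
        ? atomic_fetch_add(&o->writers_issued, 1)  /* old value = writers before us */
        : atomic_load(&o->writers_issued);         /* wait for all current writers */
}

/* A task is ready when, for every argument, all earlier writers are done. */
static bool task_ready(const access_t *acc, int n)
{
    for (int i = 0; i < n; i++)
        if (atomic_load(&acc[i].obj->writers_done) < acc[i].wait_ticket)
            return false;
    return true;
}

/* At task completion: advance the completed-writer counter for outputs,
 * which implicitly releases successor tasks without explicit edges. */
static void task_complete(const access_t *acc, int n)
{
    for (int i = 0; i < n; i++)
        if (acc[i].is_writer)
            atomic_fetch_add(&acc[i].obj->writers_done, 1);
}
```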