Power Density-Aware Resource Management for Heterogeneous Tiled Multicores

Khdr, Heba; Pagani, Santiago; Sousa, Ericles; Lari, Vahid; Pathania, Anuj; Hannig, Frank; Shafique, Muhammad; Teich, Jürgen; Henkel, Jörg

doi:10.1109/tc.2016.2595560

Cited by 58 publications

(20 citation statements)

References 27 publications


“…Most designs stick to more basic accelerator interaction and memory sharing models [6] [7], where the shared data is placed in contiguous memory using a specific userspace API [18] or by replacing the standard malloc() with a customized implementation [19]. Address translation is performed explicitly by the host as part of the DMA transfer preparation from contiguous main memory to the accelerator's local memories [14] [29] [30].…”

Section: Related Work (mentioning)

confidence: 99%

“…The focus of previous work on SVM for FPGA accelerators lies on reducing the TLB service time by using either a soft processor [22], [24] or dedicated hardware [18], [19], [27], [28], [32], [37], [54] for managing the TLB with a size of 64 entries at most. As opposed to letting the host… [Footnote 7: The L2 TLB alone uses more than 60% of the resources.]…”

Section: Alternative SVM Designs (mentioning)

confidence: 99%

“…In the embedded systems world, the situation is different [6] [7]. While for quite some time the major FPGA vendors have had devices on the market that combine multicore, general-purpose host processors with FPGA fabrics in heterogeneous systems-on-chip (SoCs) [8] [9], shared virtual memory (SVM) is still not widely adopted.…”

Section: Introduction (mentioning)

confidence: 99%


Exploring Shared Virtual Memory for FPGA Accelerators with a Configurable IOMMU

Vogel, Marongiu, Benini (2019). IEEE Trans. Comput.


A key enabler for the ever-increasing adoption of FPGA accelerators is the availability of frameworks allowing for the seamless coupling to general-purpose host processors. Embedded FPGA+CPU systems still heavily rely on copy-based host-to-accelerator communication, which complicates application development. In this paper, we present a hardware/software framework for enabling transparent, shared virtual memory for FPGA accelerators in embedded SoCs. It can use a hard-macro IOMMU if available, or a configurable soft-core IOMMU that we provide. We explore different TLB configurations and provide a comparison with other designs for shared virtual memory to gain insight on performance-critical IOMMU components. Experimental results using pointer-rich benchmarks show that our framework not only simplifies FPGA-accelerated application development, it also achieves up to 13x speedup compared to traditional copy-based offloading.


“…Today's novel multi-cores allow to scale the frequency and voltage of each core independently, opening novel opportunities for fine-grained DTM solutions [1]. Operating systems use reactive controllers to maintain the processors under a critical temperature, while several approaches in the state-of-the-art explore proactive approaches to improve DTM performances [2], [3], [4], [5].…”

Section: Introduction (mentioning)

confidence: 99%

Prediction horizon vs. efficiency of optimal dynamic thermal control policies in HPC nodes

Cesarini, Bartolini, Benini (2017). 2017 IFIP/IEEE International Conference on Very Large Scale Integration (VLSI-SoC)


We are entering the era of thermally-bound computing: Advanced and costly cooling solutions are needed to sustain the high computing densities of high-performance computing equipment. To reduce cooling costs and cooling overprovisioning, dynamic thermal management (DTM) strategies aim at controlling the device temperature by modulating online the performance of processing elements. While operating systems allow the migration of threads between cores, in HPC systems the threads of parallel applications are pinned to the allocated cores at start-time to avoid job-migration overheads. In this scenario state-of-the-art DTM solutions, which use thermal models to map jobs to cores, are based on long-term predictions to map the most critical job to the coldest core. Instead, turbo-mode and DVFS controllers are based on short-term predictions to squeeze the thermal capacitance allowing for short period performance boosts which are thermally unsustainable. In this work we propose an integer-linear programming formulation and a fast solver for controlling, at the same time, the job mapping and cores frequency selections in HPC nodes, tested with real supercomputer workload. Our approach can be integrated with the MPI runtimes and OpenMP libraries and is capable of assigning high-performance cores to performance-critical threads. We show that by combining long and short term predictions with information of the programming model we can significantly improve the performance of final application w.r.t. state-of-the-art DTM solutions.


“…With the increasing density of circuits, the very small and close transistors cannot dissipate heat fast enough and circuits can be damaged. Khdr et al [86] and Sousa et al [133] propose runtime adaptation systems that reconfigure actor-based application so that the underlying platform stays below a power and/or temperature budget. In [133], they dynamically adjust the quality or bitrate of video encoding to fit into constraints while still providing good processing quality.…”

Section: Conclusion Chapter 5 Scheduling (mentioning)

confidence: 99%

Algorithms and Framework for Energy Efficient Parallel Stream Computing on Many-Core Architectures

Melot


The rise of many-core processor architectures in the market answers to a constantly growing need of processing power to solve more and more challenging problems such as the ones in computing for big data. Fast computation is more and more limited by the very high power required and the management of the considerable heat produced. Many programming models compete to take profit of many-core architectures to improve both execution speed and energy consumption, each with their advantages and drawbacks. The work described in this thesis is based on the dataflow computing approach and investigates the benefits of a carefully pipelined execution of streaming applications, focusing in particular on off-and on-chip memory accesses. As case study, we implement classic and on-chip pipelined versions of mergesort for Intel SCC and Xeon. We see how the benefits of the on-chip pipelining technique are bounded by the underlying architecture, and we explore the problem of fine tuning streaming applications for many-core architectures to optimize for energy given a throughput budget. We propose a novel methodology to compute schedules optimized for energy efficiency given a fixed throughput target. We introduce Drake, derived from Schedeval, a tool that generates pipelined applications for Many-Core architectures and allows the performance testing in time or energy of their static schedule. We show that streaming applications based on Drake compete with specialized implementations and we use Schedeval to demonstrate performance differences between schedules that are otherwise considered as equivalent by a simple model. This work has been supported in parts by CUGS (the Graduate School in Computer Science, Sweden), Vetenskapsrådet, SeRC and EU FP7 EXCESS. 
Department of Computer and Information Science, Linköping University, SE-581 83 Linköping, Sweden

Acknowledgements

I would like to thank all members of PELAB for the inspiring and stimulating working environment as well as the passionate discussions around coffee or tea. In particular, my thanks to Kristian Sandahl for his efforts at maintaining a strong group culture. Warm thanks to Christoph Kessler, for all fruitful discussions and ideas when my imagination came short, for his support and patience when results came late and for entrusting me with opportunities and responsibilities that taught me valuable experiences. Many thanks to Jörg Keller for sharing his ideas in detail about my work, and for always carefully reviewing and challenging any hypothesis I proposed, as it repeatedly resulted in strengthening good ideas and discarding bad ones. I would like to thank all students who participated in the work, providing precious help to progress. I am grateful to Intel for providing the Single Chip Cloud computer research prototype and their efforts to help in the numerous moments where nothing worked. Many thanks to the National Supercomputer Center (NSC) for providing powerful computing means to run my experiments, TUS for always providing a quick and precious technica...
