2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2016.49
CATA: Criticality Aware Task Acceleration for Multicore Processors

Abstract: Managing criticality in task-based programming models opens a wide range of performance and power optimization opportunities in future manycore systems. Criticality-aware task schedulers can benefit from these opportunities by scheduling tasks to the most appropriate cores. However, these schedulers may suffer from priority inversion and static binding problems that limit their expected improvements. Based on the observation that task criticality information can be exploited to drive hardware reconfigu…

Cited by 19 publications (22 citation statements)
References 51 publications
“…Research on DVFS with parallel workloads has focused on using DVFS to accelerate the critical path in applications [8], [10] or to improve the overall energy efficiency of a system when running big data workloads [14]. In this work, we focus on parallel workloads in a widely-used runtime such as OpenMP and with several goals: reduce execution time, reduce EDP, and reduce power consumption.…”
Section: Dynamic Voltage and Frequency Scaling
confidence: 99%
“…The runtime system dynamically schedules tasks when all their inputs are ready and, when the execution of a task finishes, its outputs become ready for the next tasks. This model decouples the hardware from the application, enabling many optimizations at the runtime system level in a generic and application-agnostic way [2,10,11,25,30,41].…”
Section: Task-based Programming Models
confidence: 99%
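The data-flow scheme this statement describes — a task becomes runnable only once all of its inputs have been produced, and its outputs in turn enable successor tasks — can be illustrated with a minimal sketch. This is not the paper's runtime; the function and task names below are invented for the example, and a real runtime would execute ready tasks concurrently rather than sequentially:

```python
# Minimal, illustrative data-flow task scheduler (not the paper's runtime).
# Each task declares the data items it reads (inputs) and writes (outputs);
# a task is runnable once every one of its inputs has been produced.

def run_dataflow(tasks):
    """tasks: list of (name, inputs, outputs) tuples. Returns execution order."""
    produced = set()        # data items already produced by finished tasks
    pending = list(tasks)
    order = []
    progress = True
    while pending and progress:
        progress = False
        for task in list(pending):
            name, inputs, outputs = task
            if all(i in produced for i in inputs):  # all inputs ready?
                order.append(name)                  # "execute" the task
                produced.update(outputs)            # outputs become available
                pending.remove(task)
                progress = True
    return order

# A tiny task graph: t1 produces 'a'; t2 and t3 consume 'a'; t4 joins 'b' and 'c'.
graph = [
    ("t4", ["b", "c"], ["d"]),
    ("t1", [],         ["a"]),
    ("t2", ["a"],      ["b"]),
    ("t3", ["a"],      ["c"]),
]
print(run_dataflow(graph))  # prints ['t1', 't2', 't3', 't4']
```

Note that the declaration order of tasks in `graph` does not matter: `t4` is listed first but runs last, because the scheduler is driven purely by data availability, which is what decouples the application from the hardware.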
“…With this information the runtime system manages the parallel execution following a data-flow scheme, scheduling tasks to cores and taking care of synchronization between tasks. Decoupling the application from the architecture eases programmability and allows leveraging the runtime system information to drive optimizations in a generic and application-agnostic way [2,10,11,25,30,41].…”
Section: Introduction
confidence: 99%
“…The input and output information allows the runtime system to transparently manage GPUs [47], [58], stacked DRAM memories [59], multi-node clusters [60], and scratchpad memories [61]. With some additional hardware support, the runtime system can do value approximation [62], software-guided prefetching [13], dead block prediction [16], accelerate critical tasks [63], reduce coherence traffic in CC-NUMA systems [64], [65], and optimise communications in producer-consumer task relationships [66].…”
Section: Task-based Programming Models
confidence: 99%