Thermal stress, including temperature gradients in time and space as well as thermal cycling, influences the lifetime reliability and performance of modern Multiprocessor Systems-on-Chip (MPSoCs). Conventional power and temperature management techniques that consider only the peak temperature or power consumption do not provide a comprehensive solution for avoiding high spatial and temporal thermal variations. This work presents TheSPoT, a novel multi-level thermal stress-aware power and thermal management approach for MPSoCs. At the top level, core consolidation and deconsolidation are performed based on peak temperature, thermal stress, and power consumption constraints. These constraints are also used at the next level, where operating frequencies are determined. At this level, we obtain optimal core frequencies by solving a convex optimization problem. To reduce the runtime overhead in large MPSoCs, we also propose a fast heuristic algorithm as an alternative. The efficacy of the proposed approaches in reducing thermal cycles and temporal/spatial temperature gradients is evaluated by comparing the results with state-of-the-art methods. The evaluation, performed on 4-core, 8-core, and 16-core MPSoCs using PARSEC benchmarks, reveals a considerable reduction in thermal stress. For the 8-core MPSoC case study, the proposed heuristic (optimal) approach improves the mean time to failure by 47% (35%) on average compared to the state-of-the-art techniques, with only 6% (4%) performance degradation. Our simulations also show that TheSPoT is more efficient at reducing thermal stress when more heterogeneous workloads are used.
Run-time profiling of software applications is key to energy efficiency. Even the most optimized hardware combined with optimally designed software may become inefficient if operated poorly. Moreover, the diversification of modern computing platforms and the broadening of their run-time configuration space make the task of optimally operating software ever more complex. With the growing financial and environmental impact of data center operation and cloud-based applications, optimal software operation becomes increasingly relevant to existing and next-generation workloads. To guide software operation towards energy savings, energy and performance data must be gathered to provide a meaningful assessment of the application behavior under different system configurations, a need that existing tools do not appropriately address. In this work, we present Containergy, a new performance evaluation and profiling tool that uses software containers to perform application run-time assessment, providing energy and performance profiling data with negligible overhead (below 2%). It is focused on energy efficiency for next-generation workloads. Practical experiments with emerging workloads, such as video transcoding and machine-learning image classification, are presented. The profiling results are analyzed in terms of performance and energy savings from a Quality-of-Service (QoS) perspective. For video transcoding, we verified that wrong choices in the configuration space can lead to an increase above 300% in energy consumption for the same task and operational levels. In the image classification case study, the results show that the choice of machine-learning algorithm and model significantly affects energy efficiency. Profiling datasets of AlexNet and SqueezeNet, which present similar accuracy, indicate that the latter achieves an energy saving of 55.8% compared to the former.
In this work, we propose a power and thermal management algorithm based on machine learning to control the thermal stress and power consumption of heterogeneous MPSoCs. The objectives of the proposed algorithm are to increase performance and to decrease the spatial and temporal temperature gradients, along with thermal cycling, under power and temperature constraints. The proposed method uses a heuristic approach to speed up the convergence of the machine learning algorithm, which makes it applicable to general-purpose processors. With Q-Learning as the machine learning algorithm, the heuristic helps limit the learning space by suggesting the most appropriate actions to the agent in each decision epoch. The heuristic algorithm uses the current and previous states of the learning agent, as well as the thermal stress and power consumption of each core, to determine the appropriate action for each core independently. The proposed algorithm is evaluated on 4-core, 8-core, and 16-core homogeneous and heterogeneous MPSoCs for a selection of benchmarks from the Splash2 benchmark package. The results reveal faster convergence of the machine learning algorithm and a greater reduction in thermal stress.
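The idea of a heuristic that narrows the action set before a Q-Learning agent chooses can be sketched as follows. This is a minimal illustration and not the authors' algorithm: the action encoding (`ACTIONS`), the pruning rule in `allowed_actions`, the thresholds, and all function names are hypothetical assumptions. Only the tabular Q-Learning update itself is standard.

```python
import random

# Hypothetical action encoding: lower, keep, or raise a core's frequency level.
ACTIONS = (-1, 0, 1)

def allowed_actions(temp_stress, power, stress_limit, power_limit):
    """Hypothetical heuristic: if a constraint is violated, do not allow a frequency increase."""
    if temp_stress > stress_limit or power > power_limit:
        return (-1, 0)
    return ACTIONS

def choose_action(q, state, candidates, epsilon, rng):
    """Epsilon-greedy selection restricted to the candidates the heuristic suggested."""
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=lambda a: q.get((state, a), 0.0))

def q_update(q, state, action, reward, next_state, next_candidates,
             alpha=0.5, gamma=0.9):
    """Standard tabular Q-Learning update; the max runs over the allowed next actions only."""
    best_next = max(q.get((next_state, a), 0.0) for a in next_candidates)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

Because each core would call `allowed_actions` with its own stress and power readings, the agent explores a smaller set of state-action pairs per decision epoch, which is one plausible reason such a heuristic could speed up convergence.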
Next-generation High-Performance Computing (HPC) systems need to provide outstanding performance with unprecedented energy efficiency while maintaining servers at safe thermal conditions. Air cooling presents important limitations when employed in HPC infrastructures. Instead, two-phase on-chip cooling combines the small footprint area and large heat exchange surface of micro-channels with extremely high heat transfer performance, and allows for waste heat recovery. When relying on gravity to drive the flow to the heat sink, the system is called a closed-loop two-phase thermosyphon. Previous research work focused either on the development of large-scale proof-of-concept thermosyphon demonstrators or on the development of numerical models able to simulate their operation. In this work, we present a new ultra-compact microscale thermosyphon design for high heat flux components. We manufactured a working prototype, 8 cm in height, tailored for Virtex 7 FPGAs with a heat spreader area of 45 mm × 45 mm, and we validate its performance via measurements. The measured results are compared with our simulator, which matches the thermal performance of the thermosyphon with an error of less than 3.5%. Our prototype is able to work over the full power range of the Virtex 7, dissipating up to 60 W while keeping the chip temperature below 60 °C. The prototype will next be deployed in a 10 kW rack as part of an HPC prototype, with an expected Power Usage Effectiveness (PUE) below 1.05.
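For context on the PUE target, PUE is the ratio of total facility power to the power delivered to IT equipment. As a rough illustration, assuming the 10 kW figure refers to the rack's IT load (the abstract does not state this explicitly), a PUE below 1.05 would bound the non-IT overhead, such as cooling and power distribution, as follows:

```latex
\mathrm{PUE} = \frac{P_{\text{total}}}{P_{\text{IT}}}, \qquad
P_{\text{overhead}} = P_{\text{total}} - P_{\text{IT}}
  = (\mathrm{PUE} - 1)\,P_{\text{IT}}
  < 0.05 \times 10\,\text{kW} = 0.5\,\text{kW}
```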
Real-time video transcoding has recently emerged as a valid alternative to address the ever-increasing demand for video content in server infrastructures in current multiuser environments. High Efficiency Video Coding (HEVC) makes efficient online transcoding feasible, as it enhances user experience by providing the adequate video configuration, reduces pressure on the network, and minimizes inefficient and costly video storage. However, the computational complexity of HEVC, together with its myriad configuration parameters, raises challenges for power management, throughput control, and Quality of Service (QoS) satisfaction. This is particularly challenging in multiuser environments, where multiple users with different resolution demands and bandwidth constraints need to be served simultaneously. In this work, we present MAMUT, a multiagent machine learning approach to tackle these challenges. Our proposal breaks the design space, composed of run-time adaptation of the transcoder and system parameters, into smaller sub-spaces that can be explored in a reasonable time by individual agents. While working cooperatively, each agent is in charge of learning and applying the optimal values for internal HEVC and system-wide parameters. In particular, MAMUT dynamically tunes the Quantization Parameter, selects the number of threads per video, and sets the operating frequency, with throughput and video quality objectives under compression and power consumption constraints. We implement MAMUT on an enterprise multicore server and compare equivalent scenarios to state-of-the-art alternative approaches. The obtained results reveal that MAMUT consistently attains up to 8x improvement in terms of FPS violations (and thus Quality of Service), 24% power reduction, as well as faster and more accurate adaptation both to the video content and to the available resources.