Ab Al-Hadi Ab Rahman scite author profile

Future many-core systems need to handle high power density and chip temperature effectively. Some cores in many-core systems need to be turned off, or left ‘dark’, to manage chip power and thermal density. This phenomenon is known as the dark silicon problem, and it prevents many-core systems from gaining improved performance from a large number of processing cores. This paper presents a dynamic thermal-aware performance optimization of dark silicon many-core systems (DTaPO) technique for optimizing the performance of a dark silicon many-core system under a temperature constraint. The proposed technique uses both task migration and dynamic voltage frequency scaling (DVFS) to optimize the performance of a many-core system while keeping the system temperature within a safe operating limit. Task migration puts hot cores in low-power states and moves their tasks to cooler dark cores to aggressively reduce chip temperature while maintaining high overall system performance. To reduce the task migration overhead due to cold start, the source core (i.e., the active core) keeps its L2 cache content during the initial migration phase, and the destination core (i.e., the dark core) can access it to reduce the impact of cold start misses. Moreover, the proposed technique limits task migration to cores that share the last level cache (LLC). In the case of a major thermal violation with no cooler cores available, DVFS is used to reduce the temperature of the hot cores gradually by reducing their frequency. Experimental results for different threshold temperatures show that DTaPO can keep the average system temperature below the thermal limit. The execution time penalty is reduced by up to 18% compared with using only DVFS for all thermal thresholds, and the average peak temperature is reduced by up to 10.8°C. In addition, the experimental results show that DTaPO improves the system’s performance by up to 80% compared to optimal sprinting patterns (OSP) and reduces the temperature by up to 13.6°C.
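
The decision order described in this abstract (migrate to a cooler dark core in the same LLC group first, fall back to DVFS otherwise) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Core` fields, the `FREQ_LEVELS_MHZ` operating points, and the single-threshold test are all assumptions made for the example, and the L2 cache retention step is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical DVFS operating points; the paper's actual levels are not given here.
FREQ_LEVELS_MHZ = [1000, 1500, 2000]


@dataclass
class Core:
    core_id: int
    llc_group: int               # cores with the same value share a last level cache
    temp_c: float                # current temperature in degrees Celsius
    freq_level: int              # index into FREQ_LEVELS_MHZ; higher is faster
    task: Optional[str] = None   # None means the core is dark


def thermal_step(cores: List[Core], t_limit_c: float) -> List[str]:
    """Apply one control step and return a log of the actions taken."""
    actions = []
    for hot in cores:
        if hot.task is None or hot.temp_c <= t_limit_c:
            continue  # dark or within the thermal limit: nothing to do
        # Prefer migration to a cooler dark core that shares the LLC.
        candidates = [
            c for c in cores
            if c.task is None
            and c.llc_group == hot.llc_group
            and c.temp_c < t_limit_c
        ]
        if candidates:
            dest = min(candidates, key=lambda c: c.temp_c)
            dest.task, hot.task = hot.task, None  # source core goes dark
            actions.append(
                f"migrate {dest.task}: core {hot.core_id} -> core {dest.core_id}"
            )
        elif hot.freq_level > 0:
            # No cooler dark core available: lower the frequency by one level.
            hot.freq_level -= 1
            actions.append(
                f"dvfs core {hot.core_id} -> {FREQ_LEVELS_MHZ[hot.freq_level]} MHz"
            )
    return actions


cores = [
    Core(0, 0, 85.0, 2, "A"),
    Core(1, 0, 60.0, 2, None),
    Core(2, 1, 90.0, 2, "B"),
    Core(3, 1, 88.0, 2, None),
]
log = thermal_step(cores, t_limit_c=80.0)
```

In this example, task A moves from core 0 to the cooler dark core 1, while core 2 has no cool dark neighbour in its LLC group and is stepped down one frequency level instead.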

Pipeline synthesis and optimization of FPGA-based video processing applications with CAL

Rahman

Prihozhy

Mattavelli

2011

J Image Video Proc.

This article describes a pipeline synthesis and optimization technique that increases the data throughput of FPGA-based systems using minimum pipeline resources. The technique is applied to the CAL dataflow language and is built on relations, matrices, and graphs. First, the initial as-soon-as-possible (ASAP) and as-late-as-possible (ALAP) schedules, and the corresponding mobility of operators, are generated. From these, an operator coloring technique is applied to conflict and nonconflict directed graphs using recursive functions and explicit stack mechanisms. For each feasible number of pipeline stages, a pipeline schedule with minimum total register width is taken as an optimal coloring, which is then automatically transformed to a description in CAL. The generated pipelined CAL descriptions are finally synthesized to hardware description languages for FPGA implementation. Experimental results for three video processing applications demonstrate up to 3.9× higher throughput for pipelined compared to non-pipelined implementations, and average total pipeline register width reductions of up to 39.6% and 49.9% for the optimal schedule relative to the ASAP and ALAP pipeline schedules, respectively.
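
The first step named in this abstract, computing ASAP and ALAP schedules and operator mobility, can be illustrated with a generic textbook sketch. It assumes unit-latency operators and a non-empty acyclic operator graph given as a successor map in which every operator appears as a key; it does not reproduce the article's relation and matrix formulation or its coloring step, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple

Schedule = Dict[str, int]


def asap_alap_mobility(
    succs: Dict[str, List[str]],
) -> Tuple[Schedule, Schedule, Schedule]:
    """Return (asap, alap, mobility) time steps for a unit-latency operator DAG."""
    nodes = list(succs)
    preds: Dict[str, List[str]] = {n: [] for n in nodes}
    for n, outs in succs.items():
        for s in outs:
            preds[s].append(n)

    # Topological order (Kahn's algorithm).
    indeg = {n: len(preds[n]) for n in nodes}
    ready = [n for n in nodes if indeg[n] == 0]
    order: List[str] = []
    while ready:
        n = ready.pop()
        order.append(n)
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    if len(order) != len(nodes):
        raise ValueError("operator graph contains a cycle")

    # ASAP: one step after the latest predecessor, or step 0 for inputs.
    asap: Schedule = {}
    for n in order:
        asap[n] = max((asap[p] + 1 for p in preds[n]), default=0)

    # ALAP: one step before the earliest successor, or the last step for outputs.
    last = max(asap.values(), default=0)
    alap: Schedule = {}
    for n in reversed(order):
        alap[n] = min((alap[s] - 1 for s in succs[n]), default=last)

    mobility = {n: alap[n] - asap[n] for n in nodes}
    return asap, alap, mobility


graph = {"a": ["c"], "b": ["d"], "c": ["d"], "d": []}
asap, alap, mobility = asap_alap_mobility(graph)
```

Operators with zero mobility lie on the critical path; those with positive mobility (here only `b`) are the ones a scheduler is free to move between stages when minimizing register width.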

VLSI Design of a Fast Pipelined 8x8 Discrete Cosine Transform

Zabidi

Rahman

2017

IJECE

This paper presents a Very Large Scale Integrated (VLSI) design and implementation of a fixed-point 8x8 multiplierless Discrete Cosine Transform (DCT) using the ISO/IEC 23002-2 algorithm. The standard DCT algorithm, which is mainly used in image and video compression technology, consists of only adders, subtractors, and shifters, making it efficient for hardware implementation. The VLSI implementation of the algorithm given in this paper further enhances the performance of the transform unit. Furthermore, circuit pipelining has been applied to the base design of the DCT, which significantly improves the performance by reducing the longest path in the non-pipelined design. The DCT has been implemented with a semi-custom VLSI design methodology using the TSMC 0.13 µm process technology. Results show that our DCT designs can run at up to around 1.7 Gigapixels/s, which is well above the rate required for real-time ultra-high definition 8K video.

Copyright © 2017 Institute of Advanced Engineering and Science. All rights reserved.

The DCT basis function is the cosine, where multiplication and addition are the main arithmetic operations involved. Much DCT-based research has been conducted in the past few years, producing different kinds of DCT algorithms, such as the Arai DCT scheme, Wang factorization, the Lee DCT for power-of-two block lengths, the Loeffler algorithm, and the Feig-Winograd factorization [2]. These algorithms have been used in practical applications. In recent image processing technology, various hardware implementations of the DCT use the Arai DCT scheme [3]. It uses only five multiplications and twenty-nine additions, which is fewer arithmetic operations than the other algorithms listed. For the MPEG technology, the International Organization for Standardization (ISO) released an optimized fixed-point multiplierless version of the DCT algorithm, suitable for image and video compression. The standard, which is called ISO/IEC 23002-2, is described and implemented in the present work [4].

VLSI designs of the DCT can be found in numerous articles, with an overview given in [5]. For comparison purposes, we have analyzed three similar designs. The work by Mandayake et al. [6] presents a VLSI architecture of the DCT using the Arai DCT scheme. It proposes a fast algorithm by reducing the number of integer channels. The design is implemented using 45 nm technology. The work by Wahid et al. [7] proposes an area-efficient fixed-point DCT architecture implemented in 0.18 µm CMOS technology. Another interesting work is by Fu et al. [8], where a low-power implementation is proposed based on an algebraic integer encoding technique. This work also uses 0.18 µm CMOS technology. Performance results for these works are given in the results section.

The present paper, on the other hand, describes the semi-custom Very Large Scale Integration (VLSI) design of the ISO/IEC 23002-2 DCT algorithm using TSMC 0.13 µm technology, similar to the design methodology used in
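
To illustrate what "multiplierless" means in the abstract above, the sketch below replaces a multiplication by a fixed-point constant with shifts and additions. The constant 181/256 (an 8-bit approximation of cos(π/4)) and the function name are chosen only for this example; they are assumptions and are not taken from ISO/IEC 23002-2 or from the paper's design.

```python
def mul_by_181_over_256(x: int) -> int:
    """Approximate x * cos(pi/4) as (x * 181) >> 8 using only shifts and adds."""
    # 181 = 128 + 32 + 16 + 4 + 1, so x * 181 needs four additions and no multiplier.
    acc = (x << 7) + (x << 5) + (x << 4) + (x << 2) + x
    # Dividing by 256 is a right shift by 8 bits (rounds toward negative infinity).
    return acc >> 8


scaled = mul_by_181_over_256(1000)
```

In hardware, each shift is just wiring, so a constant multiplication of this kind costs only adders, which is why such transforms map efficiently onto VLSI.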
