Vector processors are a very promising solution for mobile devices and servers due to their inherently energy-efficient way of exploiting data-level parallelism. While vector processors succeeded in the high-performance market in the past, they need re-tailoring for the mobile market that they are now entering. Functional units are key components of computation-intensive designs like vector architectures, and have a significant impact on overall performance and power. Therefore, there is a need for novel, vector-specific design space exploration and low-power techniques for vector functional units.
We present a design space exploration of the vector adder (VA) and vector multiplier unit (VMU). We examine the advantages and side effects of using multiple vector lanes and show how multilane designs perform across a broad frequency spectrum to achieve an energy-efficient speed-up. As the final result of our exploration, we derive Pareto-optimal design points and present guidelines on selecting the most appropriate VMU and VA for different types of vector processors according to different sets of metrics of interest.
To reduce the power of vector floating-point fused multiply-add units (VFU), we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for them. These techniques ensure power savings without jeopardizing performance. We focus on unexplored opportunities for applying clock-gating to vector processors, especially in active operating mode. Using vector masking and vector multilane-aware clock-gating, we report power reductions of up to 52%, assuming an active VFU operating at peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector floating-point instructions. Finally, when evaluating all techniques together, the power reductions are up to 80%.
We propose a methodology that enables performing this research in a fully parameterizable and automated fashion using two kinds of benchmarks: synthetic and "real-world" application-based. For this interrelated circuit-architecture research, we present novel frameworks with both architectural- and circuit-level tools, simulators, and generators (including ones that we developed). Our frameworks include both design-related parameters (e.g. adder family type) and vector architecture-related parameters (e.g. vector length).
Additionally, to find the optimal estimation flow, we perform a comparative analysis of the two most widely used estimation flows, Physical layout Aware Synthesis (PAS) and Place and Route (PnR), using a design space exploration as a case study. We study and compare post-PAS and post-PnR estimations of the metrics of interest, as well as the impact of various design parameters and of the input switching activity factor (aI).