In the low-end mobile processor market, power, energy and area budgets are significantly lower than in the server/desktop/laptop/high-end mobile markets. It has been shown that vector processors are a highly energy-efficient way to increase performance; however, adding support for them incurs area and power overheads that would not be acceptable for low-end mobile processors. In this work, we propose an integrated vector-scalar design for the ARM architecture that mostly reuses scalar hardware to support the execution of vector instructions. The key element of the design is our proposed block-based model of execution that groups vector computational instructions together to execute them in a coordinated manner. We implement a classic vector unit and compare its results against our integrated design. Our integrated design improves the performance (more than 6x) and energy consumption (up to 5x) of a scalar in-order core with negligible area overhead (only 4.7% when using a vector register with 32 elements). In contrast, the area overhead of the classic vector unit can be significant (around 44%) if a dedicated vector floating-point unit is incorporated. Our block-based vector execution outperforms the classic vector unit for all kernels with floating-point data and also consumes less energy. We also complement the integrated design with three energy-performance efficient techniques that further reduce power and increase performance. The first proposal covers the design and implementation of chaining logic that is optimized to work with the cache hierarchy through vector memory instructions, the second proposal reduces the number of reads/writes from/to the vector register file, while the third idea optimizes complex memory access patterns with the memory shape instruction and unified indexed vector load.
The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA no 321253 and is supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P. This research has also been supported by the Agency for Management of University and Research Grants (AGAUR - FI-DGR 2014). O. Palomar is funded by a Royal Society Newton International Fellowship. Current authors' affiliations: M. Stanic, ASML; O. Palomar, University of Manchester; T. Hayes, ARM. A. Cristal is also affiliated with CSIC-IIIA and UPC. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
INTRODUCTION

In the last 15 years, power dissipation and energy consumption have become crucial design concerns for almost all...