Jesús Alastruey-Benedé scite author profile

Abstract: Scaling supply voltage to values near the threshold voltage allows a dramatic decrease in the power consumption of processors; however, the lower the voltage, the higher the sensitivity to process variation, and, hence, the lower the reliability. Large SRAM structures, like the last-level cache (LLC), are extremely vulnerable to process variation because they are aggressively sized to satisfy high density requirements. In this paper, we propose Concertina, an LLC designed to enable reliable operation at low voltages with conventional SRAM cells. Based on the observation that for many applications the LLC contains large amounts of null data, Concertina compresses cache blocks so that they can be allocated to cache entries with faulty cells, enabling use of 100% of the LLC capacity. To distribute blocks among cache entries, Concertina implements a compression- and fault-aware insertion/replacement policy that reduces the LLC miss rate. With a modest storage overhead, Concertina reaches the performance of an ideal system whose LLC does not suffer from parameter variation. Specifically, performance degrades by less than 2%, even when using small SRAM cells, which implies over 90% of cache entries having defective cells, and this represents a notable improvement on previously proposed techniques.
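The idea of fitting a block with null data into an entry with faulty cells can be illustrated with a minimal sketch. It is not the Concertina design itself: the 8-byte subblock granularity, the per-entry fault map, and the function names `compress`, `fits`, and `decompress` are assumptions made here for illustration only.

```python
SUBBLOCK_BYTES = 8  # assumed granularity; the abstract does not state one


def compress(block: bytes, size: int = SUBBLOCK_BYTES):
    """Split a block into subblocks and keep only the non-null ones.

    Returns (mask, payload): mask[i] is True when subblock i holds at
    least one non-zero byte; payload lists those subblocks in order.
    """
    mask, payload = [], []
    for start in range(0, len(block), size):
        sub = block[start:start + size]
        non_null = any(sub)  # bytes iterate as ints, so 0 is falsy
        mask.append(non_null)
        if non_null:
            payload.append(sub)
    return mask, payload


def fits(block: bytes, faulty: list, size: int = SUBBLOCK_BYTES) -> bool:
    """True if the non-null subblocks fit in the entry's healthy subblocks.

    faulty[i] is True when subentry i contains a defective cell
    (hypothetical fault map, one flag per subentry).
    """
    _, payload = compress(block, size)
    return len(payload) <= faulty.count(False)


def decompress(mask, payload, size: int = SUBBLOCK_BYTES) -> bytes:
    """Rebuild the original block, re-inserting the null subblocks."""
    remaining = iter(payload)
    return b"".join(next(remaining) if m else bytes(size) for m in mask)
```

Under these assumptions, a 64-byte block with a single non-null subblock fits in an entry where seven of the eight subentries are faulty, while a block with no null data does not.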

Developing an AI IoT application with open software on a RISC-V SoC

Torres-Sanchez

Alastruey-Benedé

Torres-Moreno

2020

RISC-V is an emergent architecture that is gaining strength in low-power IoT applications. The stabilization of the architectural extensions and the start of commercialization of RISC-V based SoCs, like the Kendryte K210, raise the question of whether this open standard will facilitate the development of applications in specific markets. In this paper we evaluate the development environments, the toolchain, and the debugging processes related to the Sipeed MAIX Go development board, as well as the standalone SDK and the MicroPython port for the Kendryte K210. The training pipeline for the built-in convolutional neural network accelerator, with support for Tiny YOLO v2, has also been studied. In order to evaluate all the above aspects in depth, two low-cost, low-power IoT edge applications based on AI have been developed. The first one is capable of recognizing movement in a house and autonomously identifying whether it was caused by a human or by a house pet, such as a dog or a cat. In the context of the current COVID-19 pandemic, the second application is capable of labeling whether a pedestrian is wearing a face mask or not, doing real-time object recognition at a mean rate of 13 FPS. From this process we conclude that, despite the potential of the hardware and its excellent performance/cost ratio, the documentation for developers is scarce, the development environments are immature, and the debugging processes are sometimes nonexistent.

Memory hierarchy characterization of SPEC CPU2006 and SPEC CPU2017 on the Intel Xeon Skylake-SP

et al. 2019

SPEC CPU is one of the most common benchmark suites used in computer architecture research. CPU2017 has recently been released to replace CPU2006. In this paper we present a detailed evaluation of the memory hierarchy performance for both the CPU2006 and single-threaded CPU2017 benchmarks. The experiments were executed on an Intel Xeon Skylake-SP, which is the first Intel processor to implement a mostly non-inclusive last-level cache (LLC). We present a classification of the benchmarks according to their memory pressure and analyze the performance impact of different LLC sizes. We also test all the hardware prefetchers, showing that they improve performance in most of the benchmarks. After comprehensive experimentation, we can highlight the following conclusions: i) almost half of the SPEC CPU benchmarks have very low miss ratios in the second and third level caches, even with small LLC sizes and without hardware prefetching, ii) overall, the SPEC CPU2017 benchmarks demand even fewer memory hierarchy resources than the SPEC CPU2006 ones, iii) hardware prefetching is very effective in reducing LLC misses for most benchmarks, even with the smallest LLC size, and iv) from the memory hierarchy standpoint, the methodologies commonly used to select benchmarks or simulation points do not guarantee representative workloads.

Accelerating Sparse Arithmetic in the Context of Newton’s Method for Small Molecules with Bond Constraints

Mikkelsen

Alastruey-Benedé

Ibáñez

et al. 2016

Abstract. Molecular dynamics is used to study the time evolution of systems of atoms. It is common to constrain bond lengths in order to increase the time step of the simulation. Here we accelerate Newton's method for solving the constraint equations for a system consisting of many identical small molecules. Starting with a modular and generic base code using a sequential data layout, we apply three different optimization techniques. The compiled code approach is used to generate subroutines equivalent to a single step of Newton's method for a user specified molecule. Differing from the generic subroutines, these specific routines contain no loops and no indirect addressing. Interleaving the data describing different molecules generates vectorizable loops. Finally, we apply task fusion. The simultaneous application of all three techniques increases the speed of the base code by a factor of 15 for single precision calculations.
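The interleaving step mentioned in the abstract can be pictured with a small layout sketch. It is only an illustration of the general idea, not the authors' code: the function name `interleave` and the flat-list representation are assumptions, and plain Python does not vectorize, so the sketch shows only how the data is reordered.

```python
def interleave(sequential, n_molecules, values_per_molecule):
    """Reorder a flat list from a sequential to an interleaved layout.

    Sequential:  [m0v0, m0v1, ..., m1v0, m1v1, ...]  (one molecule after another)
    Interleaved: [m0v0, m1v0, ..., m0v1, m1v1, ...]  (same value of all molecules together)
    """
    return [
        sequential[m * values_per_molecule + v]
        for v in range(values_per_molecule)
        for m in range(n_molecules)
    ]


def value_across_molecules(interleaved, n_molecules, v):
    """Contiguous slice holding value v for every molecule."""
    return interleaved[v * n_molecules:(v + 1) * n_molecules]
```

In the interleaved layout, a loop that applies the same operation to value `v` of every molecule walks a contiguous slice, which is the access pattern a compiler can typically vectorize.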

Accelerating Sequence Alignments Based on FM-Index Using the Intel KNL Processor

Herruzo

González-Navarro

Ibáñez

et al. 2020

IEEE/ACM Trans. Comput. Biol. and Bioinf.

FM-index is a compact data structure suitable for fast matches of short reads to large reference genomes. The matching algorithm using this index exhibits irregular memory access patterns that cause frequent cache misses, resulting in a memory bound problem. This paper analyzes different FM-index versions presented in the literature, focusing on those computing aspects related to the data access. As a result of the analysis, we propose a new organization of FM-index that minimizes the demand for memory bandwidth, allowing a great improvement of performance on processors with high-bandwidth memory, such as the second-generation Intel Xeon Phi (Knights Landing, or KNL), integrating ultra high-bandwidth stacked memory technology. As the roofline model shows, our implementation reaches 95% of the peak random access bandwidth limit when executed on the KNL and almost all the available bandwidth when executed on other Intel Xeon architectures with conventional DDR memory. In addition, the obtained throughput in KNL is much higher than the results reported for GPUs in the literature.
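For readers unfamiliar with the matching algorithm, the following is a minimal sketch of the textbook FM-index backward search. It is not the reorganized index proposed in the paper: the naive suffix sort, the on-the-fly `occ` counting, and all function names are simplifications assumed here, and the text is assumed not to contain the `$` sentinel.

```python
def build_fm_index(text):
    """Build the BWT and the C table of text (naive, for illustration)."""
    text += "$"  # sentinel, assumed smaller than every other symbol
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0)
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # c_table[ch] = number of symbols in text strictly smaller than ch
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    return bwt, c_table


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i]; real indexes use sampled counters."""
    return bwt[:i].count(ch)


def count_matches(pattern, bwt, c_table):
    """Count occurrences of pattern via backward search."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        # Each step jumps to positions that depend on the previous interval,
        # which is the source of the irregular memory accesses.
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, with `bwt, c_table = build_fm_index("banana")`, the call `count_matches("ana", bwt, c_table)` counts the two overlapping occurrences of "ana".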
