Contention on the shared Last-Level Cache (LLC) can severely degrade the performance of applications executed on modern multicores. A promising software approach to LLC contention is page coloring, a technique that achieves performance isolation by partitioning the shared cache through careful memory management. The key assumption of traditional page coloring is that the cache is physically addressed. However, recent multicore architectures (e.g., Intel Sandy Bridge and later) switched from a physical addressing scheme to a more complex scheme that involves a hash function, and traditional page coloring is ineffective on these architectures. In this article, we extend page coloring to these recent architectures by proposing a mechanism able to handle their hash-based LLC addressing scheme. As with traditional page coloring, the goal of this new mechanism is to deliver performance isolation by avoiding contention on the LLC, thus enabling predictable performance. We implement this mechanism in the Linux kernel and evaluate it using several benchmarks from the SPEC CPU2006 and PARSEC 3.0 suites. Our results show that our solution delivers performance isolation to concurrently running applications by enforcing partitioning of a Sandy Bridge LLC, which traditional page coloring techniques cannot handle.
CCS Concepts: • Computer systems organization → Multicore architectures; • Software and its engineering → Main memory;
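To make the addressing assumption concrete, the sketch below shows how traditional page coloring derives a page's color from its physical address on a physically-indexed LLC. The parameters (4 KiB pages, 64-byte lines, 8192 sets) are purely illustrative and not tied to any specific processor; on hash-based designs such as Sandy Bridge, an additional undocumented hash over higher address bits selects the cache slice, so these color bits alone no longer determine where a line is cached, which is why the classic scheme breaks down.

```python
# Minimal sketch of traditional page coloring (illustrative cache parameters only).
PAGE_SIZE = 4096    # 4 KiB pages -> 12 page-offset bits
LINE_SIZE = 64      # 64-byte cache lines -> 6 line-offset bits
NUM_SETS  = 8192    # hypothetical number of LLC sets -> 13 set-index bits

PAGE_BITS = PAGE_SIZE.bit_length() - 1
LINE_BITS = LINE_SIZE.bit_length() - 1
SET_BITS  = NUM_SETS.bit_length() - 1

# A page's color is given by the set-index bits that lie above the page offset;
# pages of different colors can never map to the same LLC set.
NUM_COLORS = 1 << (LINE_BITS + SET_BITS - PAGE_BITS)   # 128 with these parameters

def page_color(phys_addr: int) -> int:
    """Color of the physical page containing phys_addr (physically-indexed LLC)."""
    set_index = (phys_addr >> LINE_BITS) & (NUM_SETS - 1)
    return set_index >> (PAGE_BITS - LINE_BITS)

if __name__ == "__main__":
    print(NUM_COLORS, "colors")
    print(page_color(0x00000000), page_color(0x00020000))  # 0 vs. 32: disjoint LLC sets
```

An OS page allocator can then reserve disjoint color sets for different applications, so their working sets occupy disjoint regions of the LLC.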
Convolutional Neural Networks (CNNs) are at the base of many applications, both in embedded and in server-class contexts. While Graphics Processing Units (GPUs) are predominantly used for training, solutions for inference often rely on Field Programmable Gate Arrays (FPGAs), since they are more flexible and cost-efficient in many scenarios. However, existing approaches fall short of accomplishing several conflicting goals, such as efficiently using resources on multiple platforms while retaining deep configurability and allowing a quick Design Space Exploration (DSE) towards the best solution. This paper proposes a solution composed of highly configurable kernels designed for resource time-sharing, together with an analytical model of their resource/performance characteristics. Building on such models, we propose an Integer Linear Programming (ILP)-based approach to effectively identify Pareto-optimal kernel configurations in terms of throughput and resource consumption. We evaluate our DSE on two state-of-the-art CNNs, showing how it identifies hundreds of Pareto-optimal solutions in less than a minute. Guided by the DSE configurations for the AlexNet network, we quickly identified a candidate design for a Xilinx Virtex-7 XC7VX485T FPGA and achieved a peak performance of 4.05 ms per image, while measuring a maximum estimation error of 6.69% with respect to the proposed analytical models.
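As a rough illustration of the ILP step, the following sketch uses the open-source PuLP solver to pick one configuration per kernel so that pipeline throughput (bounded by the slowest kernel) is maximized under a resource budget; sweeping the budget then traces throughput/resource trade-off points. The candidate numbers, the single-resource (DSP-only) model, and all names are invented for illustration and are far simpler than the formulation used in the paper.

```python
# Hypothetical, much-simplified version of an ILP-based DSE step, using PuLP.
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum, PULP_CBC_CMD

# Candidate configurations per kernel: (estimated images/s, DSP slices used).
candidates = {
    "conv": [(120, 400), (200, 900), (260, 1500)],
    "pool": [(500, 50), (800, 120)],
}
DSP_BUDGET = 1600   # sweep this value to trace throughput/resource trade-offs

prob = LpProblem("cnn_dse", LpMaximize)
x = {(k, i): LpVariable(f"x_{k}_{i}", cat=LpBinary)
     for k, cfgs in candidates.items() for i in range(len(cfgs))}
throughput = LpVariable("pipeline_throughput", lowBound=0)

prob += throughput   # objective: maximize the throughput of the slowest kernel
for k, cfgs in candidates.items():
    prob += lpSum(x[k, i] for i in range(len(cfgs))) == 1   # one config per kernel
    prob += throughput <= lpSum(t * x[k, i] for i, (t, _) in enumerate(cfgs))
prob += lpSum(d * x[k, i]                                    # stay within the DSP budget
              for k, cfgs in candidates.items()
              for i, (_, d) in enumerate(cfgs)) <= DSP_BUDGET

prob.solve(PULP_CBC_CMD(msg=0))
chosen = {k: i for (k, i), var in x.items() if var.value() > 0.5}
print(chosen, "->", throughput.value(), "images/s")
```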
Motivation and Contribution
The commodity multicores that power cloud infrastructures hide memory latency through deep memory hierarchies, with the last-level cache (LLC) usually shared among cores. While a shared LLC improves utilization of on-chip resources, it may also lead to unpredictable performance of co-located virtual machines (VMs) as a result of unanticipated contention. Past research showed that the operating system page allocator can favor performance predictability on a physically-addressed shared LLC through page coloring [4,8,9]: a software technique that works on commodity multicores, unlike hardware approaches [2,7]. The main drawback of page coloring is the high cost of modifying allocations (i.e., recoloring), which makes the technique almost impractical for applications with varying memory footprints [6]. We aim to find the simplest technique applicable to commodity multicores to avoid unpredictable performance of co-located VMs. Since VMs have a bounded memory footprint (i.e., stated at deployment time), we can leverage page coloring and avoid the cost of recoloring. We designed and implemented Rainbow, a page allocator that exploits page coloring to expose a new VM configuration knob, cache allocation, which implicitly determines a proportional memory allocation. This knob provides users with predictable performance and clouds with a safe way of co-locating VMs.
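The proportionality between the cache-allocation knob and memory follows directly from coloring: a VM holding k of the machine's C page colors can only be backed by pages of those colors, i.e., roughly k/C of the colorable RAM. The sketch below illustrates this relation with invented numbers (64 colors, 32 GiB of colorable RAM); the function and its interface are hypothetical, not Rainbow's actual API.

```python
# Hypothetical sketch: mapping a requested cache share to a color set and the
# memory ceiling it implies. Numbers and interface are illustrative only.
TOTAL_COLORS = 64          # page colors available on this (imaginary) machine
COLORABLE_RAM_GIB = 32     # RAM managed by the colored page allocator

def assign_colors(requested_cache_fraction: float, next_free_color: int):
    """Reserve a contiguous range of colors for a VM and return the implied memory cap."""
    ncolors = max(1, round(requested_cache_fraction * TOTAL_COLORS))
    colors = list(range(next_free_color, next_free_color + ncolors))
    memory_ceiling_gib = COLORABLE_RAM_GIB * ncolors / TOTAL_COLORS
    return colors, memory_ceiling_gib

colors, cap = assign_colors(0.25, next_free_color=0)
print(len(colors), "colors -> at most", cap, "GiB of RAM")   # 16 colors -> 8.0 GiB
```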
Fig. 1. Slowdown of bzip co-located with different applications.
Multi- and many-core processors have emerged as the dominant solution for processing across the whole range of computer systems, from small devices to large-scale installations. Chip multi-processors, which are homogeneous multi- and many-core processors, offer an unprecedented amount of on-chip shared resources and bring a unique set of challenges. Given the importance of Last-Level Cache management techniques for achieving near-perfect isolation, we survey the state of the art in this area. To better frame the various research directions in the field, we propose a classification of the presented techniques. Finally, we discuss the most pressing open issues in modern computer systems and possible research directions.