The most common task of GPUs is to render images in real time. When rendering a 3D scene, a key step is to determine which parts of every object are visible in the final image. There are different approaches to solving the visibility problem, the Z-Test being the most common. A major factor that penalizes the energy efficiency of a GPU, especially in the mobile arena, is the so-called overdraw, which happens when a portion of an object is shaded and rendered but ultimately occluded by another object. This useless work results in a waste of energy; however, a conventional Z-Test only avoids a fraction of it. In this paper we present a novel microarchitectural technique, the Ω-Test, to drastically reduce overdraw on a Tile-Based Rendering (TBR) architecture. Graphics applications exhibit a high degree of inter-frame coherence, which makes the output of a frame very similar to that of the previous one. The proposed approach leverages this frame-to-frame coherence by reusing the result of the Z-Test for a tile (a buffer containing the resolved depth of every pixel in the tile), which today's GPUs discard, to predict the visibility of the same tile in the next frame. As a result, the Ω-Test identifies occluded parts of the scene early and avoids rendering non-visible surfaces, eliminating costly computations and off-chip memory accesses. Our experimental evaluation shows average EDP savings in the overall GPU/Memory system of 26.4% and an average speedup of 16.3% for the evaluated benchmarks.
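To make the idea concrete, the sketch below (a minimal illustration, not the paper's actual hardware design) shows how a tile's depth information from one frame could be retained and reused to predict occlusion in the next frame. The type and function names, the depth convention (larger values are farther), and the conservative min/max comparison are illustrative assumptions; the handling of mispredictions is omitted.

```cpp
// Minimal sketch of cross-frame visibility prediction for a tile.
// Illustrative names and policies; not the paper's exact microarchitecture.
#include <array>
#include <vector>

constexpr int TILE_W = 16, TILE_H = 16;

// Per-tile depth information kept at the end of frame N instead of being
// discarded once the Z-Test has resolved visibility.
struct TileDepthRecord {
    std::array<float, TILE_W * TILE_H> depth;  // resolved per-pixel depths
    float max_depth;                           // farthest visible depth in the tile
};

struct Primitive {
    float min_depth;  // nearest depth the primitive can reach within this tile
    // ... screen-space coverage, attributes, etc.
};

// Conservative prediction for frame N+1: if the nearest point of the primitive
// lies behind everything that was visible in this tile in frame N, the
// primitive is predicted fully occluded and its shading can be skipped.
bool predict_occluded(const Primitive& prim, const TileDepthRecord& prev) {
    return prim.min_depth > prev.max_depth;
}

// Example driver: filter a tile's primitive list before rasterization.
std::vector<Primitive> drop_occluded(const std::vector<Primitive>& prims,
                                     const TileDepthRecord& prev) {
    std::vector<Primitive> survivors;
    for (const auto& p : prims)
        if (!predict_occluded(p, prev))
            survivors.push_back(p);
    return survivors;
}
```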
Modern mobile GPUs integrate an increasing number of shader cores to speed up the execution of graphics workloads. Each core integrates a private Texture Cache to apply texturing effects on objects, backed by a shared L2 cache. However, as in any other memory hierarchy, such an organization produces data replication in the upper levels (i.e., the private Texture Caches) to allow for faster accesses, at the expense of reducing their overall effective capacity. For example, in a mobile GPU with four shader cores, about 84.6% of the requested texture blocks are replicated in at least one of the other private Texture Caches. This paper proposes a novel dynamically-mapped Non-Uniform Cache Architecture (NUCA) organization for the private Texture Caches of a mobile GPU aimed at increasing their effective overall capacity and decreasing the overall access latency by curbing data replication. A block that misses in the local Texture Cache may be serviced by a remote one at a cost lower than a round trip to the shared L2. The proposed Dynamic Texture Mapping-NUCA (DTM-NUCA) features a lightweight mapping table, called the Affinity Table, whose size is independent of the L2 cache size, unlike a traditional NUCA organization. The best owner for a given set of blocks is dynamically determined and stored in the Affinity Table to maximize local accesses. The mechanism also allows a certain amount of replication to favor local accesses where appropriate, without hurting performance, since the resulting capacity loss is small. DTM-NUCA is presented in two flavors: one with a centralized Affinity Table and another with a distributed one. Experimental results show that L2 pressure is effectively reduced, eliminating 41.8% of L2 accesses on average. Regarding average latency, DTM-NUCA is very effective at favoring local over remote accesses, achieving 73.8% local accesses on average. As a consequence, our novel DTM-NUCA organization obtains an average speedup of 16.9% and overall energy savings of 7.6% over a conventional organization.
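As a rough illustration of the lookup path, the sketch below models an Affinity Table that maps groups of texture-block addresses to an owner core and migrates ownership toward the core that accesses each group most often. The table size, the address hashing, and the counter-based migration policy are illustrative assumptions, not the paper's exact mechanism.

```cpp
// Minimal sketch of an Affinity-Table lookup for texture blocks.
// Illustrative sizes and policies; not the exact DTM-NUCA hardware.
#include <array>
#include <cstddef>
#include <cstdint>

constexpr int NUM_CORES = 4;
constexpr std::size_t AFFINITY_ENTRIES = 1024;  // independent of the L2 size

struct AffinityEntry {
    std::uint8_t owner = 0;                       // core whose Texture Cache owns this block group
    std::array<std::uint16_t, NUM_CORES> hits{};  // per-core access counters used to re-learn the owner
};

struct AffinityTable {
    std::array<AffinityEntry, AFFINITY_ENTRIES> entries{};

    static std::size_t index(std::uint64_t block_addr) {
        return (block_addr >> 6) % AFFINITY_ENTRIES;  // group cache blocks by a simple address hash
    }

    // Returns the core that should service the block: a local access if the
    // requester owns it, otherwise a remote Texture Cache access, which is
    // still cheaper than a round trip to the shared L2.
    int lookup(std::uint64_t block_addr, int requesting_core) {
        AffinityEntry& e = entries[index(block_addr)];
        if (++e.hits[requesting_core] > e.hits[e.owner])
            e.owner = static_cast<std::uint8_t>(requesting_core);  // migrate ownership to the hottest core
        return e.owner;
    }
};
```

A distributed flavor could, for instance, bank such a table across the shader cores so that each core holds the entries for its own slice of the address space; this banking scheme is also an assumption rather than the paper's design.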
This paper proposes a novel microarchitectural approach for mobile GPUs aimed at removing the occluded geometry in a scene early by leveraging frame-to-frame coherence, thus reducing the overall energy consumption. Mobile GPUs commonly implement a Tile-Based Rendering (TBR) architecture, which distinguishes two main phases: the Geometry Pipeline, where all the geometry of a scene is processed; and the Raster Pipeline, where primitives are rendered into the framebuffer. After the Geometry Pipeline, only non-culled primitives inside the camera's frustum are stored into the Parameter Buffer, a data structure kept in DRAM. However, a significant fraction of the non-culled primitives are rendered but not visible at all, resulting in useless computations. On average, 60% of those primitives are completely occluded in our benchmarks. Although TBR architectures use on-chip caches for the Parameter Buffer, about 46% of the DRAM traffic still comes from accesses to this buffer. The proposed Triangle Dropping technique leverages the visibility information computed in the Raster Pipeline to predict each primitive's visibility in the next frame and discard early those that will be totally occluded, drastically reducing Parameter Buffer accesses. On average, our approach achieves overall energy savings of 14.5%, energy-delay product savings of 28.2% and a speedup of 20.2%.
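The sketch below outlines one way the visibility feedback loop could be organized: the Raster Pipeline of frame N records which primitives produced no visible fragments, and the Geometry Pipeline of frame N+1 consults that record before writing primitives to the Parameter Buffer. The primitive identifiers and the predictor layout are illustrative assumptions, not the paper's exact structures.

```cpp
// Minimal sketch of cross-frame per-primitive visibility feedback.
// Illustrative identifiers and data structures; not the paper's exact design.
#include <cstdint>
#include <unordered_set>

struct Primitive {
    std::uint32_t id;  // assumed stable identifier across frames (e.g., draw call + index)
    // ... vertex data, attributes, etc.
};

class TriangleDropper {
    std::unordered_set<std::uint32_t> occluded_last_frame_;  // filled by the Raster Pipeline
    std::unordered_set<std::uint32_t> occluded_this_frame_;

public:
    // Geometry Pipeline of frame N+1: primitives predicted fully occluded are
    // dropped before being stored into the Parameter Buffer.
    bool should_drop(const Primitive& p) const {
        return occluded_last_frame_.count(p.id) != 0;
    }

    // Raster Pipeline of frame N: record whether the primitive produced any
    // visible fragment after the Z-Test.
    void report_visibility(const Primitive& p, bool any_fragment_visible) {
        if (!any_fragment_visible)
            occluded_this_frame_.insert(p.id);
    }

    // Called at the end of every frame to roll the prediction window forward.
    void next_frame() {
        occluded_last_frame_ = std::move(occluded_this_frame_);
        occluded_this_frame_.clear();
    }
};
```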
An important drawback of cycle-accurate microarchitectural simulators is that they are several orders of magnitude slower than the system they model. This becomes an important issue when simulations have to be repeated multiple times while sweeping over the desired design space. In the specific context of graphics workloads, performing cycle-accurate simulations is even more demanding due to the high number of triangles that have to be shaded, lit and textured to compose a single frame. As a result, simulating a few minutes of a video game sequence is extremely time-consuming. In this paper, we make the observation that collecting information about the vertices and primitives that are processed, along with the number of times shader programs are invoked, allows us to characterize the activity performed in a given frame. Based on that, we propose a novel methodology for the efficient simulation of graphics workloads called MEGsim, an approach that is capable of accurately characterizing entire video sequences by using a small subset of selected frames, which substantially reduces simulation time. For a set of popular Android games, we show that MEGsim achieves an average simulation speedup of 126×, producing remarkably accurate estimates of the final statistics, e.g., with average relative errors of just 0.84% for the total number of cycles, 0.99% for the number of DRAM accesses, 1.2% for the number of L2 cache accesses, and 0.86% for the number of L1 (tile cache) accesses.
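As a rough illustration of how per-frame activity information could drive frame selection, the sketch below groups frames with similar activity vectors and keeps one representative per group; only the representatives would then be simulated cycle-accurately, with their statistics scaled by each group's population. The signature fields, the greedy clustering and the distance threshold are illustrative assumptions, not MEGsim's actual selection algorithm.

```cpp
// Minimal sketch of representative-frame selection from per-frame activity.
// Illustrative signature and clustering; not MEGsim's actual algorithm.
#include <cmath>
#include <cstddef>
#include <vector>

struct FrameSignature {
    double vertices;            // vertices processed in the frame
    double primitives;          // primitives processed in the frame
    double shader_invocations;  // shader program invocations in the frame
};

static double signature_distance(const FrameSignature& a, const FrameSignature& b) {
    const double dv = a.vertices - b.vertices;
    const double dp = a.primitives - b.primitives;
    const double ds = a.shader_invocations - b.shader_invocations;
    return std::sqrt(dv * dv + dp * dp + ds * ds);
}

// Greedily groups similar frames and returns the index of one representative
// per group; in a full flow, statistics from each representative would be
// weighted by the number of frames its group contains.
std::vector<std::size_t> select_representatives(const std::vector<FrameSignature>& frames,
                                                double threshold) {
    std::vector<std::size_t> reps;
    for (std::size_t i = 0; i < frames.size(); ++i) {
        bool covered = false;
        for (std::size_t r : reps) {
            if (signature_distance(frames[i], frames[r]) < threshold) {
                covered = true;
                break;
            }
        }
        if (!covered)
            reps.push_back(i);  // frame i starts a new group and represents it
    }
    return reps;
}
```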