For many-core architectures like GPUs, efficient off-chip memory access is crucial to high performance; applications are often limited by off-chip memory bandwidth. Transforming data layout is an effective way to reshape access patterns and improve off-chip memory access behavior, but several challenges have limited the use of automated data layout transformation systems on GPUs, namely how to efficiently handle arrays of aggregates, and how to transparently marshal data between the layouts required by different performance-sensitive kernels and legacy host code. While GPUs have higher memory bandwidth and are natural candidates for marshaling data between layouts, the relatively constrained GPU memory capacity, compared to that of the CPU, implies that not only the temporal cost of marshaling but also the spatial overhead must be considered for any practical layout transformation system. This paper presents DL, a practical GPU data layout transformation system that addresses these problems. First, a novel approach to laying out arrays of aggregate types across GPU and CPU architectures is proposed to further improve memory parallelism and kernel performance beyond what is achieved by human programmers using discrete arrays today. Our proposed layout can be derived in situ from the traditional Array of Structures, Structure of Arrays, and adjacent Discrete Arrays layouts used by programmers. Second, DL has a run-time library, implemented in OpenCL, that transparently and efficiently converts, or marshals, data to accommodate application components that have different data layout requirements. We present the insights that lead to the design of this highly efficient run-time marshaling library. In particular, the in situ transformation implemented in the library is comparable to or faster than optimized traditional out-of-place transformations while avoiding doubling the GPU DRAM usage. Third, we show experimental results demonstrating that the new layout approach leads to substantial performance improvement at the application level even when all marshaling cost is taken into account.
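To make the layout terminology above concrete, the following CUDA-style sketch contrasts the Array of Structures, Structure of Arrays, and a tiled hybrid layout in the spirit of what DL derives in situ. The struct fields, the tile width T, and the kernel are illustrative assumptions, not the paper's actual types or tuned parameters.

```cuda
// Illustrative only: three ways to lay out N "particles" with fields x and y.
// The tiled hybrid is a sketch of an AoS/SoA compromise; T is an arbitrary
// tile width chosen here to match the warp size.

#define T 32

// Array of Structures (AoS): the fields of one element are adjacent.
struct ParticleAoS { float x, y; };          // aos[i].x, aos[i].y

// Structure of Arrays (SoA) / discrete arrays: each field is its own array.
struct ParticlesSoA { float *x; float *y; }; // soa.x[i], soa.y[i]

// Tiled hybrid: AoS at tile granularity, SoA inside each tile, so consecutive
// threads reading the same field touch contiguous floats within a tile.
struct ParticleTile { float x[T]; float y[T]; };

__global__ void scale_x(ParticleTile *tiles, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tiles[i / T].x[i % T] *= s;  // field x of a tile is one contiguous segment
}
```

With T equal to the warp width, a warp reading field x of consecutive elements touches one contiguous segment per tile, which is what makes the hybrid layout friendlier to memory coalescing than plain AoS.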
It is unquestionable that successive hardware generations have significantly improved GPU computing workload performance over the last several years. Moore's law and DRAM scaling have respectively increased single-chip peak instruction throughput by 3X and off-chip bandwidth by 2.2X from NVIDIA's GeForce 8800 GTX in November 2006 to its GeForce GTX 580 in November 2010. However, raw capability numbers typically underestimate the improvements in real application performance over the same time period, due to significant architectural feature improvements. To demonstrate the effects of architecture features and optimizations over time, we conducted experiments on a set of benchmarks from diverse application domains for multiple GPU architecture generations to understand how much performance has truly been improving for those workloads. First, we demonstrate that certain architectural features make a huge difference in the performance of unoptimized code, such as the inclusion of a general cache, which can improve performance by 2-4x in some situations. Second, we describe what optimization patterns have been most essential and widely applicable for improving performance for GPU computing workloads across all architecture generations. Some important optimization patterns included data layout transformation, converting scatter accesses to gather accesses, GPU workload regularization, and granularity coarsening, each of which improved performance on some benchmark by over 20%, sometimes by a factor of more than 5x. While hardware improvements to baseline unoptimized code can reduce the speedup magnitude, these patterns remain important for even the most recent GPUs. Finally, we identify which added architectural features created significant new optimization opportunities, such as increased register file capacity or reduced bandwidth penalties for misaligned accesses, which increase performance by 2x or more in the optimized versions of relevant benchmarks.

While no community or field springs out of nothing, the modern field of GPU computing had a major inflection point approximately five years ago with the first support for C-based programming languages for general computation on GPUs. Very quickly, the community discovered and published what worked well on GPU platforms and what didn't at first. As the years progressed, GPU architects and application researchers continually pushed at the boundaries of what GPUs could do effectively, significantly improving performance for many workloads. At this juncture, with five years of experience and a new academic conference explicitly dedicated to novel parallel computing platforms and applications, we would like to examine some GPU computing workloads and see how far we have come, and what is most important to learn from the current state of the art as we continue to move forward. We would like to focus on two major aspects of the GPU computing field over the last five years. The first is the optimization and programming patterns that have shaped optimized app...
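One of the optimization patterns named above, converting scatter accesses to gather accesses, can be illustrated with a small hedged sketch. The toy 1D "spreading" kernel, its radius R, and all names below are invented for illustration and are not taken from the paper's benchmarks.

```cuda
// Illustrative only: the scatter-to-gather conversion pattern on a toy
// 1D spreading kernel. R is a hypothetical radius.

#define R 2

// Scatter version: each thread pushes its input value into its neighbors'
// output slots, so different threads contend for the same output element.
__global__ void spread_scatter(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int d = -R; d <= R; ++d) {
        int j = i + d;
        if (j >= 0 && j < n)
            atomicAdd(&out[j], in[i]);   // contended writes, serialized by atomics
    }
}

// Gather version: each thread owns one output element and reads the inputs
// that contribute to it, so writes are private and need no atomics.
__global__ void spread_gather(const float *in, float *out, int n) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n) return;
    float sum = 0.0f;
    for (int d = -R; d <= R; ++d) {
        int i = j + d;
        if (i >= 0 && i < n)
            sum += in[i];                // contiguous, contention-free reads
    }
    out[j] = sum;
}
```

The gather form gives each thread exclusive ownership of its output element, so the atomics and their serialization disappear and the reads become contiguous.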
Matrix transposition is an important algorithmic building block for many numeric algorithms such as the FFT. It has also been used to convert the storage layout of arrays. With more and more algebra libraries offloaded to GPUs, a high-performance in-place transposition becomes necessary. Intuitively, in-place transposition should be a good fit for GPU architectures due to the limited available on-board memory capacity and high throughput. However, direct application of CPU in-place transposition algorithms lacks the amount of parallelism and locality required by GPUs to achieve good performance. In this paper we present the first known in-place matrix transposition approach for GPUs. Our implementation is based on a novel 3-stage transposition algorithm where each stage is performed using an elementary tile-wise transposition. Additionally, when transposition is done as part of the memory transfer between GPU and host, our staged approach allows hiding the transposition overhead by overlapping it with the PCIe transfer. We show that the 3-stage algorithm allows larger tiles and achieves 3X speedup over a traditional 4-stage algorithm, with both algorithms based on our high-performance elementary transpositions on the GPU. We also show that our proposed low-level optimizations improve the sustained throughput to more than 20 GB/s. Finally, we propose an asynchronous execution scheme that allows CPU threads to delegate in-place matrix transposition to the GPU, achieving a throughput of more than 3.4 GB/s (including data transfer costs) and improving on current multithreaded implementations of in-place transposition on the CPU.
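As a rough illustration of the tile-wise building block that such staged algorithms rest on, here is a hedged CUDA sketch of an elementary in-place transpose of a square matrix whose side is a multiple of the tile size. It is not the paper's 3-stage algorithm for rectangular matrices; the tile size and kernel structure are assumptions made for illustration only.

```cuda
// Illustrative only: elementary tiled in-place transpose of an n x n matrix
// (n a multiple of TILE). Each block swaps one pair of mirrored tiles through
// shared memory; diagonal tiles are transposed in place.

#define TILE 32

__global__ void transpose_square_inplace(float *A, int n) {
    __shared__ float a[TILE][TILE + 1];   // +1 column avoids shared-memory bank conflicts
    __shared__ float b[TILE][TILE + 1];

    int bx = blockIdx.x, by = blockIdx.y;
    if (by > bx) return;                  // only blocks with by <= bx do work

    int x  = bx * TILE + threadIdx.x;     // element in tile (bx, by)
    int y  = by * TILE + threadIdx.y;
    int tx = by * TILE + threadIdx.x;     // element in the mirrored tile (by, bx)
    int ty = bx * TILE + threadIdx.y;

    a[threadIdx.y][threadIdx.x] = A[y * n + x];
    b[threadIdx.y][threadIdx.x] = A[ty * n + tx];
    __syncthreads();

    A[y * n + x]   = b[threadIdx.x][threadIdx.y];   // write back the transposed mirror tile
    A[ty * n + tx] = a[threadIdx.x][threadIdx.y];
}
```

A launch such as dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE); covers every tile pair once. Rectangular matrices are harder precisely because tiles cannot simply be swapped with their mirror images, which is what motivates the staged decomposition described in the abstract.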
A study of the implementation patterns among massively threaded applications for many-core GPUs reveals that each of the seven most commonly used algorithm and data optimization techniques can enhance the performance of applicable kernels by 2 to 10× in current processors while also improving future scalability.

Recent many-core processors support hundreds of hardware threads and require thousands of tasks in applications to fully utilize their execution throughput. Scaling to such large numbers of threads requires more than just identifying and expressing parallelism. Each of those thousands of tasks will require data bandwidth, an increasingly limited resource in comparison to the compute throughput capabilities of high-performance systems. For many applications, threads also need mediated access to some shared data accumulating their results. Massively threaded commodity many-core processors introduce the challenge of parallel performance scalability to the mainstream. Programmers can overcome such hurdles to achieving scalable performance by adjusting algorithms to rely more on on-chip and thread-private storage, economizing on off-chip memory traffic. Caches or other on-chip memories can manage locality. The challenge is that these techniques require per-thread on-chip memory resources, which are decreasing in massively threaded processors and are predicted to continue to do so [1].

As programmers face these challenges in more applications, there is an increasing demand for best practices for achieving good scaling. We conducted a survey of the field through our review of 75 application articles for the GPU Computing Gems series [2, 3] and while developing the Parboil accelerator benchmark suite [4]. Here, our focus is on choosing algorithms with low computational complexity. In addition, we do not include many commonplace optimizations that we believe do not directly affect inherent scalability. Several patterns emerged from our survey, each of which we generalize here as a "technique." For each technique that we describe, we implemented a version of at least one of the Parboil benchmarks that lacked that technique but was otherwise well optimized, compared to the fastest implementation currently known and available to us. Unless otherwise noted, we collected the performance results on an Nvidia GeForce GTX 480. Since we are focusing on GPU scalability, we only compare kernel execution times, avoiding any assumptions about data transmission costs.

TECHNIQUES FOR SCALABLE PERFORMANCE: The disparity between off-chip data access bandwidth and a massively threaded system's ability to consume that ...
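To give one concrete flavor of the kind of technique surveyed, the sketch below shows privatization of a shared accumulator (a histogram) into on-chip shared memory, one common way to provide the mediated access to shared data mentioned above while economizing on off-chip traffic. The bin count and kernel structure are illustrative assumptions rather than code from the article or from Parboil.

```cuda
// Illustrative only: a privatized histogram. Each block accumulates into a
// block-private copy in on-chip shared memory and merges it into global
// memory once, instead of every thread contending on global atomics.

#define BINS 256

__global__ void histogram_privatized(const unsigned char *data, int n,
                                     unsigned int *global_hist) {
    __shared__ unsigned int local_hist[BINS];

    // Cooperatively zero the block-private histogram.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        local_hist[b] = 0;
    __syncthreads();

    // Accumulate in fast on-chip memory; contention stays within one block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // One merge per block into the shared global result.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], local_hist[b]);
}
```

Each block pays for a single merge into global memory instead of one global atomic update per input element, trading a small amount of on-chip storage for a large reduction in off-chip traffic and contention.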