2020
DOI: 10.1002/cpe.6018

Comparing unified, pinned, and host/device memory allocations for memory‐intensive workloads on Tegra SoC

Abstract: Edge computing focuses on processing near the source of the data. Edge computing devices using the Tegra SoC architecture provide a physically distinct GPU memory architecture. To take advantage of this architecture, different modes of memory allocation need to be considered, since different GPU memory allocation techniques yield different memory usage and execution times for identical applications on Tegra devices. In this article, we implement several GPU application benchmarks, including…
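As a minimal sketch of the three allocation modes the abstract compares (kernel, names, and sizes are illustrative, not taken from the paper's benchmarks):

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Trivial kernel used to touch a buffer from the device.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // 1. Unified memory: one pointer valid on host and device;
    //    on Tegra SoCs host and GPU share physical memory anyway.
    float* unified;
    cudaMallocManaged(&unified, bytes);
    scale<<<(n + 255) / 256, 256>>>(unified, n);
    cudaDeviceSynchronize();
    cudaFree(unified);

    // 2. Pinned (page-locked) host memory: a host allocation the GPU
    //    can copy from, or access directly, without a staging buffer.
    float* pinned;
    cudaMallocHost(&pinned, bytes);
    // ... fill pinned[] on the host, then copy or map as needed ...
    cudaFreeHost(pinned);

    // 3. Conventional host/device pair: separate allocations with
    //    explicit copies between them.
    float* host = (float*)malloc(bytes);
    float* device;
    cudaMalloc(&device, bytes);
    cudaMemcpy(device, host, bytes, cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(device, n);
    cudaMemcpy(host, device, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device);
    free(host);
    return 0;
}
```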

Cited by 8 publications (5 citation statements) · References 17 publications

Citation statements (ordered by relevance):
“…where $a_n(j) = (-1)^{\lfloor (B-1-j)/(B-1) \rfloor}\, b_n(j)$, $\lfloor \cdot \rfloor$ is the greatest integer function, and $j \in [0, B-1]$. From (22), it is clear that the filter partial products $a_n(j)$ undergo shift-accumulation for $B$ clock cycles, with sign inversion at the 0th clock cycle. The term $b_n(j)$ can be expressed as…”
Section: A. Inner Product Using TC DA
confidence: 99%
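The quoted shift-accumulate recurrence is straightforward to check in software. Below is a minimal sketch (my own, not taken from the cited paper) of a bit-serial two's-complement distributed-arithmetic inner product, with the sign inversion at clock cycle $j = 0$ exactly as the quoted formula prescribes; all names are illustrative:

```cuda
#include <cstdio>

// Bit-serial DA inner product over B-bit two's-complement inputs x[].
// Cycle j processes bit (B-1-j) of every x[n]; the partial product is
// sign-inverted only at j = 0 (the MSB pass), matching
// a_n(j) = (-1)^floor((B-1-j)/(B-1)) * b_n(j).
int inner_product_da(const int* x, const int* w, int N, int B) {
    int acc = 0;
    for (int j = 0; j < B; ++j) {             // one pass per clock cycle
        int bit_index = B - 1 - j;            // MSB first at j = 0
        int partial = 0;
        for (int n = 0; n < N; ++n) {
            int b = ((unsigned)x[n] >> bit_index) & 1u;  // b_n(j)
            int a = (j == 0) ? -b : b;                   // a_n(j)
            partial += a * w[n];
        }
        acc = (acc << 1) + partial;           // shift-accumulate
    }
    return acc;
}

int main() {
    int x[] = {3, -2, 5};                     // fit in 8-bit two's complement
    int w[] = {1, 4, -1};
    printf("%d\n", inner_product_da(x, w, 3, 8));  // 3*1 + (-2)*4 + 5*(-1) = -10
    return 0;
}
```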
“…Due to the progressive scaling of silicon devices over the past several years, semiconductor memory has become inexpensive, fast, and power-efficient. As per the projections of the International Technology Roadmap for Semiconductors (ITRS) [21], embedded memories will continue to dominate system-on-chip designs; at present they account for more than 90% of total SoC content [22]. It is found that the transistor packing density of SRAM is not only high but also increasing much faster than that of logic devices [23].…”
Section: Introduction
confidence: 99%
“…In Reference 8, the authors implement several GPU applications, including a custom CFD code with unified, pinned, and normal host/device memory allocation modes. They evaluate and compare the memory usage and execution time of such workloads on edge computing Tegra system‐on‐chips (SoC) equipped with integrated GPUs using a shared memory architecture, and non‐SoC machines with discrete GPUs equipped with distinct VRAM.…”
Section: Contents Of the Special Issuementioning
confidence: 99%
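Whether a GPU falls in the first category (integrated, shared memory) or the second (discrete, distinct VRAM) can be checked at run time. A small sketch using the CUDA runtime's device-properties query; the printed labels are my own:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        // prop.integrated is non-zero for SoC-style GPUs (e.g., Tegra)
        // that share physical memory with the host; discrete boards
        // with their own VRAM report 0.
        printf("device %d (%s): %s, canMapHostMemory=%d\n",
               d, prop.name,
               prop.integrated ? "integrated (shared memory)"
                               : "discrete (distinct VRAM)",
               prop.canMapHostMemory);
    }
    return 0;
}
```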
“…In [13], the authors perform a similar analysis comparing the sync and async_alloc models, focusing on latency hiding and its effect on runtime performance using two GPUs. The authors in [4] make a comparison between the different CUDA communication models, but on a Tegra SoC-based system where host and device share the same physical memory. None of this work, however, looks at all models or at code generation for them.…”
Section: Related Work
confidence: 99%
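As a point of reference for the sync vs. async_alloc distinction mentioned above, here is a minimal sketch of stream-ordered allocation with the CUDA runtime (cudaMallocAsync/cudaFreeAsync, available since CUDA 11.2); the kernel and sizes are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void fill(float* p, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main() {
    const int n = 1 << 20;
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // async_alloc model: allocation, work, and free are all ordered on
    // the stream, so allocation latency can overlap other GPU work
    // instead of synchronizing the whole device the way cudaMalloc does.
    float* buf;
    cudaMallocAsync(&buf, n * sizeof(float), stream);
    fill<<<(n + 255) / 256, 256, 0, stream>>>(buf, n, 1.0f);
    cudaFreeAsync(buf, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}
```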