CPCIe: A compression-enabled PCIe core for energy and performance optimization

Zainol, Mohd Amiruddin Bin; Nunez-Yanez, Jose

doi:10.1109/norchip.2016.7792892

Cited by 3 publications

(2 citation statements)

References 8 publications


“…Finally, Occam's latency penalty for the initial inter-chip transfer has minimal impact on performance (Section III-E). The typical-case PCIe latency of 30µs per partition (PCIe latency varies from 10µs to 50µs [42]) results in slightly reducing the average 2.06x speedup to approximately 2.01x. (Subsequent transfers are hidden under computation due to We discuss next the energy penalty of inter-chip communication.…”

Section: A Analytical Results (mentioning)

confidence: 99%

“…Occam drastically cuts memory transfers (21x on average, in Table III) but incurs the extra energy of chip-to-chip transfers at partition boundaries. The net effect of these factors is 33% average reduction in energy because the energy cost/bit for DRAM and PCIe are similar (6pJ/bit [32], [42]). Finally, Layer Fusion's memory energy saving due to fewer transfers (Table III) are offset by its significant recomputation overhead induced by its sub-optimal tiles, resulting in a net energy saving of 12% on average.…”

Section: A Analytical Results (mentioning)

confidence: 99%


OCCAM: Optimal Data Reuse for Convolutional Neural Networks

Gondimalla, Liu, Vijaykumar et al. 2021

Preprint

Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. While CNNs are highly amenable to prefetching and multithreading to avoid memory latency issues, CNNs' large data (each layer's input, filters, and output) poses a memory bandwidth problem. While previous work captures only some of the enormous data reuse, full reuse implies that the initial input image and filters are read once from off-chip and the final output is written once off-chip without spilling the intermediate layers' data to off-chip. We propose Occam to capture full reuse via four contributions. First, we identify the necessary condition for full reuse. Second, we identify the dependence closure as the sufficient condition to capture full reuse using the least on-chip memory. Third, because the dependence closure is often too large to fit in on-chip memory, we propose a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip capacity. While tiling is well-known, our contribution is determining the optimal cross-layer tiles. Occam's partitions reside on different chips, forming a pipeline so that a partition's filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). Finally, because the optimal partitions may result in an unbalanced pipeline, we propose staggered asynchronous pipelines (STAP), which replicate the bottleneck stages to improve throughput by staggering the mini-batches across the replicas. Importantly, STAP achieves balanced pipelines without changing Occam's optimal partitioning. 
Our simulations show that, on average, Occam cuts off-chip transfers by 21x and achieves 2.06x and 1.36x better performance, and 33% and 24% better energy than the base case and Layer Fusion, respectively. Using an FPGA implementation, Occam performs 5.1x better, on average, than the base case.


Preserving Privacy of Neuromorphic Hardware From PCIe Congestion Side-Channel Attack

Das

2023

2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC)


Occam: Optimal Data Reuse for Convolutional Neural Networks

Gondimalla, Liu, Vijaykumar et al. 2022

ACM Trans. Archit. Code Optim.

Convolutional neural networks (CNNs) are emerging as powerful tools for image processing in important commercial applications. We focus on the important problem of improving the latency of image recognition. While CNNs are highly amenable to prefetching and multithreading to avoid memory latency issues, CNNs’ large data – each layer’s input, filters, and output – poses a memory bandwidth problem. While previous work captures only some of the enormous data reuse, full reuse implies that the initial input image and filters are read once from off-chip and the final output is written once off-chip without spilling the intermediate layers’ data to off-chip. We propose Occam to capture full reuse via four contributions. First, we identify the necessary conditions for full reuse. Second, we identify the dependence closure as the sufficient condition to capture full reuse using the least on-chip memory. Third, because the dependence closure is often too large to fit in on-chip memory, we propose a dynamic programming algorithm that optimally partitions a given CNN to guarantee the least off-chip traffic at the partition boundaries for a given on-chip capacity. While tiling is well-known, our contribution determines the optimal cross-layer tiles. Occam’s partitions reside on different chips, forming a pipeline so that a partition’s filters and dependence closure remain on-chip as different images pass through (i.e., each partition incurs off-chip traffic only for its inputs and outputs). Finally, because the optimal partitions may result in an unbalanced pipeline, we propose staggered asynchronous pipelines (STAPs) that replicate bottleneck stages to improve throughput by staggering mini-batches across replicas. Importantly, STAPs achieve balanced pipelines without changing Occam’s optimal partitioning. 
Our simulations show that, on average, Occam cuts off-chip transfers by 21× and achieves 2.04× and 1.21× better performance, and 33% better energy than the base case, respectively. Using a field-programmable gate array (FPGA) implementation, Occam performs 6.1× and 1.5× better, on average, than the base case and Layer Fusion, respectively.
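The abstract names a dynamic programming partitioner but does not give its cost model, so the following is only a minimal sketch of the general idea under simplifying assumptions, not the authors' algorithm. It assumes each layer has a single additive on-chip footprint and that a cut between two adjacent layers costs a fixed amount of off-chip traffic. The function name `partition_layers` and the inputs `footprint`, `boundary_traffic`, and `capacity` are hypothetical.

```python
from itertools import accumulate


def partition_layers(footprint, boundary_traffic, capacity):
    """Split layers 0..n-1 into contiguous partitions (illustrative only).

    footprint[i]: assumed on-chip bytes needed by layer i (hypothetical,
        treated as additive; a real dependence closure is not additive).
    boundary_traffic[i]: assumed bytes sent between chips if a cut is
        placed between layer i and layer i + 1 (length n - 1).
    capacity: on-chip bytes available to each partition.

    Returns (minimum total boundary traffic, sorted cut positions), where a
    cut position j means a new partition starts at layer j. Returns None if
    no feasible partitioning exists.
    """
    n = len(footprint)
    prefix = [0, *accumulate(footprint)]  # prefix[i] = sum of footprint[:i]
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: least traffic to place layers 0..i-1
    prev = [-1] * (n + 1)   # prev[i]: start of the last partition in best[i]
    best[0] = 0

    for i in range(1, n + 1):
        # Try every start j for the last partition, which holds layers j..i-1.
        for j in range(i - 1, -1, -1):
            if prefix[i] - prefix[j] > capacity:
                break  # footprints are non-negative, so smaller j cannot fit
            cut_cost = boundary_traffic[j - 1] if j > 0 else 0
            candidate = best[j] + cut_cost
            if candidate < best[i]:
                best[i], prev[i] = candidate, j

    if best[n] == inf:
        return None

    # Walk back through prev to recover where the partitions start.
    cuts = []
    i = n
    while i > 0:
        j = prev[i]
        if j > 0:
            cuts.append(j)
        i = j
    return best[n], sorted(cuts)


# Four layers, capacity 8: the cheapest feasible cut is before layer 2.
print(partition_layers([4, 3, 5, 2], [10, 2, 7], 8))  # (2, [2])
```

With these made-up numbers the sketch groups layers 0-1 and 2-3, paying only the cheapest boundary (2 units) instead of the costlier cuts at the other boundaries.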
