Proceedings of the 52nd Annual Design Automation Conference 2015
DOI: 10.1145/2744769.2744803
Bandwidth-efficient on-chip interconnect designs for GPGPUs

Abstract: Modern computational workloads require abundant thread level parallelism (TLP), necessitating highly-parallel, manycore accelerators such as General Purpose Graphics Processing Units (GPGPUs). GPGPUs place a heavy demand on the on-chip interconnect between the many cores and a few memory controllers (MCs). Thus, traffic is highly asymmetric, impacting on-chip resource utilization and system performance. Here, we analyze the communication demands of typical GPGPU applications, and propose efficient Network-on-C…

Cited by 64 publications (31 citation statements)
References 16 publications
“…We used GPGPU-Sim [19] to collect detailed application traces and simulated the network and memory traffic on a customized Noxim NoC simulator [21] that integrates our MACRO-NoC architecture model. We obtained traces for 11 CUDA benchmarks [8], [20], each with a different number of kernels and level of memory intensity. We compared our architecture with two prior works that also propose NoC architectures for GPGPUs: [10] and [11] (both are discussed in Section 2). The architecture from [10] is called Direct all-to-all (DA2), while that from [11] is called XY-YX.…”
Section: Methods
Confidence: 99%
“…We obtained traces for 11 CUDA benchmarks [8], [20], each with a different number of kernels and level of memory intensity. We compared our architecture with two prior works that also propose NoC architectures for GPGPUs: [10] and [11] (both are discussed in Section 2). The architecture from [10] is called Direct all-to-all (DA2), while that from [11] is called XY-YX. Figure 8 shows the MC placement we used in our 16-core and 64-core platforms, based on the recommendations on MC placement from [10] and [11].…”
Section: Methods
Confidence: 99%
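The XY-YX scheme named in the citation statements builds on plain dimension-order routing in a 2D mesh. As a minimal sketch (not a reproduction of the router in [11], whose details are not given here), the two orders can be contrasted by computing the hop path each one takes between a source and destination tile; schemes like XY-YX typically assign one order to one traffic class (e.g., requests) and the other order to the reverse class (e.g., replies) so the two directions load different mesh links:

```python
def _steps(a, b):
    """Unit-step coordinate values moving a toward b (empty if a == b)."""
    if a == b:
        return []
    d = 1 if b > a else -1
    return list(range(a + d, b + d, d))

def route_xy(src, dst):
    """XY dimension-order routing: resolve the X dimension fully, then Y."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    path += [(x, sy) for x in _steps(sx, dx)]   # travel along X first
    path += [(dx, y) for y in _steps(sy, dy)]   # then along Y
    return path

def route_yx(src, dst):
    """YX dimension-order routing: resolve the Y dimension fully, then X."""
    (sx, sy), (dx, dy) = src, dst
    path = [(sx, sy)]
    path += [(sx, y) for y in _steps(sy, dy)]   # travel along Y first
    path += [(x, dy) for x in _steps(sx, dx)]   # then along X
    return path

# Same hop count, different links traversed:
# route_xy((0, 0), (2, 1)) -> [(0, 0), (1, 0), (2, 0), (2, 1)]
# route_yx((0, 0), (2, 1)) -> [(0, 0), (0, 1), (1, 1), (2, 1)]
```

Both orders are deadlock-free on their own; using them together for disjoint traffic classes (as in request/reply separation) preserves that property while spreading the many-to-few core-to-MC traffic described in the abstract over more links.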