Future many-core processors are likely to concurrently execute a large number of diverse applications. How these applications are mapped to cores largely determines the interference between them in critical shared resources such as the network-on-chip. In this paper, we propose application-to-core mapping policies to reduce contention in network-on-chip and memory controller resources and hence improve overall system performance. The key ideas of our policies are to: 1) map network-latency-sensitive applications to node clusters in the network separate from those of network-bandwidth-intensive applications, such that the former make fast progress without heavy interference from the latter; and 2) map applications that benefit most from proximity to the memory controllers close to these resources. Contrary to the conventional wisdom of balancing network or memory load across the network-on-chip and memory controllers, we observe that it is also important to ensure that applications that are more sensitive to network latency experience little interference from applications that are network-bandwidth-intensive, even at the cost of load imbalance. We evaluate the proposed application-to-core mapping policies on a 60-core system with an 8x8 mesh NoC using a suite of 35 diverse applications. Averaged over 128 randomly generated multiprogrammed workloads, the final proposed policy improves system throughput by 16.7% in terms of weighted speedup over a state-of-the-art baseline, while also reducing system unfairness by 22.4% and average interconnect power consumption by 52.3%.
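To make the clustering idea concrete, here is a minimal Python sketch of how such a policy might classify and place applications. The use of MPKI as a proxy for network demand, the 10-MPKI cutoff, the half-and-half cluster split, and all names (App, map_apps_to_cores, the sample workloads) are illustrative assumptions, not the paper's exact policy.

```python
# Hypothetical sketch: separate latency-sensitive from bandwidth-intensive
# applications by network demand and give each class its own node clusters,
# placing the most network-intensive applications of each class on cores
# nearest a memory controller.

from dataclasses import dataclass

@dataclass
class App:
    name: str
    mpki: float  # cache misses per kilo-instruction: proxy for network demand

MPKI_CUTOFF = 10.0  # assumed threshold separating the two classes

def map_apps_to_cores(apps, clusters):
    """clusters: list of core-ID lists, each pre-sorted so earlier cores
    are closer to a memory controller. Returns {app name: core ID}."""
    # Classify by network intensity instead of balancing load evenly.
    sensitive = sorted((a for a in apps if a.mpki < MPKI_CUTOFF),
                       key=lambda a: a.mpki, reverse=True)
    intensive = sorted((a for a in apps if a.mpki >= MPKI_CUTOFF),
                       key=lambda a: a.mpki, reverse=True)
    # Reserve disjoint clusters per class so bandwidth-intensive traffic
    # stays out of the latency-sensitive clusters, even if load is unequal.
    half = max(1, len(clusters) // 2)
    sensitive_pool = [c for cl in clusters[:half] for c in cl]
    intensive_pool = [c for cl in clusters[half:] for c in cl]
    mapping = {}
    for group, pool in ((sensitive, sensitive_pool), (intensive, intensive_pool)):
        # Higher-MPKI applications get the cores closest to a controller.
        for app, core in zip(group, pool):
            mapping[app.name] = core
    return mapping

# Example: 4 applications on 2 clusters; cores 0 and 2 sit next to controllers.
apps = [App("mcf", 38.0), App("gcc", 2.1), App("lbm", 25.0), App("povray", 0.4)]
print(map_apps_to_cores(apps, [[0, 1], [2, 3]]))
# -> {'gcc': 0, 'povray': 1, 'mcf': 2, 'lbm': 3}
```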
SIMD execution units in GPUs are increasingly used for high-performance and energy-efficient acceleration of general-purpose applications. However, SIMD control-flow divergence can reduce execution efficiency in a class of GPGPU applications classified as divergent applications. Improving SIMD efficiency therefore has the potential to bring significant performance and energy benefits to a wide range of such data-parallel applications. Recently, the SIMD divergence problem has received increased attention, and several micro-architectural techniques have been proposed to address various aspects of it. However, these techniques are often quite complex and, therefore, unlikely candidates for practical implementation. In this paper, we propose two micro-architectural optimizations for GPGPU architectures, which utilize relatively simple execution-cycle compression techniques when certain groups of turned-off lanes exist in the instruction stream. We refer to these optimizations as basic cycle compression (BCC) and swizzled-cycle compression (SCC), respectively. We also outline the additional requirements for implementing these optimizations in the context of the studied GPGPU architecture. Our evaluations with divergent SIMD workloads from OpenCL (GPGPU) and OpenGL (graphics) applications show that BCC and SCC reduce execution cycles in divergent applications by as much as 42% (20% on average). For a subset of divergent workloads, execution time is reduced by an average of 7% for today's GPUs, or by 18% for future GPUs with a better-provisioned memory subsystem. The key contribution of our work is in simplifying the micro-architecture for delivering divergence optimizations while providing the bulk of the benefits of more complex approaches.
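As a rough illustration of how BCC and SCC save cycles, the following Python sketch models a SIMD instruction whose 32 logical lanes execute over several cycles on 8-lane execution units. The widths, the group size, the assumption of an ideal swizzle for SCC, and all function names here are illustrative assumptions; the paper's actual swizzle patterns and pipeline details are more constrained.

```python
# Hypothetical cycle-count model: a SIMD_WIDTH-wide instruction executes over
# SIMD_WIDTH // GROUP cycles on GROUP-lane physical ALUs. BCC skips any cycle
# whose lane group is entirely turned off; SCC additionally permutes
# (swizzles) lanes so active lanes pack into as few groups as possible.

SIMD_WIDTH = 32   # logical lanes per instruction (assumed)
GROUP = 8         # lanes executed per cycle on the physical ALUs (assumed)

def cycles_baseline(mask):
    # Without compression, every instruction occupies all groups' cycles.
    return SIMD_WIDTH // GROUP

def cycles_bcc(mask):
    """Basic cycle compression: skip a cycle when all GROUP lanes in that
    cycle's group are inactive. mask is a SIMD_WIDTH-bit active-lane mask."""
    cycles = 0
    for g in range(SIMD_WIDTH // GROUP):
        group_bits = (mask >> (g * GROUP)) & ((1 << GROUP) - 1)
        if group_bits:          # at least one active lane -> cycle must execute
            cycles += 1
    return cycles

def cycles_scc(mask):
    """Swizzled-cycle compression, idealized: assume the swizzle packs all
    active lanes contiguously, leaving ceil(active / GROUP) cycles."""
    active = bin(mask).count("1")
    return -(-active // GROUP)  # ceiling division

# Example: 5 scattered active lanes out of 32.
mask = 0b00000000_00010001_00000000_00000111
print(cycles_baseline(mask), cycles_bcc(mask), cycles_scc(mask))  # 4 2 1
```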