2011 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS)
DOI: 10.1109/samos.2011.6045469

Breaking the bandwidth wall in chip multiprocessors

Abstract: In throughput-aware CMPs like GPUs and DSPs, software-managed streaming memory systems are an effective way to tolerate high latencies. For example, the Cell/B.E. incorporates local memories, and data transfers to/from those memories are overlapped with computation using DMAs. In such designs, the latency of the memory system has little impact on performance; instead, memory bandwidth becomes critical. With the increase in the number of cores, conventional DRAMs no longer suffice to satisfy the bandwidth demand. Henc…
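As a rough illustration of the software-managed streaming style the abstract describes, the sketch below overlaps the fetch of the next data block with computation on the current one (double buffering). The dma_get_async/dma_wait helpers are hypothetical stand-ins, implemented here with a plain memcpy so the example is self-contained; on a machine like the Cell/B.E. they would map onto the local store's asynchronous DMA commands.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 1024  /* elements per streamed block (illustrative size) */

/* Hypothetical DMA helpers: real code would issue asynchronous transfers to
 * a DMA engine and wait on a completion tag; these synchronous stand-ins
 * keep the sketch compilable. */
static void dma_get_async(float *local, const float *global, size_t n, int tag)
{
    (void)tag;
    memcpy(local, global, n * sizeof *local);
}
static void dma_wait(int tag) { (void)tag; }

/* Stream n_chunks blocks through two local buffers, starting the transfer of
 * block i+1 before computing on block i. */
float stream_sum(const float *global, size_t n_chunks)
{
    static float buf[2][CHUNK];
    float sum = 0.0f;

    dma_get_async(buf[0], global, CHUNK, 0);              /* prefetch block 0 */
    for (size_t i = 0; i < n_chunks; ++i) {
        int cur = (int)(i & 1), nxt = cur ^ 1;
        if (i + 1 < n_chunks)                             /* fetch block i+1  */
            dma_get_async(buf[nxt], global + (i + 1) * CHUNK, CHUNK, nxt);
        dma_wait(cur);                                    /* block i resident */
        for (size_t j = 0; j < CHUNK; ++j)                /* compute on it    */
            sum += buf[cur][j];
    }
    return sum;
}
```

Once the transfers are truly asynchronous, each iteration costs roughly the maximum of transfer time and compute time, so the loop's throughput is set by memory bandwidth rather than memory latency, which is exactly the regime the abstract targets.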

Cited by 6 publications (9 citation statements); references 19 publications.
“…Stencil calculations perform global sweeps through data structures that are typically much larger than the available data caches. As a result, data from main memory often cannot be transferred fast enough to avoid stalling the computational units on modern microprocessors [74,18,12,70,66]. Reorganizing these computations to fit into the caches has principally focused on tiling optimizations that exploit locality by performing operations on cache-sized blocks of data in each processor before moving on to the next block [56].…”
Section: Stencil Computations (confidence: 99%)
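As a minimal sketch of the cache-blocking idea described in the statement above (names and tile sizes are assumptions, not taken from the cited works), the loop nest below sweeps a 2D 5-point stencil one tile at a time so each tile stays cache-resident while it is being updated:

```c
#include <stddef.h>

#define TI 64  /* tile dimensions chosen so a tile fits in cache (assumption) */
#define TJ 64

/* One sweep of a 2D 5-point stencil over an n x n row-major grid,
 * processed in TI x TJ blocks before moving on to the next block. */
void stencil2d_tiled(const double *in, double *out, size_t n)
{
    for (size_t ii = 1; ii + 1 < n; ii += TI)
        for (size_t jj = 1; jj + 1 < n; jj += TJ) {
            size_t i_end = ii + TI < n - 1 ? ii + TI : n - 1;
            size_t j_end = jj + TJ < n - 1 ? jj + TJ : n - 1;
            for (size_t i = ii; i < i_end; ++i)
                for (size_t j = jj; j < j_end; ++j)
                    out[i * n + j] = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                           + in[i * n + j - 1] + in[i * n + j + 1]);
        }
}
```

Spatial tiling of this kind helps when a few grid rows no longer fit in cache; the larger gains reported in the tiling literature come from additionally blocking across sweeps (time steps) so each block is reused several times before it is evicted.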
“…CMS has broad applicability because a wide variety of stencil-based kernels are memory bound [65,70,32,36]. Stencil-based kernels are also critical because they comprise the building blocks of applications ranging from image processing in consumer devices to the largest scale HPC applications such as climate modeling and fluid simulations.…”
Section: Operation Completion Time (confidence: 99%)
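A back-of-the-envelope arithmetic-intensity estimate (illustrative numbers, not taken from the cited papers) shows why such kernels end up memory bound: a 3D 7-point stencil performs on the order of 8 floating-point operations per grid point while, even with perfect reuse of neighbouring points from cache, it must stream at least one 8-byte double in from main memory and write one back, i.e. roughly 0.5 flop per byte. A chip able to sustain 100 Gflop/s would then need about 200 GB/s of DRAM bandwidth to stay busy, well beyond what conventional memory interfaces deliver, so performance is set by the memory system rather than by the arithmetic units.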