Rebalancing the core front-end through HPC code analysis

Milić, Uglješa; Carpenter, Paul; Rico, Alejandro; Ramírez, Alex

doi:10.1109/iiswc.2016.7581273

Cited by 1 publication (2 citation statements)

References 32 publications


“…Asymmetric processors have been proposed as a heterogeneous, single-ISA multicore design to reduce the execution time of a parallel application for a given hardware budget [4], [6], [18]-[20]. The large core (latency sensitive) would be used to execute serial bottleneck, while many small cores (throughput oriented) run parallel code.…”

Section: Related Work (mentioning)

confidence: 99%

“…Heavyweight cores support large instruction footprints and complex branch behavior with private instruction caches (I-cache) and sophisticated branch predictors. On the other hand, HPC applications have small(er) code footprint, long(er) basic blocks, and (more) predictable branches [6]. Moreover, all parallel threads in HPC applications execute the same code approximately at the same time.…”

Section: Introduction (mentioning)

confidence: 99%


Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications

Milić, Rico, Carpenter, et al. 2017

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Self Cite


High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMP) combine a few high-performance big cores for serial regions with many low-power lean cores for throughput computing. The low front-end requirements of HPC applications lead some designs, such as SMT and GPU cores, to share front-end structures, including the instruction cache (I-cache). However, little work exists that analyzes the benefit of sharing the I-cache among full cores, which seems compelling as a way to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the design parameters finds the sweet spot to be a wide interconnect to the shared I-cache plus a few line buffers, which together provide the bandwidth and latency required to sustain performance. The projections with McPAT and a rich set of HPC benchmarks show 11% area savings with a 5% energy reduction at no performance cost.

The research was supported by the European Union's 7th Framework Programme [FP7/2007-2013] under project Mont-Blanc (288777), the Ministry of Economy and Competitiveness of Spain (TIN2012-34557, TIN2015-65316-P, and BES-2013-063925), Generalitat de Catalunya (2014-SGR-1051 and 2014-SGR-1272), HiPEAC-3 Network of Excellence (ICT-287759), and the Severo Ochoa Program (SEV-2011-00067) of the Spanish Government.

Peer reviewed. Postprint (author's final draft).
