Improvements in semiconductor nanotechnology have continuously provided a growing number of faster and smaller transistors per chip. However, classical techniques for boosting performance, such as increasing the clock frequency and the amount of work performed at each clock cycle, can no longer deliver significant improvements due to energy constraints and wire-delay effects. As a consequence, designers' interest has shifted toward the implementation of systems with multiple cores per chip (Chip Multiprocessors, CMPs). CMP systems typically adopt a large last-level cache (LLC) shared among all cores, together with private L1 caches. Since the miss resolution time of the private caches depends on the response time of the LLC, which is wire-delay dominated, overall performance is affected by wire delay. NUCA (Non-Uniform Cache Architecture) caches have been proposed for single-core and multi-core systems as a mechanism for tolerating such wire-delay effects on overall performance.
In this paper, we introduce our design for S-NUCA and D-NUCA cache memory systems, and we present an analysis of an 8-CPU CMP system with two levels of cache, in which the L1 caches are private, while the L2 is a NUCA shared among all cores. We consider two different system topologies (the first with all eight CPUs connected to the NUCA on the same side, denoted 8p; the second with half of the CPUs on one side and the others on the opposite side, denoted 4+4p), and for all configurations we evaluate the effectiveness of both the static and dynamic mapping policies that have been proposed.
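To make the distinction between the two mapping policies concrete, the following is a minimal sketch; it is not taken from the paper and uses hypothetical parameters (a 16-bank L2 organized in bank sets). A static policy fixes the bank of a block by address interleaving, while a dynamic policy selects a bank set and lets the block migrate, one bank at a time, toward the bank closest to the requesting CPU.

    /*
     * Minimal sketch (hypothetical parameters, not the paper's design):
     * S-NUCA maps each block to a fixed bank by address interleaving,
     * while D-NUCA maps it to a bank set and allows gradual migration
     * toward the bank nearest the requesting CPU.
     */
    #include <stdint.h>

    #define NUM_BANKS      16  /* hypothetical number of L2 banks     */
    #define BANKS_PER_SET   4  /* hypothetical banks in each bank set */

    /* S-NUCA: the bank holding a block is a fixed function of its address. */
    static unsigned snuca_bank(uint64_t block_addr)
    {
        return (unsigned)(block_addr % NUM_BANKS);
    }

    /* D-NUCA: the address selects a bank set; the block may reside in
     * any bank of that set. */
    static unsigned dnuca_bank_set(uint64_t block_addr)
    {
        return (unsigned)(block_addr % (NUM_BANKS / BANKS_PER_SET));
    }

    /* D-NUCA gradual promotion: on a hit, move the block one bank closer
     * to the bank nearest the requesting CPU (banks indexed by distance). */
    static unsigned dnuca_promote(unsigned current_bank, unsigned nearest_bank)
    {
        if (current_bank > nearest_bank) return current_bank - 1;
        if (current_bank < nearest_bank) return current_bank + 1;
        return current_bank; /* already as close as possible */
    }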
Our results show that adopting a D-NUCA scheme with the 8p configuration is the best-performing solution among all the considered configurations, and that for the 4+4p configuration the D-NUCA outperforms the S-NUCA in most cases. We highlight that performance is tied to both the mapping strategy (static or dynamic) and the topology. We also observe that bandwidth occupancy depends on both the NUCA policy and the topology.