Abstract: Many-core architectures provide an efficient way of harnessing the increasing number of transistors available in modern fabrication processes. While they are similar to multi-node systems, they exhibit different communication latency and storage characteristics, providing new design opportunities that were previously not feasible. Traditional cache coherence protocols, although often used in many-core designs, have been developed in the context of multi-node systems. As such, they seldom take advantage of the …
“…The research works most closely related to CCM are victim or replication strategies [6,12,16,20,26,31,32], Cooperative Caching strategies [1,5,10,11,13,17,18], and to a lesser extent, hierarchical directory coherence [8,14,15,22,33,34].…”
Section: Related Work (mentioning)
confidence: 99%
“…CCM requires only marginal modifications of the network interface. Barrow [5] leverages data proximity by sending request messages to all neighbors, which complicates the coherence protocol. Acacio [1] and Hossain [18] also introduce the concept of using nearby data.…”
As the number of cores and the working sets of parallel workloads increase, shared L2 caches exhibit fewer misses than private L2 caches by making better use of the total available cache capacity, but they also induce higher overall L1 miss latencies because of the longer average distance between two nodes and the potential congestion at certain nodes. One of the main causes of the long L1 miss latencies is accesses to the home nodes of the directory. However, we have observed that there is a high probability that the target data of an L1 miss resides in the L1 cache of a neighbor node. In such cases, these long-distance accesses to the home nodes can potentially be avoided. To leverage this property, we organize the multi-core into clusters of 2 × 2 nodes and introduce the Cluster Cache Monitor (CCM). The CCM is a hardware structure in charge of detecting whether an L1 miss can be served by one of the cluster L1 caches, and two cluster-related states are added to the coherence protocol in order to avoid long-distance accesses to home nodes upon hits in the cluster L1 caches. We evaluate this approach on a 64-node multi-core using the SPLASH-2 and PARSEC benchmarks, and we find that the CCM can reduce execution time by 15% and energy by 14%, while saving 28% of the directory storage area compared to a standard multi-core with a shared L2. We also show that the CCM outperforms recent mechanisms such as ASR, DCC and RNUCA.
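To make the distance argument in this abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the 64 nodes form an 8 × 8 mesh with row-major numbering and that home nodes are spread uniformly over the chip; neither assumption is stated in the abstract. The helper names (`cluster_members`, `hops`, `average_hops_to_any_node`) are hypothetical. The sketch compares the worst-case hop count inside a 2 × 2 cluster with the average hop count to an arbitrary node.

```python
from itertools import product

MESH_DIM = 8      # assumption: 64 nodes arranged as an 8 x 8 mesh
CLUSTER_DIM = 2   # 2 x 2 clusters, as described in the abstract


def coords(node: int) -> tuple[int, int]:
    """(x, y) position of a node, assuming row-major numbering."""
    return node % MESH_DIM, node // MESH_DIM


def cluster_id(node: int) -> int:
    """Index of the 2 x 2 cluster that contains the node."""
    x, y = coords(node)
    clusters_per_row = MESH_DIM // CLUSTER_DIM
    return (y // CLUSTER_DIM) * clusters_per_row + (x // CLUSTER_DIM)


def cluster_members(node: int) -> list[int]:
    """All nodes (including the node itself) in the same cluster."""
    x, y = coords(node)
    x0 = x - x % CLUSTER_DIM
    y0 = y - y % CLUSTER_DIM
    return [
        (y0 + dy) * MESH_DIM + (x0 + dx)
        for dy in range(CLUSTER_DIM)
        for dx in range(CLUSTER_DIM)
    ]


def hops(a: int, b: int) -> int:
    """Manhattan distance between two nodes (minimal mesh routing)."""
    ax, ay = coords(a)
    bx, by = coords(b)
    return abs(ax - bx) + abs(ay - by)


def average_hops_to_any_node() -> float:
    """Mean distance from a requester to a uniformly chosen node.

    Stands in for the distance to a home node under the assumed
    uniform home-node distribution.
    """
    n = MESH_DIM * MESH_DIM
    total = sum(hops(a, b) for a, b in product(range(n), repeat=2))
    return total / (n * n)


def max_hops_within_cluster() -> int:
    """Largest distance between any node and a member of its cluster."""
    n = MESH_DIM * MESH_DIM
    return max(hops(a, b) for a in range(n) for b in cluster_members(a))


if __name__ == "__main__":
    print("members of the cluster containing node 27:", cluster_members(27))
    print("average hops to an arbitrary node:", average_hops_to_any_node())
    print("maximum hops inside a cluster:", max_hops_within_cluster())
```

Under these assumptions, a miss served inside the cluster travels at most two hops, while a request to an arbitrary home node travels several hops on average. This is the gap that the cluster-level lookup is meant to exploit.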
“…Cache coherence designs to exploit the proximity of data sharers have been proposed in [6,7]. Williams et.…”
Section: Related Work and Conclusion (mentioning)
confidence: 99%
“…al. [7] propose to add direct links in four directions of NoC routers to snoop sharers in direct neighbors. However, their scheme depends on specific application mapping to work and has more hardware overhead.…”
Section: Related Work and Conclusion (mentioning)
confidence: 99%
“…As a result, if the bus configurations are fixed as in [10] and [7], the effectiveness of the snooping will be compromised, since the number of possible sharers searched is not related to the bus length. In other words, even if we increase the length of the snooping buses, the number of sharers found may not increase accordingly.…”
Section: B1 Mapping Of Parallel Programs Onto a CMP Platform (mentioning)
On-chip many-core systems, evolving from prior multi-processor systems, are considered a promising solution to the problems of performance scalability and power consumption. The long communication distances between traditional multi-processors make directory-based cache coherence protocols a better solution than bus-based snooping protocols, even with the overhead of indirection. However, the much smaller distances between CMP cores enhance the reachability of buses, revitalizing the applicability of snooping protocols for cache-to-cache transfers. In this work, we propose a hybrid NoC design to provide optimized support for cache coherency. In our design, on-chip links can be dynamically configured as either point-to-point links between NoC nodes or short buses to facilitate localized snooping. By taking advantage of the best of both worlds, bus-based snooping coherency and NoC-based directory coherency, our approach brings both power and performance benefits.