Caching and other latency-tolerating techniques have been quite successful in maintaining high memory-system performance for general-purpose processors. However, TLB misses have become a serious bottleneck as working sets grow beyond the capacity of TLBs. This work presents one of the first attempts to hide TLB miss latency using preloading techniques. We present results for traditional next-page TLB miss preloading, an approach shown to eliminate some of the misses. A key contribution of this work, however, is a novel TLB miss prediction algorithm based on the concept of “recency”, and we show that it can predict over 55% of the TLB misses for the five commercial applications considered.
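A minimal sketch of how such a recency-based predictor can be organized (Python; the data structures and names below are illustrative assumptions, not the paper's implementation): pages are kept in a doubly linked recency stack ordered by last reference, and on a miss the stack neighbours of the missing page are preloaded, on the premise that the earlier reference order will repeat.

# Illustrative sketch of recency-based TLB preloading; names and
# data-structure choices are assumptions for this example.
class RecencyPreloader:
    def __init__(self):
        self.prev = {}   # page -> page referenced just before it
        self.next = {}   # page -> page referenced just after it
        self.top = None  # most recently referenced page

    def _unlink(self, page):
        p, n = self.prev.get(page), self.next.get(page)
        if p is not None:
            self.next[p] = n
        if n is not None:
            self.prev[n] = p
        if self.top == page:
            self.top = n

    def _push(self, page):
        self.prev[page], self.next[page] = None, self.top
        if self.top is not None:
            self.prev[self.top] = page
        self.top = page

    def on_tlb_miss(self, page):
        # Preload candidates: the recency-stack neighbours of the missing
        # page, i.e. the pages touched just before and just after it the
        # last time this part of the working set was traversed.
        candidates = [q for q in (self.prev.get(page), self.next.get(page))
                      if q is not None]
        self._unlink(page)
        self._push(page)
        return candidates  # translations to fetch alongside the demand miss

In a real system the two neighbour pointers could be stored alongside each page-table entry, so a miss would yield its preload candidates without any extra search; that placement is likewise an assumption of this sketch.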
In this paper we study the design and efficiency of compiler algorithms that remove ownership overhead in shared-memory multiprocessors with write-invalidate protocols. These algorithms detect loads followed by stores to the same address. Such loads are marked and constitute a hint to the cache to obtain an exclusive copy of the block. We consider three algorithms: the first focuses on load-store sequences within each basic block of code, and the other two analyse load-store sequences across basic blocks at the intra-procedural level. Since the dataflow analysis we adopt is a trivial variation of live-variable analysis, the algorithms are easily incorporated into a compiler. Through detailed simulations of a cache-coherent NUMA architecture using five scientific parallel benchmark programs, we find that the algorithms are capable of removing over 95% of the separate ownership acquisitions. Moreover, even the simplest algorithm is comparable in efficiency to previously proposed hardware-based adaptive cache-coherence protocols that attack the same problem.
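As an illustration of the simplest, per-basic-block variant, the following sketch (Python, with a deliberately simplified instruction representation; the function and names are ours, not the paper's) marks a load when a later store in the same block targets the same address expression:

# Minimal sketch of the per-basic-block pass: a load is marked
# "load-exclusive" when a later store in the same block writes the same
# address expression, so the cache can request ownership at the load.
def mark_exclusive_loads(block):
    """block: list of (op, addr) tuples with op in {'load', 'store'}.
    Returns the set of instruction indices whose loads should carry
    the exclusive-ownership hint."""
    pending = {}      # address expression -> index of most recent load
    exclusive = set()
    for i, (op, addr) in enumerate(block):
        if op == 'load':
            pending[addr] = i
        elif op == 'store' and addr in pending:
            exclusive.add(pending.pop(addr))
    return exclusive

# Example: the load of x is followed by a store to x, so it is marked.
print(mark_exclusive_loads([('load', 'x'), ('load', 'y'), ('store', 'x')]))
# -> {0}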
This paper proposes a new replacement algorithm to protect cache lines with potential future reuse from being evicted. In contrast to purely recency-based approaches such as LRU, our algorithm also uses the notion of frequency of access. Instead of evicting the least recently used block, our algorithm identifies, among a set of LRU blocks, the one that is also least frequently used (according to a heuristic) and chooses that as the victim. We have implemented this replacement algorithm in a detailed simulation model of a chip multiprocessor system driven by SPEC2000 benchmarks. We find that the new scheme improves performance for memory-intensive applications and, compared with other attempts, provides robust improvements across all benchmarks. We have also extended an earlier scheme proposed by Wong and Baer so that it is switched off when it does not improve performance. Our results show that this makes the scheme much more suitable for CMP configurations.
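A sketch of this victim-selection heuristic (Python; the associativity, the size k of the LRU candidate window, and the counter handling are illustrative assumptions, not the paper's exact configuration):

# Illustrative cache set: among the k least-recently-used blocks,
# evict the one with the lowest reference count.
from collections import OrderedDict

class RecencyFrequencySet:
    def __init__(self, ways=8, k=4):
        self.ways, self.k = ways, k
        self.blocks = OrderedDict()  # tag -> reference count, LRU first

    def access(self, tag):
        if tag in self.blocks:
            self.blocks[tag] += 1
            self.blocks.move_to_end(tag)   # move to MRU position
            return True                    # hit
        if len(self.blocks) >= self.ways:
            # Candidates: the k blocks at the LRU end of the recency order;
            # on a tie in counts, min() keeps the least recently used one.
            lru_end = list(self.blocks.items())[:self.k]
            victim = min(lru_end, key=lambda kv: kv[1])[0]
            del self.blocks[victim]
        self.blocks[tag] = 1
        return False                       # miss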
Two interesting variations of large-scale shared-memory machines that have recently emerged are cache-coherent nonuniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA). They both have distributed main memory and use directory-based cache coherence. Unlike CC-NUMA, however, COMA machines automatically migrate and replicate data at the main-memory level in cache-line-sized chunks. This paper compares the performance of these two classes of machines. We first present a qualitative model that shows that the relative performance is primarily determined by two factors: the relative magnitude of capacity misses versus coherence misses, and the granularity of data partitions in the application. We then present quantitative results using simulation studies for eight parallel applications (including all six applications from the SPLASH benchmark suite). We show that COMA's potential for performance improvement is limited to applications where data accesses by different processors are finely interleaved in memory space and, in addition, where capacity misses dominate over coherence misses. In other situations, for example where coherence misses dominate, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. Finally, we propose a new architectural alternative, called COMA-F, that combines the advantages of both CC-NUMA and COMA.
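One way to write down this qualitative model (the notation is ours, not the paper's): let $m_{\mathrm{cap}}$ and $m_{\mathrm{coh}}$ be the capacity and coherence miss rates and $L$ the corresponding per-miss latencies on architecture $a$. The memory stall time per reference is then

\[
T_a = m_{\mathrm{cap}}\, L_{\mathrm{cap}}^{a} + m_{\mathrm{coh}}\, L_{\mathrm{coh}}^{a},
\]

and since COMA lowers $L_{\mathrm{cap}}$ (capacity misses are largely satisfied in the local attraction memory) while raising $L_{\mathrm{coh}}$ (misses traverse the directory hierarchy), COMA comes out ahead roughly when

\[
m_{\mathrm{cap}}\left(L_{\mathrm{cap}}^{\mathrm{NUMA}} - L_{\mathrm{cap}}^{\mathrm{COMA}}\right) > m_{\mathrm{coh}}\left(L_{\mathrm{coh}}^{\mathrm{COMA}} - L_{\mathrm{coh}}^{\mathrm{NUMA}}\right),
\]

which is consistent with the conclusion above that capacity misses must dominate coherence misses for COMA to win.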
Parallel programs that use critical sections and are executed on a shared-memory multiprocessor with a write-invalidate protocol result in invalidation actions that could be eliminated. For this type of sharing, called migratory sharing, each processor typically causes a cache miss followed by an invalidation request which could be merged with the preceding cache-miss request. In this paper we propose an adaptive protocol that invokes this optimization dynamically for migratory blocks. For other blocks, the protocol works as an ordinary write-invalidate protocol. We show that the protocol is a simple extension to a write-invalidate protocol. Based on a program-driven simulation model of an architecture similar to the Stanford DASH, and a set of four benchmarks, we evaluate the potential performance improvements of the protocol. We find that it effectively eliminates most single invalidations, which improves performance by reducing the shared access penalty and the network traffic.
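A directory-side sketch of such adaptation (Python; the detection condition, classifying a block as migratory when a write comes from a processor other than the last writer while exactly two cached copies exist, follows common formulations of adaptive migratory protocols and is our assumption, not necessarily this paper's exact rule):

# Illustrative directory entry: once a block is classified migratory,
# a read miss returns an exclusive copy at once, merging the later
# invalidation (upgrade) request with the miss itself.
class DirectoryEntry:
    def __init__(self):
        self.sharers = set()
        self.last_writer = None
        self.migratory = False

    def read_miss(self, cpu):
        if self.migratory:
            self.sharers = {cpu}   # previous holder gives up its copy
            return 'exclusive'     # upgrade merged with the miss
        self.sharers.add(cpu)
        return 'shared'

    def write(self, cpu):
        # Heuristic (assumed): a read-modify-write hand-off between two
        # processors marks the block migratory.
        if (len(self.sharers) == 2 and cpu in self.sharers
                and self.last_writer is not None
                and self.last_writer != cpu):
            self.migratory = True
        self.sharers = {cpu}       # ordinary invalidation of other copies
        self.last_writer = cpu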