The performance impact of flexibility in the Stanford FLASH multiprocessor

Heinrich, Mark; Kuskin, Jeffrey S.; Ofelt, David; Heinlein, John; Baxter, Joel; Singh, Jaswinder Pal; Simoni, Richard; Gharachorloo, Kourosh; Nakahira, D.; Horowitz, Mark; Gupta, Anoop; Rosenblum, Mendel; Hennessy, John L.

doi:10.1145/195470.195569

Cited by 46 publications

(3 citation statements)

References 13 publications

“…DE-FG02-03ER25564 writing correct and efficient shared-memory programs without locks, semaphores, or condition variables. Stanford has produced significant research results on this topic [6][7][8] [9], and an ongoing collaboration between Stanford and USC/ISI has produced a prototype optimizing compiler for high-level TCC constructs.…”

Section: Transactional Coherence and Consistency (TCC)

mentioning

confidence: 99%

Exploring Shared Memory Protocols in FLASH

Horowitz, Kunz, Hall et al. 2007

Self Cite

The goal of this project was to improve the performance of large scientific and engineering applications through collaborative hardware and software mechanisms to manage the memory hierarchy of non-uniform memory access time (NUMA) shared-memory machines, as well as their component individual processors. In spite of the programming advantages of shared-memory platforms, obtaining good performance for large scientific and engineering applications on such machines can be challenging. Because communication between processors is managed implicitly by the hardware, rather than expressed by the programmer, application performance may suffer from unintended communication, that is, communication that the programmer did not consider when developing the application. In this project, we developed and evaluated a collection of hardware, compiler, language, and performance-monitoring tools to obtain high performance on scientific and engineering applications on NUMA platforms by managing communication through alternative coherence mechanisms. Alternative coherence mechanisms have often been discussed as a means for reducing unintended communication, although architectural implementations of such mechanisms are quite rare. This report describes an actual implementation of a set of coherence protocols that support coherent, non-coherent, and write-update accesses for a CC-NUMA shared-memory architecture, the Stanford FLASH machine. Such an approach has the advantage of using alternative coherence only where it is beneficial, and it also provides an evolutionary migration path for improving application performance. We present data on two computations, RandomAccess from the HPC Challenge benchmarks and a forward solver derived from LS-DYNA, showing the performance advantages of the alternative coherence mechanisms. For RandomAccess, the non-coherent and write-update versions can outperform the coherent version by factors of 5 and 2.5, respectively. In LS-DYNA, we obtain improvements of 18% on average using the non-coherent version. We also present data on the SpecOMP benchmarks, showing that the protocols have a modest overhead of less than 3% in applications where the alternative mechanisms are not needed. In addition to the selective coherence studies on the FLASH machine, in the last six months of this project ISI performed research on compiler technology for the transactional memory (TM) programming model being developed at Stanford. As part of this research, ISI developed a compiler that recognizes transactional memory "pragmas" and automatically generates parallel code for the TM programming model.

“…Hardware can thus directly initiate a DRAM access once the tag lookup indicates a miss. The latency of the software handler's replacement decision then overlaps the main memory access latency [11]. 1.…”

Section: Access Time Overhead

mentioning

confidence: 99%

A fully associative software-managed cache design

Hallnor, Reinhardt 2000

SIGARCH Comput. Archit. News

As DRAM access latencies approach a thousand instruction-execution times and on-chip caches grow to multiple megabytes, it is not clear that conventional cache structures continue to be appropriate. Two key features, full associativity and software management, have been used successfully in the virtual-memory domain to cope with disk access latencies. Future systems will need to employ similar techniques to deal with DRAM latencies. This paper presents a practical, fully associative, software-managed secondary cache system that provides performance competitive with or superior to traditional caches without OS or application involvement. We see this structure as the first step toward OS- and application-aware management of large on-chip caches. This paper has two primary contributions: a practical design for a fully associative memory structure, the indirect index cache (IIC), and a novel replacement algorithm, generational replacement, that is specifically designed to work with the IIC. We analyze the behavior of an IIC with generational replacement as a drop-in, transparent substitute for a conventional secondary cache. We achieve miss rate reductions from 8% to 85% relative to a 4-way associative LRU organization, matching or beating a (practically infeasible) fully associative true LRU cache. Incorporating these miss rates into a rudimentary timing model indicates that the IIC/generational replacement cache could be competitive with a conventional cache at today's DRAM latencies, and will outperform a conventional cache as these CPU-relative latencies grow.

“…Several recent commercial and research multiprocessor systems [18,15,22] have employed programmable coherence controllers to reduce design time and/or support multiple protocols. However, the flexibility and generality of a programmable controller leads to slower coherence protocol execution, which in turn increases controller occupancy and memory latency [11]. The extent to which this degrades application performance has been the subject of several detailed simulation studies [12,23,19].…”

Section: Programmable Coherence Controllers

mentioning

confidence: 99%

Analytic evaluation of shared-memory systems with ILP processors

Sorin, Pai, Adve et al.

Proceedings. 25th Annual International Symposium on Computer Architecture (Cat. No.98CB36235)

This paper develops and validates an analytical model for evaluating various types of architectural alternatives for shared-memory systems with processors that aggressively exploit instruction-level parallelism. Compared to simulation, the analytical model is many orders of magnitude faster to solve, yielding highly accurate system performance estimates in seconds. The model input parameters characterize the ability of an application to exploit instruction-level parallelism as well as the interaction between the application and the memory system architecture. A trace-driven simulation methodology is developed that allows these parameters to be generated over 100 times faster than with a detailed execution-driven simulator. Finally, this paper shows that the analytical model can be used to gain insights into application performance and to evaluate architectural design trade-offs.
