The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible -it can support a variety of different communication mechanisms -and simplifies the design and implementation.This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.
The FLASH multiprocessor efficiently integrates support for cache-coherent shared memory and high-performance message passing, while minimizing both hardware and software overhead. Each node in FLASH contains a microprocessor, a portion of the machine's global memory, a port to the interconnection network, an I/O interface, and a custom node controller called MAGIC. The MAGIC chip handles all communication both within the node and among nodes, using hardwired data paths for efficient data movement and a programmable processor optimized for executing protocol operations. The use of the protocol processor makes FLASH very flexible -it can support a variety of different communication mechanisms -and simplifies the design and implementation.This paper presents the architecture of FLASH and MAGIC, and discusses the base cache-coherence and message-passing protocols. Latency and occupancy numbers, which are derived from our system-level simulator and our Verilog code, are given for several common protocol operations. The paper also describes our software strategy and FLASH's current status.
A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%-12% slower than the idealized machine.
Current shared-memory multiprocessors are inherently vulnerable to faults: any significant hardware or system software fault causes the entire system to fail. Unless provisions are made to limit the impact of faults, users will perceive a decrease in reliability when they entrust their applications to larger machines. This paper shows that fault containment techniques can be effectively applied to scalable shared-memory multiprocessors to reduce the reliability problems created by increased machine size.The primary goal of our approach is to leave normal-mode performance unaffected. Rather than using expensive faulttolerance techniques to mask the effects of data and resource loss, our strategy is based on limiting the damage caused by faults to only a portion of the machine. After a hardware fault, we run a distributed recovery algorithm that allows normal operation to be resumed in the functioning parts of the machine.Our approach is implemented in the Stanford FLASH multiprocessor. Using a detailed hardware simulator, we have performed a number of fault injection experiments on a FLASH system running Hive, an operating system designed to support fault containment. The results we report validate our approach and show that in conjunction with an operating system like Hive, we can improve the reliability seen by unmodified applications without substantial performance cost. Simulation results suggest that our algorithms scale well for systems up to 128 processors. NodeControlle (MAGIC) x Memory Node FIGURE 2.1. Overview of the FLASH multiprocessor. FLASH consists of a set of nodes connected through a point-topoint interconnect. Each node contains a portion of the distributed main memory and a node controller that handles cache coherence and other communication within the node and with other nodes.
A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%-12% slower than the idealized machine.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.