Horizon is the name currently being used to refer to a shared-memory Multiple Instruction stream - Multiple Data stream (MIMD) computer architecture under study by independent groups at the Supercomputing Research Center and at Tera Computer Company. Its performance target is a sustained rate of 100 giga (10¹¹) Floating Point Operations Per Second (FLOPS). Horizon achieves this speed with a few hundred identical scalar processors. Each processor has a horizontal instruction set that allows the production of one or more floating point results per cycle without resorting to vector operations. Memory latency is hidden, assuming enough parallelism is available, by allowing processors to switch context on each machine cycle.

In this overview, the Horizon architecture is introduced and its performance is estimated. The processor instruction set and a simple programming example are given. Additional details on the processor architecture, interconnection network design, performance analyses, machine simulator, compiler development, and application studies can be found in companion papers.
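The stated target can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope calculation only; the processor count and results-per-cycle figures are assumed for illustration (the text says only "a few hundred" processors and "one or more" results per cycle), not published Horizon parameters.

```python
# Back-of-envelope check of the 100 GFLOPS sustained target.
# All specific figures below are illustrative assumptions.
TARGET_FLOPS = 100e9       # sustained target: 100 giga-FLOPS
NUM_PROCESSORS = 256       # "a few hundred" -- assumed value
RESULTS_PER_CYCLE = 2      # "one or more" per cycle -- assumed value

# Required per-processor rate, and the clock rate that would deliver it
# at the assumed number of results per cycle.
per_proc_flops = TARGET_FLOPS / NUM_PROCESSORS           # ~390 MFLOPS each
required_clock_hz = per_proc_flops / RESULTS_PER_CYCLE   # ~195 MHz

print(f"per-processor rate: {per_proc_flops / 1e6:.0f} MFLOPS")
print(f"implied clock:      {required_clock_hz / 1e6:.0f} MHz")
```

Under these assumptions each scalar processor must sustain a few hundred MFLOPS, which is why the horizontal instruction set's ability to produce multiple results per cycle matters.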
I. Design Philosophy

Shared-memory MIMD computers that can be utilized effectively are difficult to implement. The principal difficulty is the latency associated with memory access and its consequences for processor performance. If many processors are sharing memory, propagation delays in the interconnection network, through either logic circuitry or wiring, will limit the minimum latency attainable. There are at least two ways of lowering the effective latency, each of which should be employed to approach a minimum: latency reduction and latency hiding. Latency reduction is accomplished by arranging a processor's memory accesses so that most of them are to locations that are both spatially and temporally nearby. Caches, such as those used in Cedar and the IBM RP3 [1, 2], are a very popular and effective device for latency reduction. Latency hiding is brought about by introducing virtual processors and additional parallelism. While a virtual processor waits for a memory request, the physical processor switches to another task and continues to compute, as in HEP [3]. Memory requests en route from a processor to a memory, or vice versa, in a shared-memory system will be referred to as messages.

The time needed to switch between tasks is important to the programmer because it determines the maximum message rate that the system will support. If a decomposition of a problem into parallel parts results in too few instructions executed per message sent, then the performance of each physical processor in the system will be limited by the peak message rate. In such cases, the system overhead of a virtual processor implementation is too great for the proposed problem decomposition, and another decomposition should be sought that requires less virtual processor switching. In other words, systems with lightweight (easily switched) virtual processors are more generally applicable than are systems with heavyweight (slow to switch) virtual processors.
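The latency-hiding argument above can be quantified with Little's law: if each virtual processor keeps one memory message outstanding, the physical processor stays busy only when the number of ready tasks is at least the message latency divided by the work done per message. The sketch below illustrates this relationship; the latency and instructions-per-message figures are assumed for illustration, not Horizon specifications.

```python
from math import ceil

def virtual_processors_needed(latency_cycles, instructions_per_message):
    """Minimum number of concurrent tasks so the physical processor never stalls.

    Each task does `instructions_per_message` cycles of work and then waits
    `latency_cycles` for a memory message to return.  By Little's law, the
    required concurrency is latency / work-per-message, rounded up.
    """
    return ceil(latency_cycles / instructions_per_message)

# Assumed example: 128-cycle round-trip memory latency, one message issued
# per 4 instructions of work.
print(virtual_processors_needed(128, 4))
```

This also shows the other side of the trade-off the text describes: if a decomposition issues messages too frequently (small `instructions_per_message`), the required task count grows and performance becomes bounded by the peak message rate rather than by the processors.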