Modern computer systems extract parallelism from problems at two extremes of granularity: instruction-level parallelism (ILP) and coarse-thread parallelism. VLIW and superscalar processors exploit ILP with a grain size of a single instruction, while multiprocessors extract parallelism from coarse threads with a granularity of many thousands of instructions.

The parallelism available at these two extremes is limited. The ILP in applications is restricted by control flow and data dependencies [17], and the hardware in superscalar designs is not scalable: both the instruction scheduling logic and the register file of a superscalar grow quadratically as the number of execution units is increased. For multicomputers, coarse-thread parallelism is limited at small problem sizes and in many applications.
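The quadratic-growth claim can be sketched with a standard first-order model (an illustrative approximation; the symbols w, p_r, and p_w are not from this paper). Each of w execution units needs roughly two read ports and one write port into a shared register file, and the area of a register bit cell grows with the square of its port count:

\[
A_{\text{regfile}} \;\propto\; (p_r + p_w)^2 \;\approx\; (2w + w)^2 \;=\; 9w^2 .
\]

Similarly, if the issue window is scaled in proportion to w, each of its O(w) entries must compare its source tags against the O(w) result tags produced per cycle, so the wakeup/scheduling logic also grows as O(w^2).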
This paper describes and evaluates the hardware mechanisms implemented in the MIT Multi-ALU Processor (MAP chip) for extracting fine-thread parallelism.¹ Fine-threads close the parallelism gap between the single-instruction granularity of ILP and the thousand-instruction granularity of coarse threads by extracting parallelism with a granularity of 50-1000 instructions. This parallelism is orthogonal and complementary to coarse-thread parallelism and ILP. Programs can be accelerated using coarse threads to extract parallelism from outer loops and large co-routines, fine-threads to extract parallelism from inner loops and small subcomputations, and ILP to extract parallelism from subexpressions. Because they extract parallelism from different portions of a program, coarse-threads, fine-threads, and ILP work synergistically to provide multiplicative speedup.

These three modes are also well matched to the architecture of modern multiprocessors. ILP is well suited to extracting parallelism across the execution units of a single processor. Fine-threads are appropriate for execution across multiple processors at a single node of a parallel computer, where interaction latencies are on the order of a few cycles. Coarse-threads are appropriate for execution on different nodes of a multiprocessor, where interaction latencies are inherently hundreds of cycles.

Low-overhead mechanisms for communication and synchronization are required to exploit fine-grain thread-level parallelism. The cost to initiate a task, pass it arguments, synchronize with its completion, and return results must be small compared to the work accomplished by the task. Such inter-thread interaction requires hundreds of cycles on conventional multiprocessors, and thousands of cycles on multicomputers.

¹The research described in this paper was supported by the Defense Advanced Research Projects Agency and monitored by the Air Force Electronic Systems Division under contract F19628-92-C-0045.
Because of these high overheads, most parallel applications use only coarse threads, with many thousands of instructions between interactions.

The Multi-ALU Processor (MAP) chip provides three on-chip processors and methods for quickly communicating and synchronizing among ...