Because of the increasing density of VLSI integrated circuits, most of the chip area of modern computers is now occupied by memory rather than by processing resources. The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by these constraints of modern semiconductor technology and by the demands of programming systems, such as faster execution of fixed-size problems and easier programmability of parallel computers.

Advances in VLSI technology have resulted in computers with chip area dominated by memory and not by processing resources. The normalized area (in λ²) of a VLSI chip (footnote 1) is increasing by 50% per year, while gate speed and communication bandwidth are increasing by 20% per year [14]. As a result, a 64-bit processor with a pipelined FPU (400Mλ², footnote 2) is only 8% of a 5Gλ² 1996 0.35 μm chip. In a system with 256 MBytes of DRAM, the processor accounts for 0.13% of the silicon area in the system. The memory system, cache, TLB, controllers, and DRAM account for most of the remaining area. Technology scaling has made the memory, rather than the processor, the most area-consuming resource in a computer system.

To address this imbalance, the M-Machine increases the fraction of chip area devoted to the processor, making better use of the critical memory resources. An M-Machine multi-ALU processor (MAP) chip contains four 64-bit three-issue clusters that comprise 32% of the 5Gλ² chip and 11% of an 8 MByte (six-chip) node. The multiple execution clusters will provide better peak performance than a single cluster and a large on-chip cache occupying the same chip area. The high ratio of arithmetic bandwidth to memory bandwidth (12 operations/word) allows the MAP to saturate the costly DRAM bandwidth even on code with high cache-hit ratios. A 32-node M-Machine system with 256 MBytes of memory has 128 times the peak performance of a 1996 uniprocessor with the same memory capacity at 1.5 times the area, an 85:1 improvement in peak performance/area. Even at a small fraction of this peak performance, such a machine allows the costly, fixed-size memory to handle more problems per unit time, resulting in more cost-effective computing.

The M-Machine is intended to extract more parallelism from problems of a fixed size, rather than requiring enormous problems to achieve peak performance. To do this, nodes are designed to manage parallelism at a variety of granularities, from the instruction level to the process level.
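The area and performance figures quoted above can be checked with a few lines of arithmetic. The following Python sketch is illustrative only: the variable and function names are hypothetical, and it rests on two assumptions that the text does not state explicitly, namely that each cluster occupies roughly the same 400Mλ² as the reference 64-bit processor, and that each cluster matches the peak performance of the 1996 uniprocessor.

```python
# Figures taken from the text; areas are in units of lambda^2 (normalized area).
PROCESSOR_AREA = 400e6      # 64-bit processor with a pipelined FPU
CHIP_AREA = 5e9             # 1996 0.35 micron chip
CLUSTERS_PER_MAP = 4        # clusters on one MAP chip
ISSUE_WIDTH = 3             # operations issued per cluster per cycle
NODES = 32                  # nodes in the example system
NODE_MEMORY_MBYTES = 8      # memory per (six-chip) node
SYSTEM_AREA_RATIO = 1.5     # M-Machine area relative to the uniprocessor system


def area_fraction(part_area, whole_area):
    """Return the fraction of whole_area occupied by part_area."""
    return part_area / whole_area


# One processor on the chip.
processor_fraction = area_fraction(PROCESSOR_AREA, CHIP_AREA)

# ASSUMPTION: each cluster is about as large as the reference processor.
cluster_fraction = area_fraction(CLUSTERS_PER_MAP * PROCESSOR_AREA, CHIP_AREA)

# Function units per node and total system memory.
function_units = CLUSTERS_PER_MAP * ISSUE_WIDTH
system_memory_mbytes = NODES * NODE_MEMORY_MBYTES

# ASSUMPTION: each cluster matches the uniprocessor's peak performance,
# so the peak speedup equals the total number of clusters.
peak_speedup = NODES * CLUSTERS_PER_MAP
perf_per_area_gain = peak_speedup / SYSTEM_AREA_RATIO

print(f"processor share of chip: {processor_fraction:.0%}")
print(f"cluster share of chip:   {cluster_fraction:.0%}")
print(f"function units per node: {function_units}")
print(f"system memory (MBytes):  {system_memory_mbytes}")
print(f"peak speedup:            {peak_speedup}x")
print(f"peak performance/area:   {perf_per_area_gain:.0f}:1")
```

Under these assumptions the sketch reproduces the 8% and 32% area fractions, the 12 function units per node, the 256 MBytes of total memory, the 128-fold peak speedup, and the roughly 85:1 gain in peak performance per unit area.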
The 12 function units in a single M-Machine node are controlled using a form of Processor Coupling [18] to exploit instruction-level parallelism by executing 12 operations from the same thread, or to exploit thread-level parallelism by executing operations from up to six different threads. The fast internode communication allows collaborating threads to reside on different nodes.

The M-Machine also addresses the demand for easier programmability by providing an incremental path for increasing parallelism and performance. An unmodified sequential program can

2 Area was determ...