We are exploring a 3D processor-memory stack for use with the Message Passing Interface (MPI).1 Communication among processors in huge servers such as the IBM BlueGene/L or the NEC Earth Simulator wastes several thousand cycles. Most of these wasted cycles come not from the communication links among processors across the system, but from handling the message packets. A processor that could handle this message packing and communication at a much faster rate, on the order of 16 GHz, could significantly increase this task's efficiency and thus increase the utilization of such supercomputers, currently a very low 1%. However, at such high clock rates, the memory wall would become a significant problem. These processor speeds could be achieved in RISC processors using silicon-germanium (SiGe) heterojunction bipolar transistor (HBT) BiCMOS. Processors have been built using BiCMOS technology in the past.2 The memory architecture would require modification to suit such an application.
Tackling this problem requires innovative technologies, such as 3D memories, which alleviate some problems with long on-chip interconnects. The importance of interconnection wires to circuit performance is increasing, because they do not scale like the devices on a chip. The need for shorter interconnection delays suggests shorter interconnection wires, and shorter interconnections are more likely in 3D architectures than in equivalent 2D systems.3,4 Industry has already carried out several exploratory studies in building 3D circuits, including memories and FPGAs (such as the FaStack process, http://www.tezzaron.com). Others have developed the different wafer-bonding techniques necessary for 3D integration. The primary advantage of 3D circuits (lower interconnection delay) is also under study in the context of future microprocessors.5 This article explores the advantages of 3D in a processor-memory stack system.
We conducted simulations using simple tools such as Dinero IV and the Cache Access and Cycle Time Information (Cacti) tool to evaluate the performance of various memory architectures.
Conventional 2D architectures have the processor and cache in the same plane. The L0 and L1 caches are on the same die as the processor, whereas the L2 cache can be on a separate die. The 2D layout imposes physical constraints on the separation of the processor from the cache. The interconnection wires that connect the processor and cache are long, especially when L2 is on a separate die, so multiple clock cycles pass before data moves from one end to the other. Moving to a 3D layout solves some of these problems. The processor we considered is a simple, single-issue RISC processor built using a 7HP SiGe HBT BiCMOS technology featuring minimum-size bipolar transistors with a unity-gain cutoff frequency (f_T) of 120 GHz.6
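To make the cycle-count concern concrete, the following minimal sketch converts a fixed processor-to-cache delay into clock cycles. Only the 16-GHz clock rate comes from the text above; the function names, the 2-GHz comparison clock, and the 1-ns delay are hypothetical values chosen purely for illustration, not measurements from this study.

```python
import math


def clock_period_ps(clock_ghz: float) -> float:
    """Clock period in picoseconds for a clock frequency given in GHz."""
    return 1000.0 / clock_ghz


def cycles_for_delay(delay_ns: float, clock_ghz: float) -> int:
    """Whole clock cycles that elapse while a signal takes delay_ns to arrive.

    delay [ns] * frequency [GHz] gives cycles; the result is rounded up
    because data can only be consumed on a clock edge. The small epsilon
    guards against floating-point noise pushing an exact result upward.
    """
    return math.ceil(delay_ns * clock_ghz - 1e-9)


if __name__ == "__main__":
    # Hypothetical 1-ns processor-to-cache delay (assumed, not from the article).
    delay_ns = 1.0
    for clock_ghz in (2.0, 16.0):  # 2 GHz is an assumed comparison point
        print(
            f"{clock_ghz:5.1f} GHz: period = {clock_period_ps(clock_ghz):6.1f} ps, "
            f"{delay_ns} ns delay = {cycles_for_delay(delay_ns, clock_ghz)} cycles"
        )
```

Under these assumptions, the same physical delay costs eight times as many cycles at 16 GHz as at 2 GHz, which is why shortening the processor-to-cache wires matters more as the clock rate rises.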