Scaling to ExaFLOPS computing, roughly 100 times faster than the present Fujitsu K supercomputer, presents well-known challenges, among them power dissipation, memory capacity and access bandwidth, data locality, and fault tolerance. The optimum Amdahl speed-up strategy is multi-faceted, with greater memory bandwidth and lower access latency generally recognized as areas to improve. To this end, an evolutionary compute-node architecture is considered, based on a multichip interposer platform and a millimeter-wave memory interface. The interposer serves as the compute node's physical platform and wiring-distribution layer, connecting the chip multiprocessor (CMP) and on-interposer memory to an organic board. For example, the interposer may be composed of glass to reduce through-via parasitics and may support one multi-GFLOPS CMP with sufficient on-interposer DRAM for balanced operation. The memory interface consists of dense arrays of millimeter waveguides with integrated mm-wave transceivers and should support 40 Gb/s per channel for an aggregate throughput of 1 TB/s, with an estimated latency of 10-15 clock cycles. This paper examines channel impediments, design, and construction. Data transmission on a 72 GHz carrier with 12 Gb/s OOK modulation will be presented at the conference, if available.
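As a back-of-envelope check on the figures above, the implied number of waveguide channels follows directly from the stated per-channel rate and aggregate target. The 40 Gb/s and 1 TB/s values are from the abstract; the channel count is derived here, not stated in the paper, so it should be read as an estimate assuming all channels run at full rate:

```python
# Sizing sketch for the mm-wave memory interface (derived, not from the paper):
# aggregate target and per-channel rate are taken from the abstract.

AGG_BYTES_PER_S = 1e12        # 1 TB/s aggregate memory throughput target
CHANNEL_BITS_PER_S = 40e9     # 40 Gb/s per mm-wave channel (OOK link)

agg_bits_per_s = AGG_BYTES_PER_S * 8
channels = agg_bits_per_s / CHANNEL_BITS_PER_S
print(f"mm-wave channels needed: {channels:.0f}")   # -> 200
```

At full rate this works out to an array of about 200 parallel channels, consistent with the paper's description of "dense arrays" of waveguides on the interposer.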
I. Introduction

Placing many compute cores on a single die has many advantages, among them multithreading and reduced power dissipation per unit area, the latter due in part to increasing the die size to accommodate a larger shared cache, lower clock rates, and a lower density of global interconnects. On the down side, within a CMP the cores contend for on-die resources such as the shared cache, the inter-core bus, the router I/O interface, and one or more interfaces to main memory. The last of these is significant under frequent cache misses, since a reload carries a latency penalty of 180 to 200 clock cycles and can significantly degrade CMP performance [1]. The interface to main memory is based on legacy copper-bus and legacy FR-4 polymer board technology; there is no alternative at this time. Because the reach of a high-performance memory interface is about 7 cm [2], the volume of main memory that can be packed within this range, outside the CMP cooling package, is limited, leading to a memory-starved processor and lower performance. To prevent this, at least 1 Byte of memory should be accessible for each (64-bit) FLOP per second, with the exact ratio depending on the available cache and the type of computation. Lower reload latencies are also beneficial. To achieve these operating conditions, a different compute-node architecture and main-memory interface is proposed.

The basic compute-node architecture of all modern computers is based on localized logic and arithmetic functional units supported by multiple data-storage units having various intrinsic capacities and latencies. Data and operating instructions are transported from storage ...