Recent high performance IC design has been dominated by power density constraints. 3D integration increases device density even further, and these devices will not be usable without viable strategies to reduce power consumption. This paper proposes the use of near-threshold computing (NTC) to address this issue in a stacked 3D system. In NTC, cores are operated near the threshold voltage (~200mV above Vth) to optimally balance power and performance [1]. In Centip3De, we operate cores at 650mV, as opposed to the wear-out limited supply voltage of 1.5V. This improves measured energy efficiency by 5.1×. The dramatically lower power consumption of NTC makes it an attractive match for 3D design, which has limited power dissipation capabilities, but also has improved innate power and performance compared to 2D design.Due to higher leakage current in SRAMs compared to logic, memories reach their optimal energy/delay trade-off at higher voltages than cores: 870mV for SRAM and 670mV for logic in 130nm technology. Hence, SRAMs ideally operate at a higher voltage than cores, improving their speed. Centip3De uses this unique cache/core performance inversion by connecting four cores to each cache, where each cache operates at 4× the core frequency and communicates with the cores in a round-robin fashion. This configuration has the added advantage of automatically resolved coherence within the cluster which reduces coherency traffic and overhead.To address Amdahl's law, Centip3De allows some cores in a cluster to be boosted by 2, 4 or 8× in frequency by ramping them to a higher voltage while disabling remaining cores in the cluster to offset the higher power consumption. Disabling the non-boosted cores opens up more of the cache to the boosted core, providing it with additional memory performance. In this way, Centip3De can be configured to maximize single-threaded performance, throughput, or a mixture of both, depending on workload. The fabricated Centip3De system consists of two stacked dies with 64 ARM M3 near-threshold cores that make up 16 four-core clusters, each connected to a 4-way 1kB instruction cache and a 4-way 8kB data cache. The caches communicate over a 3D bus that connects them to DRAM controllers that form the backing store for the caches. Centip3De is designed to be expandable to 4 layers of cores/caches with 2-3 layers of stacked DRAM. This paper provides results for a two-layer system (referred to as the fabricated system), but for completeness we also describe the complete design that will consist of 128 cores, 4-layer logic + 3-layer DRAM (referred to as the expanded system). Final assembly of the expanded system is anticipated at a later date. Figure 10.7.1 shows the floorplan for a cluster, which separates caches and cores into adjacent layers. The four cores communicate with their adjacent cache through a face-to-face (F2F) 3D interface, which reduces routing resource requirements by providing 331 interface connections in the middle of each core. The F2F interconnects have a pitch of 5μm and a loading ...
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.