Its large required size and its tolerance to latency and to variations in memory access time make L2 memory a suitable candidate for 3-D integration. In this paper, we present a synthesizable 3-D-stackable L2 memory IP component that can be attached to a cluster-based multicore platform through its network-on-chip interfaces, offering high-bandwidth memory access with low average latency. Our design implements a scalable 3-D-nonuniform memory access (NUMA) architecture based on low-latency logarithmic interconnects, which allows stacking of multiple identical memory dies (MDs), supports multiple outstanding transactions, and achieves high clock frequencies due to its highly pipelined nature. We implemented our design in STMicroelectronics CMOS-28-nm low-power technology and obtained a clock frequency of 500 MHz (limited by the access time of the memory arrays, whereas the logic components can operate at up to 1 GHz), with support for up to eight stacked dies (4 MB) and a memory density loss of 9%. Benchmark simulation results demonstrate that adding 3-D-NUMA to a multicluster system can lead to an average performance boost of 34%. Furthermore, experiments and estimations confirm that 3-D-NUMA is energy and power efficient (38% power reduction due to an architectural clock-gating scheme), temperature friendly (over 40°C temperature reduction), and has unique features suitable for low-cost manufacturing (2.3× cost reduction due to identical MD layouts). Finally, a 22% yield improvement is achievable in 3-D-NUMA compared with its 2-D counterparts, using state-of-the-art through-silicon-via technologies.

Index Terms: 3-D integration, nonuniform memory access (NUMA), physical implementation, tightly coupled data memory.