High-Performance DRAMs in Workstation Environments

Vinodh Cuppu, Student Member, IEEE, Bruce Jacob, Member, IEEE, Brian Davis, Member, IEEE, Trevor Mudge, Fellow, IEEE

Abstract - This paper presents a simulation-based performance study of several of the new high-performance DRAM architectures, each evaluated in a small system organization. These small-system organizations correspond to workstation-class computers ...

1 INTRODUCTION

In response to the growing gap between memory access time and processor speed, DRAM manufacturers have created several new DRAM architectures. This paper presents a simulation-based performance study of a representative group, evaluating each in terms of its effect on total execution time. While there are a number of academic proposals for new DRAM designs, space limits us to covering only existing commercial architectures. To obtain accurate memory-request timing for an aggressive out-of-order processor, we integrate our code into the SimpleScalar tool set [4].

This paper presents a baseline study of small-system DRAM organizations: systems with only a handful of DRAM chips (0.1-1 GB). We do not consider large-system DRAM organizations with many gigabytes of highly interleaved storage. We also study a set of benchmarks appropriate for such systems: user-class applications such as compilers and small databases, rather than server-class applications such as transaction-processing systems. The study asks and answers the following questions:

• What is the effect of improvements in DRAM technology on the memory latency and bandwidth problems? Contemporary techniques for improving processor performance and tolerating memory latency are exacerbating the memory bandwidth problem [5]. Our results show that current DRAM architectures are attacking exactly this problem: the most recent technologies (SDRAM, ESDRAM, DDR, and Rambus) have reduced the stall time due to limited bandwidth by a factor of three compared to earlier DRAM architectures. However, the memory-latency component of overhead has not improved.

• Where is time spent in the primary memory system (the memory system beyond the cache hierarchy, but not including secondary [disk] or tertiary [backup] storage)? What is the performance benefit of exploiting the page mode of contemporary DRAMs? For the newer DRAM designs, the time to extract the required data from the sense amps/row caches for transmission on the memory bus is the largest component in the average access time, though page mode allows this to be overlapped with column access and with the time to transmit the data over the memory bus; a timing sketch follows this list.

• How much locality is there in the address stream that reaches the primary memory system? The stream of addresses that miss the L2 cache contains a significant amount of locality, as measured by the hit rates in the DRAM row buffers. The hit rates for the applications studied range from 2% to 97%, with a mean hit rate of 40% for a 1 MB L2 cache. (This does not include hits to the row buffers when making multiple DRAM requests to read one cache line.) A hit-rate sketch follows the timing sketch below.
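To make the timing decomposition in the second question concrete, the following is a minimal sketch, not the simulator used in this study: it models one DRAM access as a row activation, a column access (extracting data from the sense amps), and a bus transfer, and shows how an open-page hit skips the activation while back-to-back bursts overlap column access with the preceding transfer. All timing constants and function names here are hypothetical placeholders, not parameters from this paper.

    #include <stdio.h>

    /* Hypothetical DRAM timings in ns -- illustrative placeholders,
     * not values used in this paper's simulations. */
    #define T_ROW_ACCESS 30.0  /* activate a row into the sense amps */
    #define T_COL_ACCESS 20.0  /* extract a column from the sense amps/row cache */
    #define T_TRANSFER   40.0  /* transmit one burst over the memory bus */

    /* Latency of a single access under an open-page policy:
     * a row-buffer hit skips the row-activation step. */
    static double access_latency(int row_hit)
    {
        return (row_hit ? 0.0 : T_ROW_ACCESS) + T_COL_ACCESS + T_TRANSFER;
    }

    /* n back-to-back bursts to an already-open row, with the column
     * access of burst i+1 overlapped with the bus transfer of burst i:
     * only the first burst pays the column-access latency in full. */
    static double pipelined_bursts(int n)
    {
        return T_COL_ACCESS + n * T_TRANSFER;
    }

    int main(void)
    {
        printf("row miss, 1 burst: %.0f ns\n", access_latency(0));
        printf("row hit,  1 burst: %.0f ns\n", access_latency(1));
        printf("row hit, 4 bursts, overlapped: %.0f ns\n", pipelined_bursts(4));
        printf("row hit, 4 bursts, serialized: %.0f ns\n",
               4.0 * (T_COL_ACCESS + T_TRANSFER));
        return 0;
    }

Under these placeholder numbers the overlapped case saves (n - 1) x T_COL_ACCESS, which is the sense of the bullet above: extraction from the sense amps dominates a single access, but page mode hides it across a burst.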
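The row-buffer hit rates quoted for the third question can be measured with a simple open-page model: map each L2-miss address to a (bank, row) pair and count a hit when the access touches the row already held in that bank's sense amps. The sketch below is an illustration under assumed parameters, with a made-up geometry (row size, bank count, interleaving) rather than the configurations simulated in this paper; it counts one access per L2 miss, consistent with the note above that intra-cache-line bursts are excluded.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical geometry -- placeholders, not the paper's configurations. */
    #define ROW_BYTES 2048u   /* bytes held by one open row (page) */
    #define NUM_BANKS 4u      /* independent banks, one row buffer each */

    /* Row-buffer hit rate over a trace of L2-miss addresses, assuming
     * an open-page policy: each bank keeps its last row in the sense amps. */
    static double row_buffer_hit_rate(const uint32_t *addr, size_t n)
    {
        uint32_t open_row[NUM_BANKS];
        int      valid[NUM_BANKS] = { 0 };
        size_t   hits = 0;

        for (size_t i = 0; i < n; i++) {
            uint32_t row  = addr[i] / ROW_BYTES;  /* row the address maps to */
            uint32_t bank = row % NUM_BANKS;      /* simple row interleaving */
            if (valid[bank] && open_row[bank] == row) {
                hits++;                           /* page-mode hit */
            } else {
                open_row[bank] = row;             /* row miss: open new row */
                valid[bank] = 1;
            }
        }
        return n ? (double)hits / (double)n : 0.0;
    }

    int main(void)
    {
        /* Tiny synthetic trace; sequential misses show high row locality. */
        uint32_t trace[] = { 0x0000, 0x0040, 0x0080, 0x4000, 0x00C0, 0x4040 };
        size_t n = sizeof trace / sizeof trace[0];
        printf("row-buffer hit rate: %.2f\n", row_buffer_hit_rate(trace, n));
        return 0;
    }

In this toy trace, half the misses land in an open row; the wide 2-97% range reported above reflects how strongly this kind of spatial locality varies across applications.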