Given a fixed CPU architecture and a fixed DRAM timing specification, there is still a large design space for a DRAM system organization. Parameters include the number of memory channels, the bandwidth of each channel, burst sizes, queue sizes and organizations, turnaround overhead, the memory-controller page protocol, algorithms for assigning request priorities and scheduling requests dynamically, etc. In this design space, we see a wide variation in application execution times; for example, execution times for the SPEC CPU 2000 integer suite on a 2-way ganged Direct Rambus organization (32 data bits) with 64-byte bursts are 10-20% lower than execution times on an otherwise identical configuration that uses 32-byte bursts. These two system configurations are relatively close to each other in the design space; performance differences become even more pronounced for designs further apart.

This paper characterizes the sources of overhead in high-performance DRAM systems and investigates the most effective ways to reduce a system's exposure to performance loss. In particular, we look at mechanisms to increase a system's support for concurrent transactions, mechanisms to reduce request latency, and mechanisms to reduce the "system overhead"--the portion of the primary memory system's overhead that is not due to DRAM latency but rather to factors such as turnaround time, request queueing, and inefficiencies due to read/write request interleaving.

Our simulator models a 2GHz, highly aggressive out-of-order uniprocessor. The interface to the memory system is fully non-blocking, supporting up to 32 outstanding misses at both the level-1 and level-2 caches, with split-transaction busses to all DRAM banks.
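To make the design space concrete, the following minimal C sketch collects the enumerated parameters into a single configuration record and instantiates the 2-way ganged Direct Rambus example described above. This is an illustration only, not the authors' simulator code; all field names, the enum, and the default values (e.g., the queue depth and turnaround cost) are assumptions chosen for readability.

```c
/* Hypothetical sketch of a DRAM-system design point; not the paper's
 * simulator. Field names and default values are illustrative. */
#include <stdio.h>

enum page_protocol { OPEN_PAGE, CLOSED_PAGE };

struct dram_sys_config {
    int num_channels;            /* independent memory channels         */
    int channel_width_bits;      /* data-bus width per channel          */
    int burst_bytes;             /* burst size, e.g. 32 or 64 bytes     */
    int queue_depth;             /* per-controller request-queue size   */
    int turnaround_cycles;       /* bus read<->write turnaround cost    */
    enum page_protocol protocol; /* memory-controller page policy       */
};

int main(void) {
    /* The 2-way ganged Direct Rambus example from the text: two 16-bit
     * Rambus channels operated together as one 32-bit-wide channel. */
    struct dram_sys_config cfg = {
        .num_channels       = 1,
        .channel_width_bits = 32,
        .burst_bytes        = 64,  /* vs. 32 B: ~10-20% lower exec time */
        .queue_depth        = 32,  /* assumed value                     */
        .turnaround_cycles  = 4,   /* assumed value                     */
        .protocol           = OPEN_PAGE,
    };
    printf("channels=%d, width=%d bits, burst=%d B\n",
           cfg.num_channels, cfg.channel_width_bits, cfg.burst_bytes);
    return 0;
}
```

Sweeping such a record over its parameter ranges is one simple way to frame the design-space exploration the paper describes: each combination of fields is one system configuration whose execution time can be measured.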