Trace-driven simulation is the most widely used method for evaluating the design of future computer memory architectures. Because this methodology demands large amounts of storage and compute time, there is a growing need for simulation techniques that can determine the memory-system requirements of emerging workloads in a reasonable amount of time. Several techniques have been proposed to reduce the space needed to store memory references and to improve the performance of sequential trace-driven simulation. This paper presents the use of binary instrumentation as the memory-reference generator, together with a parallel simulation technique based on the general-purpose graphics processing unit (GPU). One way to achieve fast parallel simulation is to simulate the independent sets of a cache concurrently on different compute resources, but our results show that this approach is inefficient because of the high correlation of activity between different sets. To put parallelism to effective use, we show that a multi-configuration, single-pass simulation method achieves a 2.44x performance improvement over the traditional sequential algorithm.
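
The sketch below is only an illustration of the multi-configuration, single-pass idea, not the paper's actual implementation: each GPU thread owns one candidate cache configuration (here, direct-mapped caches of increasing size), and all configurations are updated from the same reference stream in a single replay of the trace. All names (simulate_configs, NUM_CONFIGS, MAX_SETS), the direct-mapped assumption, and the synthetic trace standing in for addresses produced by a binary-instrumentation tool are assumptions made for the example.

```cuda
// Hypothetical sketch: one GPU thread per cache configuration; every
// configuration is updated from the same trace in a single pass.
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

#define NUM_CONFIGS 8          // direct-mapped caches from 1 KB to 128 KB
#define LINE_BITS   6          // 64-byte cache lines
#define MAX_SETS    (1 << 11)  // tag storage for the largest configuration

// Each thread owns one configuration; its tag array lives in global memory
// at tags[cfg * MAX_SETS].  The trace is replayed once for all configurations.
__global__ void simulate_configs(const uint64_t *trace, size_t n,
                                 uint64_t *tags, uint64_t *misses)
{
    int cfg = blockIdx.x * blockDim.x + threadIdx.x;
    if (cfg >= NUM_CONFIGS) return;

    int set_bits = 4 + cfg;             // configuration cfg has 2^(4+cfg) sets
    uint64_t num_sets = 1ull << set_bits;
    uint64_t *my_tags = tags + (size_t)cfg * MAX_SETS;
    uint64_t miss_count = 0;

    for (size_t i = 0; i < n; ++i) {
        uint64_t line = trace[i] >> LINE_BITS;
        uint64_t set  = line & (num_sets - 1);
        uint64_t tag  = line >> set_bits;
        if (my_tags[set] != tag + 1) {  // +1 so that 0 means "empty set"
            ++miss_count;
            my_tags[set] = tag + 1;
        }
    }
    misses[cfg] = miss_count;
}

int main()
{
    // Tiny synthetic trace standing in for the address stream captured by
    // a binary-instrumentation front end (assumption for illustration).
    const size_t n = 1 << 20;
    uint64_t *h_trace = new uint64_t[n];
    for (size_t i = 0; i < n; ++i)
        h_trace[i] = (i * 64) % (1 << 22);

    uint64_t *d_trace, *d_tags, *d_misses;
    cudaMalloc(&d_trace, n * sizeof(uint64_t));
    cudaMalloc(&d_tags, (size_t)NUM_CONFIGS * MAX_SETS * sizeof(uint64_t));
    cudaMalloc(&d_misses, NUM_CONFIGS * sizeof(uint64_t));
    cudaMemcpy(d_trace, h_trace, n * sizeof(uint64_t), cudaMemcpyHostToDevice);
    cudaMemset(d_tags, 0, (size_t)NUM_CONFIGS * MAX_SETS * sizeof(uint64_t));

    simulate_configs<<<1, NUM_CONFIGS>>>(d_trace, n, d_tags, d_misses);

    uint64_t h_misses[NUM_CONFIGS];
    cudaMemcpy(h_misses, d_misses, sizeof(h_misses), cudaMemcpyDeviceToHost);
    for (int c = 0; c < NUM_CONFIGS; ++c)
        printf("config %d (%llu sets): %llu misses\n",
               c, 1ull << (4 + c), (unsigned long long)h_misses[c]);

    cudaFree(d_trace); cudaFree(d_tags); cudaFree(d_misses);
    delete[] h_trace;
    return 0;
}
```

In this toy arrangement each configuration's simulation is fully independent, so the trace is read once while all design points are evaluated concurrently, which is the kind of parallelism the multi-configuration approach exploits; a real implementation would use far more configurations and a more sophisticated replacement policy.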