Abstract. Parallel profiling tools, such as ThreadScope for Parallel Haskell, allow programmers to obtain information about the performance of their parallel programs. However, the information they provide is not always sufficiently detailed to precisely pinpoint the cause of some performance problems. Often, this is because the cost of obtaining that information would be prohibitive for a complete program execution. In this paper, we adapt the well-known technique of execution replay to make it possible to simulate a previous run of a program. We ensure that the non-deterministic parallel behaviour of the application is properly emulated while the deterministic functional code is run unmodified. In this way, we can gather additional data about the behaviour of a parallel program by replaying some parts of it with more detailed profiling information. We exploit this ability to identify performance bottlenecks in a quicksort implementation, and to derive a version that gives better speedups on multicore machines.