Abstract. The simulation of realistic medical ultrasound imaging is a computationally intensive task. Although this task may be divided and parallelized, temporal and spatial dependencies make memory bandwidth a performance bottleneck. In this paper, we report on our implementation of an ultrasound simulator on the Cell Broadband Engine using the Westervelt equation. Our approach divides the simulation region into blocks, and then advances each block, together with its surrounding blocks, through a number of time steps without storing intermediate pressures to memory. Although this increases the amount of floating-point computation, it reduces the memory bandwidth required over the entire simulation, which improves overall performance. We also analyse how performance may be improved by restricting the simulation to regions that are affected by the transducer output pulse and that influence the final scattered signal received by the transducer.
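The blocking idea described above can be illustrated with a minimal sketch. It is not the Cell implementation: it assumes a 1D linear wave equation with a second-order leapfrog stencil as a stand-in for the Westervelt equation, uses NumPy in place of SPE local-store code, and all function and parameter names (`step`, `advance_plain`, `advance_blocked`, `c2`, `block`) are hypothetical. Each block is loaded once with a halo as wide as the number of time steps, advanced locally, and written back once. The halo cells are recomputed redundantly by neighbouring blocks, which is the extra floating-point work traded for fewer passes over main memory.

```python
import numpy as np

def step(p_prev, p, c2):
    """One leapfrog step of the 1D linear wave equation.

    c2 is the squared Courant number (assumed constant here).
    The two end cells of the array are held fixed.
    """
    p_next = p.copy()
    p_next[1:-1] = (2.0 * p[1:-1] - p_prev[1:-1]
                    + c2 * (p[2:] - 2.0 * p[1:-1] + p[:-2]))
    return p, p_next

def advance_plain(p_prev, p, c2, steps):
    """Reference version: every step sweeps the whole array."""
    for _ in range(steps):
        p_prev, p = step(p_prev, p, c2)
    return p_prev, p

def advance_blocked(p_prev, p, c2, steps, block):
    """Advance `steps` time steps one block at a time.

    Each block [a, b) is copied together with a halo of `steps` cells
    per side (clipped at the domain edge), advanced locally, and only
    the block itself is written back. Stale values at the edge of the
    local copy move inwards one cell per step, so they never reach
    [a, b) within `steps` steps.
    """
    n = p.size
    out_prev = np.empty_like(p)
    out = np.empty_like(p)
    for a in range(0, n, block):
        b = min(a + block, n)
        lo, hi = max(a - steps, 0), min(b + steps, n)
        local_prev, local = advance_plain(
            p_prev[lo:hi].copy(), p[lo:hi].copy(), c2, steps)
        out_prev[a:b] = local_prev[a - lo:b - lo]
        out[a:b] = local[a - lo:b - lo]
    return out_prev, out

# Small demonstration: a Gaussian pulse with zero initial velocity.
x = np.arange(64, dtype=np.float64)
p0 = np.exp(-0.05 * (x - 32.0) ** 2)
ref_prev, ref = advance_plain(p0.copy(), p0.copy(), 0.25, 4)
blk_prev, blk = advance_blocked(p0.copy(), p0.copy(), 0.25, 4, 16)
results_match = np.allclose(ref, blk) and np.allclose(ref_prev, blk_prev)
```

In this sketch the plain version reads and writes the full pressure field on every time step, whereas the blocked version touches main memory once per block for every `steps` time steps. The halo width, and therefore the redundant computation, grows with the number of steps taken per block, so the choice of `steps` and `block` is a trade-off between arithmetic and memory traffic.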