Although graphics processing units (GPUs) rely on threadlevel parallelism to hide long off-chip memory access latency, judicious utilization of on-chip memory resources, including register files, shared memory, and data caches, is critical to application performance. However, explicitly managing GPU on-chip memory resources is a non-trivial task for application developers. More importantly, as onchip memory resources vary among different GPU generations, performance portability has become a daunting challenge.In this paper, we tackle this problem with compilerdriven automatic data placement. We focus on programs that have already been reasonably optimized either manually by programmers or automatically by compiler tools. Our proposed compiler algorithms refine these programs by revising data placement across different types of GPU on-chip resources to achieve both performance enhancement and performance portability. Among 12 benchmarks in our study, our proposed compiler algorithm improves the performance by 1.76x on average on Nvidia GTX480, and by 1.61x on average on GTX680.