To exploit the ever-increasing compute capabilities offered by GPU hardware, GPU-compute workloads have evolved from simple computational kernels to large-scale programs with complex software stacks and numerous kernels. Driving architecture exploration with real workloads hence becomes increasingly challenging, to the point of becoming intractable because of extremely long simulation times with existing architecture simulators. Sampling is a widely used technique to speed up simulation; however, the state-of-the-art sampling method for GPU-compute workloads, Principal Kernel Selection (PKS), falls short for challenging GPU-compute workloads with a large number of kernels and kernel invocations. This paper presents Sieve, an accurate and low-overhead stratified sampling methodology for GPU-compute workloads that groups kernel invocations based on their instruction count, with the goal of minimizing the execution time variability within strata. For the challenging Cactus and MLPerf workloads, we report that Sieve achieves an average prediction error of 1.2% (and at most 3.2%) versus 16.5% (and up to 60.4%) for PKS on real hardware (an Nvidia Ampere GPU), while maintaining a similar simulation speedup of three orders of magnitude. We further demonstrate that Sieve reduces profiling time by a factor of 8× (and up to 98×) compared to PKS.
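The stratified-sampling idea summarized above can be illustrated with a short sketch: kernel invocations are binned into strata by their dynamic instruction count, a small number of representative invocations are simulated per stratum, and each representative is weighted by the size of its stratum so that the weighted sum of representative execution times estimates total execution time. The sketch below is illustrative only; the equal-width binning, the per-stratum sample size, and all function names are assumptions for exposition, not the paper's exact procedure.

```python
# Illustrative sketch of stratified sampling by instruction count.
# Not the paper's implementation; binning scheme and names are assumptions.
from collections import defaultdict
import random

def stratified_sample(invocations, num_strata=10, per_stratum=1, seed=0):
    """invocations: list of (kernel_id, instruction_count) pairs,
    one entry per dynamic kernel invocation.
    Returns a list of (invocation_index, weight) pairs."""
    counts = [c for _, c in invocations]
    lo, hi = min(counts), max(counts)
    width = max((hi - lo) / num_strata, 1)

    # Group invocations into strata by instruction count (equal-width bins).
    strata = defaultdict(list)
    for idx, (_, c) in enumerate(invocations):
        s = min(int((c - lo) / width), num_strata - 1)
        strata[s].append(idx)

    # Pick representatives per stratum; weight each by how many invocations
    # it stands in for, so the weighted sum of the representatives' simulated
    # times estimates the workload's total execution time.
    rng = random.Random(seed)
    sample = []
    for members in strata.values():
        picks = rng.sample(members, min(per_stratum, len(members)))
        weight = len(members) / len(picks)
        sample.extend((idx, weight) for idx in picks)
    return sample
```

Because only the selected representatives need to be simulated in detail, a sampling scheme of this kind trades a small estimation error for a large reduction in simulated kernel invocations.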