The discovery of inductive invariants lies at the heart of static program verification. Presently, many automatic solutions to inductive invariant generation are inflexible, only applicable to certain classes of programs, or unpredictable. An automatic technique that circumvents these deficiencies to some extent is candidate-based invariant generation, whereby a large number of candidate invariants are guessed and then proven to be inductive or rejected using a sound program analyzer. This paper describes our efforts to apply candidate-based invariant generation in GPUVerify, a static checker for programs that run on GPUs. We study a set of 383 GPU programs that contain loops, drawn from a number of open source suites and vendor SDKs. Among this set, 253 benchmarks require provision of loop invariants for verification to succeed. We describe the methodology we used to incrementally improve the invariant generation capabilities of GPUVerify to handle these benchmarks, through candidate-based invariant generation, using cheap static analysis to speculate potential program invariants. We also describe a set of experiments that we used to examine the effectiveness of our rules for candidate generation, assessing rules based on their generality (the extent to which they generate candidate invariants), hit rate (the extent to which the generated candidates hold), worth (the extent to which provable candidates actually help in allowing verification to succeed), and influence (the extent to which the success of one generation rule depends on candidates generated by another rule). We believe that our methodology may serve as a useful framework for other researchers interested in candidate-based invariant generation. The candidates produced by GPUVerify help to verify 231 of the 253 programs. This increase in precision, however, makes GPUVerify sluggish: the more candidates that are generated, the more time is spent determining which are inductive invariants. To speed up this process, we have investigated four under-approximating program analyses that aim to reject false candidates quickly and a framework whereby these analyses can run in sequence or in parallel. Across two platforms, running Windows and Linux, our results show that the best combination of these techniques running sequentially speeds up invariant generation across our benchmarks by 1.17× (Windows) and 1.01× (Linux), with per-benchmark best speedups of 93.58× (Windows) and 48.34× (Linux), and worst slowdowns of 10.24× (Windows) and 43.31× (Linux). We find that parallelizing the strategies marginally improves overall invariant generation speedups to 1.27× (Windows) and 1.11× (Linux), maintains good best-case speedups of 91.18× (Windows) and 44.60× (Linux), and, importantly, dramatically reduces worst-case slowdowns to 3.15× (Windows) and 3.17× (Linux).