Weighted finite-state transducer (WFST) decoding in speech recognition can be accelerated by using graphics processing units (GPUs). To obtain high recognition accuracy, a WFST-based speech recognition system requires a very large language model (LM), represented as a WFST of more than 10 GB. Since a GPU typically has only several GB of memory, it is impossible to store such a large LM in GPU memory. In this paper, we propose a new method for WFST decoding on a GPU. The method utilizes the on-the-fly rescoring algorithm, which performs the Viterbi search on a WFST built with a small LM and rescores hypotheses using a large LM during decoding. We solve the problem of insufficient GPU memory by storing most of the large LM in host memory and copying the data from the host memory to the GPU memory on demand at runtime. Our evaluation of the proposed method on the LibriSpeech test sets using an NVIDIA Tesla V100 GPU shows that it decodes ten times faster than an equivalent CPU implementation without degrading recognition accuracy.
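The core idea of on-the-fly rescoring can be illustrated with a minimal sketch, not the authors' implementation: a hypothesis scored against a small LM during the Viterbi search has its LM contribution replaced by the corresponding large-LM score. The toy unigram tables and the `rescore` function below are assumptions for illustration; a real system uses n-gram LMs represented as WFSTs and applies the correction incrementally at word boundaries.

```python
# Hypothetical sketch of on-the-fly LM rescoring (illustrative only).
# At each word boundary, the small-LM score is subtracted and the
# large-LM score is added:
#   new_score = old_score - small_lm(word) + large_lm(word)

# Toy unigram "LMs" as log10 probabilities (assumed values);
# real systems use large n-gram LMs compiled into WFSTs.
SMALL_LM = {"the": -1.0, "cat": -2.5, "sat": -2.8}
LARGE_LM = {"the": -0.9, "cat": -2.2, "sat": -2.6}

def rescore(hypothesis_words, acoustic_score):
    """Score a hypothesis with the small LM, then swap in large-LM scores."""
    # Score as produced by the Viterbi search over the small-LM WFST.
    score = acoustic_score + sum(SMALL_LM[w] for w in hypothesis_words)
    # On-the-fly correction: replace each small-LM term with the large-LM term.
    for w in hypothesis_words:
        score += LARGE_LM[w] - SMALL_LM[w]
    return score

if __name__ == "__main__":
    hyp = ["the", "cat", "sat"]
    print(round(rescore(hyp, acoustic_score=-10.0), 2))
```

Because only the LM correction terms are needed at runtime, the large LM can stay in host memory and its entries can be fetched into GPU memory on demand, which is the memory-saving strategy the paper builds on.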