With the development of image processing technology, pencil drawing rendering has been widely used in video games and mobile phone applications. However, existing pencil drawing algorithms take a long time to convert a photograph into a pencil drawing, which makes them difficult to apply in real-time systems. This paper proposes a fast parallel pencil drawing generation algorithm based on the graphics processing unit (GPU) to accelerate real-time sketch rendering. First, the parallelism of the pencil drawing generation algorithm is identified through theoretical analysis. Then, the sub-algorithms of the sequential algorithm are parallelized using the compute unified device architecture (CUDA) programming model and executed with thread-level parallelism. Furthermore, an optimal data caching pattern that reduces the access time of the most frequently used data is constructed using shared memory and constant memory. Finally, task-level parallelism is achieved with CUDA streams, which overlap independent sub-tasks for further acceleration. Experimental results on the CUDA platform demonstrate that the proposed parallel algorithm achieves a significant speedup: on 2560×1920-resolution images it is 448.59 times faster than the sequential algorithm, while maintaining a high degree of similarity to real pencil drawings. Hence, the proposed algorithm is suitable for real-time pencil drawing rendering and has promising applications in non-photorealistic rendering.

INDEX TERMS Non-photorealistic rendering, pencil drawing, parallel algorithm, GPU platform, convolution operation, CUDA.
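The combination of thread-level parallelism with a constant-memory/shared-memory caching pattern can be illustrated on the convolution operation listed among the index terms. The following is a minimal, generic CUDA sketch under stated assumptions, not the authors' implementation: the filter radius (`RADIUS`), tile size (`TILE`), the box-filter weights, clamp-to-edge border handling, and the function names `convolve2d`, `convolve_cpu`, and `run_check` are all hypothetical choices for illustration. The filter weights are placed in `__constant__` memory because every thread reads the same coefficients, and each thread block stages its input patch plus a halo in `__shared__` memory so that neighboring threads reuse pixels instead of re-reading global memory.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical parameters chosen for illustration only (not from the paper).
#define RADIUS 2                   // filter radius
#define KSIZE (2 * RADIUS + 1)     // filter edge length (5x5 taps)
#define TILE 16                    // output tile edge; each block is TILE x TILE threads
#define PATCH (TILE + 2 * RADIUS)  // shared-memory patch edge (tile plus halo)

// Every thread reads the same weights, so they are cached in constant memory.
__constant__ float c_filter[KSIZE * KSIZE];

__host__ __device__ inline int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

// Each block loads its input patch (tile plus halo) into shared memory once,
// then each thread computes one output pixel from the cached patch.
__global__ void convolve2d(const float* in, float* out, int w, int h) {
    __shared__ float patch[PATCH][PATCH];
    const int x0 = (int)blockIdx.x * TILE - RADIUS;  // global x of patch column 0
    const int y0 = (int)blockIdx.y * TILE - RADIUS;  // global y of patch row 0

    // Cooperative load: the patch is larger than the block, so some threads load twice.
    for (int py = threadIdx.y; py < PATCH; py += TILE)
        for (int px = threadIdx.x; px < PATCH; px += TILE)
            patch[py][px] = in[clampi(y0 + py, 0, h - 1) * w + clampi(x0 + px, 0, w - 1)];
    __syncthreads();  // the patch must be complete before any thread reads it

    const int ox = (int)blockIdx.x * TILE + (int)threadIdx.x;
    const int oy = (int)blockIdx.y * TILE + (int)threadIdx.y;
    if (ox >= w || oy >= h) return;  // threads of partial edge tiles produce no output

    float acc = 0.0f;
    for (int ky = 0; ky < KSIZE; ++ky)
        for (int kx = 0; kx < KSIZE; ++kx)
            acc += c_filter[ky * KSIZE + kx] * patch[threadIdx.y + ky][threadIdx.x + kx];
    out[oy * w + ox] = acc;
}

// Sequential reference using the same clamp-to-edge border rule.
static void convolve_cpu(const std::vector<float>& in, std::vector<float>& out,
                         const float* f, int w, int h) {
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float acc = 0.0f;
            for (int ky = 0; ky < KSIZE; ++ky)
                for (int kx = 0; kx < KSIZE; ++kx)
                    acc += f[ky * KSIZE + kx] *
                           in[clampi(y + ky - RADIUS, 0, h - 1) * w + clampi(x + kx - RADIUS, 0, w - 1)];
            out[y * w + x] = acc;
        }
}

// Runs the GPU and CPU versions on a synthetic w x h image and returns the
// maximum absolute difference (INFINITY if a CUDA call failed).
float run_check(int w, int h) {
    float f[KSIZE * KSIZE];
    for (int i = 0; i < KSIZE * KSIZE; ++i) f[i] = 1.0f / (KSIZE * KSIZE);  // box blur
    std::vector<float> in(w * h), ref(w * h), gpu(w * h);
    for (int i = 0; i < w * h; ++i) in[i] = (i * 37 % 101) / 100.0f;  // deterministic pattern

    float *d_in = nullptr, *d_out = nullptr;
    const size_t bytes = in.size() * sizeof(float);
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(c_filter, f, sizeof(f));

    dim3 block(TILE, TILE);
    dim3 grid((w + TILE - 1) / TILE, (h + TILE - 1) / TILE);
    convolve2d<<<grid, block>>>(d_in, d_out, w, h);
    cudaMemcpy(gpu.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    if (cudaGetLastError() != cudaSuccess) return INFINITY;

    convolve_cpu(in, ref, f, w, h);
    float maxdiff = 0.0f;
    for (size_t i = 0; i < ref.size(); ++i)
        maxdiff = std::fmax(maxdiff, std::fabs(ref[i] - gpu[i]));
    return maxdiff;
}

int main() {
    // 100 x 75 is not a multiple of TILE, so partial tiles and borders are exercised.
    float d = run_check(100, 75);
    printf("max abs difference vs CPU: %g\n", d);
    return d < 1e-4f ? 0 : 1;
}
```

`run_check` compares the GPU output with a sequential CPU reference, which is a simple way to confirm that the shared-memory tiling and halo indexing reproduce the sequential result before measuring speed. The tile and filter sizes here are arbitrary; real choices depend on the filter sizes the pencil drawing pipeline actually uses and on the target GPU's shared-memory capacity.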