In the process of Canny edge detection, a large number of high complexity calculations such as Gaussian filtering, gradient calculation, non-maximum suppression, and double threshold judgment need to be performed on the image, which takes up a lot of operation time, which is a great challenge to the real-time requirements of the algorithm. In order to solve this problem, a fine-grained parallel Canny edge detection method is proposed, which is optimized from three aspects: task partition, vector memory access, and NDRange optimization, and CPU-GPU collaborative parallelism is realized. At the same time, the parallel Canny edge detection methods based on multi-core CPU and CUDA architecture are designed. The experimental results show that OpenCL accelerated Canny edge detection algorithm can achieve 20.68 times, 3.96 times, and 1.21 times speedup ratio compared with CPU serial algorithm, CPU multi-threaded parallel algorithm, and CUDA-based parallel algorithm, respectively. The effectiveness and performance portability of the proposed Canny edge detection parallel algorithm are verified, and it provides a reference for the research of fast calculation of image big data.