Energy saving and optimization play an increasingly important role in industrial electronic systems. A heterogeneous embedded system is composed of a general-purpose central processing unit (CPU) augmented by a module of graphics processing units (GPUs). This paper explores task granularity and software prefetching as strategies for energy optimization. We propose a novel energy optimization model for GPU-based embedded systems that exploits the spatial and temporal relations of a communication-based pipeline. We analyze the characteristics of multithreaded execution on parallel GPUs. We present an effective algorithm for dynamic power optimization that adaptively adjusts the software prefetching distance. The experimental results show that dynamic energy consumption is reduced by 22.1% and 21.8% under the register and shared-memory prefetching strategies, respectively, without loss of performance. We demonstrate the effectiveness of the proposed methods for reducing the energy consumption of performance-driven computing in industrial scenarios.