Recent advances in edge computing power are primarily attributable to accelerators that offer extensive hardware parallelism. One practical application is computational imaging (CI), where GPU acceleration is pivotal, especially for reconstructing 2D images with techniques such as Single-Pixel Imaging (SPI). In SPI, compressive sensing (CS) algorithms, deep learning, and Fourier transforms are essential for 2D image reconstruction. These algorithms gain substantial performance from parallelism, which reduces processing times. Exploiting the full potential of the GPU relies on several strategies: optimizing memory access, unrolling loops, designing computational kernels that reduce the number of operations, using asynchronous operations, and increasing the number of active threads and warps. In laboratory settings, integrating embedded GPUs is essential for optimizing these algorithms on SoC-GPUs. This study focuses on accelerating Fast Hadamard Single-Pixel Imaging (FHSI) for 2D image reconstruction on NVIDIA's Xavier platform. By implementing various parallel computing techniques in PyCUDA, we achieved a speedup of approximately 10 times, bringing processing times close to real time.
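To make concrete the kind of computation that benefits from this parallelism, the following is a minimal CPU sketch of a fast Walsh-Hadamard transform in natural (Hadamard) ordering, written in plain Python. It is not the FHSI implementation described in this study: the function names `fwht` and `reconstruct`, the unnormalized forward transform, and the toy 8-element "scene" are all assumptions made for illustration. The point is only that each stage is a set of independent butterfly operations, which is the structure a GPU kernel could distribute across threads.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform, natural ordering.

    Hypothetical reference routine for illustration only.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Each stage performs n/2 butterflies that do not depend on one
        # another, so on a GPU each could be handled by a separate thread.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


def reconstruct(measurements):
    """Invert fwht: the Hadamard matrix H satisfies H @ H = n * I."""
    n = len(measurements)
    return [v / n for v in fwht(measurements)]


if __name__ == "__main__":
    scene = [3, 1, 4, 1, 5, 9, 2, 6]   # hypothetical flattened image
    measurements = fwht(scene)          # simulated Hadamard measurements
    assert reconstruct(measurements) == scene
```

The transform needs log2(n) stages of n/2 butterflies each, instead of the n squared multiply-adds of a dense matrix product, and the same routine serves for both the forward and the inverse direction up to the 1/n scale factor.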