A particle filter (PF) has been introduced for effective position estimation of moving targets for non-Gaussian and nonlinear systems. The time difference of arrival (TDOA) method using acoustic sensor array has normally been used to for estimation by concealing the location of a moving target, especially underwater. In this paper, we propose a GPU -based acceleration of target position estimation using a PF and propose an efficient system and software architecture. The proposed graphic processing unit (GPU)-based algorithm has more advantages in applying PF signal processing to a target system, which consists of large-scale Internet of Things (IoT)-driven sensors because of the parallelization which is scalable. For the TDOA measurement from the acoustic sensor array, we use the generalized cross correlation phase transform (GCC-PHAT) method to obtain the correlation coefficient of the signal using Fast Fourier Transform (FFT), and we try to accelerate the calculations of GCC-PHAT based TDOA measurements using FFT with GPU compute unified device architecture (CUDA). The proposed approach utilizes a parallelization method in the target position estimation algorithm using GPU-based PF processing. In addition, it could efficiently estimate sudden movement change of the target using GPU-based parallel computing which also can be used for multiple target tracking. It also provides scalability in extending the detection algorithm according to the increase of the number of sensors. Therefore, the proposed architecture can be applied in IoT sensing applications with a large number of sensors. The target estimation algorithm was verified using MATLAB and implemented using GPU CUDA. We implemented the proposed signal processing acceleration system using target GPU to analyze in terms of execution time. The execution time of the algorithm is reduced by 55% from to the CPU standalone operation in target embedded board, NVIDIA Jetson TX1. Also, to apply large-scaled IoT sensing applications, we use NVIDIA Tesla K40c as target GPU. The execution time of the proposed multi-state-space model-based algorithm is similar to the one-state-space model algorithm because of GPU-based parallel computing. Experimental results show that the proposed architecture is a feasible solution in terms of high-performance and area-efficient architecture.