Direction of Arrival (DoA) estimation is essential to adaptive beamforming, which is widely used in radar and wireless communication systems. Although many estimation algorithms have been investigated, most of them focus on performance enhancement and overlook computational complexity and hardware implementation issues. In this paper, a low-complexity yet effective DoA estimation algorithm and the corresponding hardware accelerator chip design are presented. The proposed algorithm combines signal subspace projection with parallel matching pursuit, i.e., signal projection is applied first, before matching pursuit is performed against a codebook. This measure helps minimize the interference from the noise subspace and frees the matching process from extra orthogonalization computations, so the computational complexity is reduced significantly. In addition, all signal sources can be estimated in parallel without going through a successive update process. To facilitate an efficient hardware implementation, the computing scheme of the estimation algorithm is also optimized. The most critical part of the algorithm, i.e., calculating the projection matrix, is greatly simplified and accomplished using QR decomposition. The proposed scheme also matches all signal sources against a beamforming codebook in parallel to improve the processing throughput. The complexity analysis shows that the proposed scheme significantly outperforms other well-known estimation algorithms under various system configurations. The performance simulation results further reveal that, with a beamforming codebook of 5° angular resolution, the Root Mean Square (RMS) error of the angle estimates is only 0.76° at a Signal to Noise Ratio (SNR) of 20 dB.
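The projection-then-match idea can be illustrated with a minimal NumPy sketch. It is not the implementation presented in this paper: the abstract does not specify how QR decomposition yields the projection matrix, so the sketch assumes that an orthonormal signal-subspace basis is obtained from a thin QR decomposition of the first K columns of the sample covariance matrix. The half-wavelength uniform linear array, the function names (`steering_vector`, `estimate_doa`, `simulate`), and all parameter values are likewise hypothetical choices made for illustration.

```python
import numpy as np


def steering_vector(theta_deg, num_antennas):
    """Steering vector of a half-wavelength uniform linear array (assumed geometry)."""
    phase = np.pi * np.sin(np.deg2rad(theta_deg))
    return np.exp(1j * phase * np.arange(num_antennas))


def estimate_doa(snapshots, num_sources, codebook_deg):
    """Project codebook entries onto an estimated signal subspace, then pick
    the strongest local peaks for all sources at once (no residual update)."""
    num_antennas, num_snapshots = snapshots.shape
    cov = snapshots @ snapshots.conj().T / num_snapshots

    # Assumption: K columns of the covariance matrix approximately span the
    # signal subspace; thin QR orthonormalizes them (q has shape M x K).
    q, _ = np.linalg.qr(cov[:, :num_sources])

    # Codebook of steering vectors, one column per candidate angle.
    codebook = np.stack(
        [steering_vector(t, num_antennas) for t in codebook_deg], axis=1
    )

    # Energy of each codebook entry after projection onto the signal subspace.
    energy = np.sum(np.abs(q.conj().T @ codebook) ** 2, axis=0)

    # Keep local maxima only, so one source does not claim adjacent entries.
    padded = np.concatenate(([-np.inf], energy, [-np.inf]))
    is_peak = (energy > padded[:-2]) & (energy >= padded[2:])
    peak_idx = np.flatnonzero(is_peak)

    # Select the K strongest peaks in a single pass.
    best = peak_idx[np.argsort(energy[peak_idx])[::-1][:num_sources]]
    return np.sort(codebook_deg[best])


def simulate(true_deg, num_antennas=8, num_snapshots=200, snr_db=20.0, seed=0):
    """Generate noisy array snapshots for uncorrelated unit-power sources."""
    rng = np.random.default_rng(seed)
    k = len(true_deg)
    a = np.stack([steering_vector(t, num_antennas) for t in true_deg], axis=1)
    s = (rng.standard_normal((k, num_snapshots))
         + 1j * rng.standard_normal((k, num_snapshots))) / np.sqrt(2)
    sigma = 10 ** (-snr_db / 20)
    n = sigma * (rng.standard_normal((num_antennas, num_snapshots))
                 + 1j * rng.standard_normal((num_antennas, num_snapshots))) / np.sqrt(2)
    return a @ s + n
```

As a usage example, `estimate_doa(simulate([-20.0, 30.0]), 2, np.arange(-85, 90, 5))` matches two simulated sources against a codebook with 5° resolution. Because the codebook energies are computed once from a fixed projection, the per-source selections are independent of one another, which is what allows them to run in parallel.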
The estimation accuracy of the proposed scheme exceeds that of other matching-pursuit-based approaches and is close to that of the classic Estimation of Signal Parameters Via Rotational Invariance Techniques (ESPRIT) scheme, while requiring only one fifth of ESPRIT's computational complexity. In the hardware accelerator design, pipelined Coordinate Rotation Digital Computer (CORDIC) processors consisting of simple adders and shifters are employed to implement the basic trigonometric operations needed in QR decomposition. A systolic array architecture is developed as the computing kernel for QR decomposition. The other computing modules are also realized as linear systolic arrays and chained together seamlessly to maximize the computing throughput. A Taiwan Semiconductor Manufacturing Company (TSMC) 40 nm CMOS process was chosen as the implementation technology. The chip design has a gate count of 454.4k and a core size of 0.76 mm², and can operate at up to 333 MHz. This suggests that one DoA estimation, with up to three signal sources, can be performed every 2.38 μs.
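To show how a CORDIC processor can compute the rotations needed in QR decomposition using only additions and shifts, the following is a minimal fixed-point sketch of CORDIC in vectoring mode. It is a software illustration only and does not reflect the pipelined design of this chip; the iteration count, the 16-bit fractional format, and the names `cordic_vectoring`, `ATAN_TABLE`, and `GAIN` are assumptions made for the example.

```python
import math

ITERATIONS = 16   # assumed number of micro-rotations
FRAC_BITS = 16    # assumed fixed-point fractional bits

# Precomputed elementary angles atan(2^-i) in fixed point.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS))
              for i in range(ITERATIONS)]

# Constant gain accumulated by the micro-rotations (about 1.6468).
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_vectoring(x, y):
    """Rotate the integer vector (x, y), with x > 0, onto the x-axis.

    Returns (x, y, angle): x is the magnitude scaled by GAIN, y is the
    small residual, and angle is the accumulated rotation in fixed-point
    radians. Only additions, subtractions, and shifts are used in the loop.
    """
    angle = 0
    for i in range(ITERATIONS):
        if y >= 0:
            # Rotate clockwise to drive y toward zero.
            x, y, angle = x + (y >> i), y - (x >> i), angle + ATAN_TABLE[i]
        else:
            # Rotate counterclockwise.
            x, y, angle = x - (y >> i), y + (x >> i), angle - ATAN_TABLE[i]
    return x, y, angle
```

For example, calling `cordic_vectoring(3 << FRAC_BITS, 4 << FRAC_BITS)` and dividing the returned `x` by `GAIN * 2**FRAC_BITS` gives a value close to 5, the magnitude of the vector (3, 4). In a Givens-rotation-based QR decomposition, the angle found this way would then be applied to the remaining elements of the two rows involved, which is the role the rotation-mode counterpart of this routine plays.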