This research presents an integrated approach to object detection and tracking for autonomous perception systems, combining deep learning-based object detection with sensor fusion and a field-programmable gate array (FPGA) implementation of the Kalman filter. The approach is suitable for applications such as autonomous vehicles, robotics, and augmented reality. The study explores the seamless integration of pre-trained deep learning models, sensor data from an Intel RealSense D435 depth camera, and FPGA-based Kalman filtering to achieve robust and accurate 3D position and 2D size estimation of tracked objects while maintaining low latency. Object detection and feature extraction run on a central processing unit (CPU), while the Kalman filter sensor fusion and the universal asynchronous receiver-transmitter (UART) communication are implemented on a Basys 3 FPGA board, which runs eight times faster than the equivalent software implementation. Experimental results show hardware resource utilization of about 29% of look-up tables (LUTs), 6% of LUT RAMs (LUTRAM), 15% of flip-flops, 32% of block RAM, and 38% of DSP blocks, with the design operating at 100 MHz and the UART at a baud rate of 230400. The complete FPGA design executes in 2.1 milliseconds, of which the Kalman filter takes 240 microseconds and the UART transfer 1.86 milliseconds.
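To make the estimated quantities concrete, the following is a minimal software sketch of a constant-velocity Kalman filter over a 3D position and 2D bounding-box size measurement, as described above; it is not the authors' FPGA design, and the state layout, frame period, and noise parameters are illustrative assumptions only.

```python
import numpy as np

class PositionSizeKF:
    """Illustrative constant-velocity Kalman filter for [x, y, z, w, h] measurements."""

    def __init__(self, dt=1.0 / 30.0):
        # Assumed state: [x, y, z, w, h, vx, vy, vz] -- 3D position, 2D size, position velocity.
        self.x = np.zeros(8)
        self.P = np.eye(8)
        self.F = np.eye(8)
        self.F[0, 5] = self.F[1, 6] = self.F[2, 7] = dt   # position integrates velocity
        self.H = np.zeros((5, 8))
        self.H[:5, :5] = np.eye(5)                        # position and size are measured directly
        self.Q = np.eye(8) * 1e-3                         # assumed process noise
        self.R = np.eye(5) * 1e-2                         # assumed measurement noise

    def step(self, z):
        """One predict/update cycle for a fused measurement z = [x, y, z, w, h]."""
        # Predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # Update
        y = z - self.H @ self.x                           # innovation
        S = self.H @ self.P @ self.H.T + self.R           # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)          # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(8) - K @ self.H) @ self.P
        return self.x[:5]                                 # filtered position and size
```

On the Basys 3, the same predict/update equations would be realized in pipelined fixed-point logic rather than floating-point NumPy; this sketch only illustrates the structure of the estimation step.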