Recent advancements in 3D object detection using light detection and ranging (LiDAR)-camera fusion have enhanced autonomous driving perception. However, aligning LiDAR and image data during multimodal fusion remains a significant challenge. We propose a novel multi-modal feature alignment and fusion architecture to effectively align and fuse voxel and image data. The proposed architecture comprises four key modules. The Z-axis attention module aggregates voxel features along the vertical axis using self-attention. The voxel-domain deformable encoder applies deformable attention to encode voxel features and improve context understanding. The dual-domain deformable feature alignment module uses deformable attention to adaptively align voxel and image features, addressing resolution mismatches between the two domains. Finally, the gated fusion module uses a gating mechanism to dynamically fuse the aligned features. A multi-layer design further preserves feature detail and improves dual-domain fusion performance. Experimental results show that our method increases average precision by 2.41% at the "hard" difficulty level for cars on the KITTI test set. On the KITTI validation set, mean average precision improves by 1.06% for cars, 6.88% for pedestrians, and 1.83% for cyclists.
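To make the gating idea concrete, the following is a minimal PyTorch sketch of a gated fusion block as described above: a learned sigmoid gate weighs the voxel branch against the aligned image branch per element. The module name `GatedFusion`, the 1x1-convolution gate, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class GatedFusion(nn.Module):
    """Sketch of a gated fusion block: a learned gate decides, per element,
    how much of the voxel feature versus the aligned image feature to keep.
    The channel count is a placeholder assumption."""

    def __init__(self, channels: int):
        super().__init__()
        # The gate is predicted from the concatenation of both modalities.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, voxel_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # voxel_feat, image_feat: (B, C, H, W) feature maps assumed to be
        # already aligned to a common grid by the alignment module.
        g = self.gate(torch.cat([voxel_feat, image_feat], dim=1))
        # Convex combination: g weights the voxel branch, (1 - g) the image branch.
        return g * voxel_feat + (1.0 - g) * image_feat


if __name__ == "__main__":
    fusion = GatedFusion(channels=64)
    v = torch.randn(2, 64, 128, 128)  # dummy voxel features
    i = torch.randn(2, 64, 128, 128)  # dummy image features on the same grid
    print(fusion(v, i).shape)         # torch.Size([2, 64, 128, 128])
```

In this sketch the gate is computed jointly from both modalities, so regions where the image features are unreliable (e.g., occlusions) can fall back to the voxel features, which matches the dynamic-fusion behavior the gating mechanism is intended to provide.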