6D pose estimation from RGB-D data holds significant application value in computer vision and related fields. Current deep learning methods commonly employ convolutional networks for feature extraction; these are sensitive to keypoints at close range but overlook information from keypoints at longer range. Moreover, in subsequent stages they fail to effectively fuse spatial features (depth-channel features) with color-texture features (RGB-channel features). These limitations compromise the accuracy of existing RGB-D-based 6D pose networks. To address this issue, this paper proposes a novel end-to-end network. Specifically, in the depth-value extraction branch, a mask-vector attention mechanism establishes global spatial weights, enabling robust extraction of depth values. In the fusion stage, a symmetrical fusion module uses a cross-attention mechanism to integrate spatial and color-texture features, improving the efficiency of target representation and achieving self-correlated fusion between the modalities, thereby effectively improving the accuracy of 6D pose estimation. Experimental evaluations on the LINEMOD and LINEMOD-OCCLUSION datasets show that the proposed method achieves ADD(-S) scores of 95.84% and 47.89%, respectively. Compared with state-of-the-art methods, our approach performs better at pose estimation for objects with complex shapes, and under occlusion it effectively improves the pose-estimation accuracy of asymmetric objects.
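To make the symmetrical fusion idea concrete, the sketch below illustrates one plausible form of cross-attention fusion between per-point depth features and color-texture features. The abstract does not specify the module's internals, so everything here is an assumption: the class name `SymmetricCrossAttentionFusion`, the feature dimensions, and the use of PyTorch's `nn.MultiheadAttention` are illustrative choices, not the authors' exact design.

```python
import torch
import torch.nn as nn


class SymmetricCrossAttentionFusion(nn.Module):
    """Hypothetical sketch of a symmetrical cross-attention fusion module.

    Two cross-attention passes (RGB -> depth and depth -> RGB) let each
    modality be re-weighted by its correlation with the other; the two
    streams are then concatenated. Layer sizes and the reliance on
    nn.MultiheadAttention are assumptions, not the paper's exact design.
    """

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        # RGB features attend to depth (spatial) features, and vice versa.
        self.rgb_to_depth = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.depth_to_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_depth = nn.LayerNorm(dim)

    def forward(self, rgb_feat: torch.Tensor, depth_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, depth_feat: (batch, num_points, dim) per-point features.
        # Queries come from one modality, keys/values from the other, so each
        # branch is weighted by its relevance to the opposite modality.
        rgb_fused, _ = self.rgb_to_depth(rgb_feat, depth_feat, depth_feat)
        depth_fused, _ = self.depth_to_rgb(depth_feat, rgb_feat, rgb_feat)
        rgb_out = self.norm_rgb(rgb_feat + rgb_fused)        # residual + norm
        depth_out = self.norm_depth(depth_feat + depth_fused)
        # Concatenate the two symmetric streams into one fused representation.
        return torch.cat([rgb_out, depth_out], dim=-1)       # (batch, N, 2*dim)


if __name__ == "__main__":
    fusion = SymmetricCrossAttentionFusion(dim=128, num_heads=4)
    rgb = torch.randn(2, 500, 128)    # color-texture features for 500 points
    depth = torch.randn(2, 500, 128)  # spatial (depth) features for 500 points
    print(fusion(rgb, depth).shape)   # torch.Size([2, 500, 256])
```

The symmetric query/key swap is what distinguishes this from one-directional attention: each modality both queries and informs the other, which matches the abstract's description of mutually correlated fusion between the depth and RGB branches.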