In this study, we propose a hybrid model for Perspective-n-Point (PnP)-based 6D object pose estimation called FusionNet that takes advantage of convolutional neural networks (CNN) and Transformers. CNN is an effective and potential tool for feature extraction, which is considered the most popular architecture. However, CNN has difficulty in capturing long-range dependencies between features, and most CNN-based models for 6D object pose estimation are bulky and heavy. To address these problems, we propose a lighter-weight CNN building block with attention, design a Transformer-based global dependency encoder, and integrate them into a single model. Our model is able to extract dense 2D–3D point correspondences more accurately while significantly reducing the number of model parameters. Followed with a PnP header that replaces the PnP algorithm for general end-to-end pose estimation, our model showed better or highly competitive performance in pose estimation compared with other state-of-the-art models in experiments on the LINEMOD dataset.