Human pose estimation is a very important research topic in computer vision and attracts more and more researchers. Recently, ViTPose based on heatmap representation refreshed the state of the art for pose estimation methods. However, we find that ViTPose still has room for improvement in our experiments. On the one hand, the PatchEmbedding module of ViTPose uses a convolutional layer with a stride of 14 × 14 to downsample the input image, resulting in the loss of a significant amount of feature information. On the other hand, the two decoding methods (Classical Decoder and Simple Decoder) used by ViTPose are not refined enough: transpose convolution in the Classical Decoder produces the inherent chessboard effect; the upsampling factor in the Simple Decoder is too large, resulting in the blurry heatmap. To this end, we propose a novel pose estimation method based on ViTPose, termed RefinePose. In RefinePose, we design the GradualEmbedding module and Fusion Decoder, respectively, to solve the above problems. More specifically, the GradualEmbedding module only downsamples the image to 1/2 of the original size in each downsampling stage, and it reduces the input image to a fixed size (16 × 112 in ViTPose) through multiple downsampling stages. At the same time, we fuse the outputs of max pooling layers and convolutional layers in each downsampling stage, which retains more meaningful feature information. In the decoding stage, the Fusion Decoder designed by us combines bilinear interpolation with max unpooling layers, and gradually upsamples the feature maps to restore the predicted heatmap. In addition, we also design the FeatureAggregation module to aggregate features after sampling (upsampling and downsampling). We validate the RefinePose on the COCO dataset. The experiments show that RefinePose has achieved better performance than ViTPose.