The Swin Transformer is a variant of the Vision Transformer that builds a hierarchical representation by computing multi-head self-attention within shifted, non-overlapping windows. This design copes with the problem of scale variation and performs well on many computer vision tasks. In image retrieval, high-quality feature descriptors are essential for retrieval accuracy. This paper proposes a self-ensemble Swin Transformer network that fuses the features from different layers of the Swin Transformer, suppressing the noise present in any single layer and improving retrieval performance. Two experiments were conducted, one on the In-shop Clothes Retrieval dataset and one on the Stanford Online Products dataset. The results show that the proposed method substantially improves the retrieval performance of features extracted with a Vision Transformer and surpasses previous state-of-the-art image retrieval methods. In the second experiment, the feature maps of the trained model were visualized, revealing that, compared with the original network, the improved network attends far less to noise points and focuses more strongly on the relevant image content.
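To make the layer-fusion idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of a self-ensemble head that pools the feature maps produced by the four Swin Transformer stages, projects each to a common embedding dimension, and averages the projected embeddings into a single retrieval descriptor. The backbone interface, the Swin-T-like stage channel widths, and the 512-dimensional embedding size are illustrative assumptions.

```python
# Minimal sketch (assumptions: a backbone that returns per-stage feature maps in
# NCHW format, Swin-T-like channel widths, and a 512-d retrieval embedding).
# This illustrates layer-wise feature fusion, not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfEnsembleHead(nn.Module):
    """Fuse per-stage features into one descriptor by pooling, projecting, and averaging."""

    def __init__(self, stage_channels=(96, 192, 384, 768), embed_dim=512):
        super().__init__()
        # One linear projection per stage maps its pooled feature to a shared space.
        self.projections = nn.ModuleList(
            nn.Linear(c, embed_dim) for c in stage_channels
        )

    def forward(self, stage_features):
        # stage_features: list of tensors, each (B, C_i, H_i, W_i) from one Swin stage.
        embeddings = []
        for feat, proj in zip(stage_features, self.projections):
            pooled = F.adaptive_avg_pool2d(feat, 1).flatten(1)      # (B, C_i)
            embeddings.append(F.normalize(proj(pooled), dim=1))     # unit-norm per stage
        # Averaging the per-stage embeddings damps noise that appears in only one layer.
        fused = torch.stack(embeddings, dim=0).mean(dim=0)
        return F.normalize(fused, dim=1)  # final L2-normalized retrieval descriptor


if __name__ == "__main__":
    # Dummy per-stage maps with Swin-T-like shapes for a 224x224 input.
    feats = [
        torch.randn(2, 96, 56, 56),
        torch.randn(2, 192, 28, 28),
        torch.randn(2, 384, 14, 14),
        torch.randn(2, 768, 7, 7),
    ]
    head = SelfEnsembleHead()
    print(head(feats).shape)  # torch.Size([2, 512])
```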