With the development of vision transformers [18,23,39,50,55], Masked Image Modeling (MIM) has gradually displaced contrastive learning [10,25,54] as the dominant paradigm in visual self-supervised representation learning, owing to its superior fine-tuning performance on various downstream vision tasks. Many target signals have been designed for the mask-prediction pretext task in MIM, such as normalized pixels [24,60], discrete tokens [2,17], HOG features [57], deep features [1,67], and frequencies [38,59]. However, all of these targets are applied only as single-scale supervision for reconstruction.
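For reference, the mask-prediction objective shared by these methods can be written generically as follows (our notation, not tied to any single cited work): let $\mathcal{M}$ denote the set of masked patch indices, $\hat{x}$ the masked input, $f_\theta$ the encoder-decoder, and $t_i$ the chosen target signal for patch $i$; an $\ell_2$ regression loss is assumed here, whereas token-based targets [2,17] would instead use a cross-entropy over the token vocabulary:
\[
\mathcal{L}_{\mathrm{MIM}} \;=\; \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \big\| f_\theta(\hat{x})_i - t_i \big\|_2^2 .
\]
Single-scale supervision means that the targets $t_i$ are computed at one fixed patch resolution only, regardless of which signal (pixels, HOG, deep features, or frequencies) defines them.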