“…The goal of MJP is to enhance the capacity of position insensitivity that directly increases the difficulties of image recovery from gradient updates, while to preserve the accuracy on the standard classification in the pre-training. In practice, image masking methods are commonly applied in recent vision task [2,57,20,58,7], which is a useful and off-the-shelf strategy for self-supervised image reconstruction. Given an input image x ∈ R H×W ×C , we reshape it into a sequence of flattened 2D patches x p ∈ R N ×(P 2 •C) , where (H, W ) is the resolution of the original image, C is the number of channels, (P, P ) is the resolution of each image patch, and N = HW/P 2 is the resulting number of patches.…”