“…This self-supervised training strategy does not depend on data-specific augmentations and is agnostic to the data modality. Following its success in natural language models such as BERT [7], it has made its way into vision [9,12,36], 3D data processing [14,16,38], and many other domains [4,10,13,21]. MAE has also been successfully employed in multimodal learning [4,5,11]; yet, again, these approaches rely on a separate encoder for each data modality.…”
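To make the augmentation-free, modality-agnostic recipe concrete, below is a minimal PyTorch sketch of masked-autoencoder pretraining over generic token sequences. It is an illustrative assumption, not the architecture of any cited work: the class name `TinyMAE`, all hyperparameters, and the plain linear tokenizer are hypothetical stand-ins. The sketch masks a random subset of tokens, encodes only the visible ones, and computes a reconstruction loss on the masked positions; positional embeddings and other practical details are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal masked-autoencoder sketch (hypothetical, for illustration):
    embed tokens, drop a random subset, encode the visible tokens, and
    reconstruct the full sequence. No modality-specific augmentation is
    involved; any data that can be tokenized fits this interface."""

    def __init__(self, dim=64, input_dim=32, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(input_dim, dim)  # stand-in for a tokenizer
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=1,
        )
        self.head = nn.Linear(dim, input_dim)  # reconstruct raw inputs

    def forward(self, x):  # x: (B, N, input_dim)
        B, N, _ = x.shape
        tokens = self.embed(x)

        # Random masking: keep a random subset of tokens per sample.
        n_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_keep = noise.argsort(dim=1)[:, :n_keep]
        visible = torch.gather(
            tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )

        # Encode only the visible tokens (the source of MAE's efficiency).
        latent = self.encoder(visible)

        # Scatter encoded tokens back; masked slots get a learned mask token.
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(
            1, ids_keep.unsqueeze(-1).expand(-1, -1, latent.size(-1)), latent
        )
        recon = self.head(self.decoder(full))

        # Reconstruction loss on masked positions only.
        mask = torch.ones(B, N, device=x.device)
        mask.scatter_(1, ids_keep, 0.0)
        loss = (((recon - x) ** 2).mean(-1) * mask).sum() / mask.sum()
        return loss

if __name__ == "__main__":
    model = TinyMAE()
    x = torch.randn(8, 16, 32)  # toy batch: 8 samples, 16 tokens each
    print(model(x).item())
```

Note that nothing in the sketch references a specific modality: swapping the linear tokenizer for an image patchifier or a point-cloud grouper changes the input pipeline, not the masking-and-reconstruction objective, which is the property the passage attributes to this training strategy.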