…Similarly, to combine the audio and visual modalities for unsupervised learning, existing works exploit the natural audio-visual correspondence in videos to formulate various self-supervised signals, which predict cross-modal correspondence [314], [315], align temporally corresponding representations [309], [316], [317], [318], or cluster the representations in a shared audio-visual latent space [208], [319]. Several works further explore audio, vision, and language together for unsupervised representation learning by aligning the different modalities either in a shared multi-modal latent space [310], [320] or in a hierarchical latent space for audio-vision and vision-language [308].
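As a concrete illustration of the temporal alignment signal described above, the following is a minimal sketch, not the exact formulation of any cited work: audio and video clips drawn from the same time window form positive pairs, while all other pairings in the mini-batch serve as negatives under a symmetric InfoNCE loss. The encoder outputs, embedding dimension, and temperature below are assumed for illustration.

```python
# Minimal sketch of a temporal audio-visual alignment objective
# (symmetric InfoNCE). Encoders, embedding size (256), and the
# temperature value are illustrative assumptions, not taken from
# any specific cited paper.
import torch
import torch.nn.functional as F

def audio_visual_infonce(audio_emb: torch.Tensor,
                         video_emb: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """Contrastive loss over a batch of temporally aligned (audio, video) pairs.

    audio_emb, video_emb: (batch, dim) embeddings from modality-specific
    encoders; row i of each tensor comes from the same video clip.
    """
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature                     # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are positives
    # Align in both directions (audio -> video, video -> audio) and average.
    loss_av = F.cross_entropy(logits, targets)
    loss_va = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_av + loss_va)

# Usage with stand-in embeddings for one mini-batch of 32 clips:
audio_emb = torch.randn(32, 256)   # e.g., output of an audio encoder
video_emb = torch.randn(32, 256)   # e.g., output of a video encoder
loss = audio_visual_infonce(audio_emb, video_emb)
```

Minimizing this loss pulls temporally corresponding audio and video representations together in the shared latent space while pushing apart mismatched pairs, which is the essence of the alignment-based signals surveyed above.

Open Challenges. …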