“…Joint Embedding: Joint embedding models have shown excellent performance on several multimedia tasks, e.g., cross-modal retrieval [10,19,31,38,53,54,75,77,82,84], image captioning [35,49], image classification [23,25,32], video summarization [15,58], and cross-view matching [81]. Cross-modal retrieval methods require computing similarity between two different modalities, e.g., RGB and depth.…”
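
To illustrate the similarity computation mentioned in the excerpt, the following is a minimal sketch, not the method of any cited work. It assumes each modality has already been encoded into a feature vector, and that two linear projections (the hypothetical `w_rgb` and `w_depth`, random here rather than learned) map both modalities into a shared space where cosine similarity can be compared directly. All names, dimensions, and data are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def cross_modal_similarity(rgb_feats, depth_feats, w_rgb, w_depth):
    """Project both modalities into a shared space and return cosine similarities.

    w_rgb and w_depth are hypothetical projection matrices; in practice they
    would be learned, for example with a ranking or contrastive objective.
    """
    rgb_emb = l2_normalize(rgb_feats @ w_rgb)        # (n_rgb, d_joint)
    depth_emb = l2_normalize(depth_feats @ w_depth)  # (n_depth, d_joint)
    return rgb_emb @ depth_emb.T                     # (n_rgb, n_depth)


# Synthetic stand-ins for encoder outputs (assumed shapes, random values).
rng = np.random.default_rng(0)
rgb_feats = rng.normal(size=(4, 16))    # 4 RGB queries, 16-dim features
depth_feats = rng.normal(size=(6, 12))  # 6 depth candidates, 12-dim features
w_rgb = rng.normal(size=(16, 8))        # projection into an 8-dim joint space
w_depth = rng.normal(size=(12, 8))

sim = cross_modal_similarity(rgb_feats, depth_feats, w_rgb, w_depth)

# For each RGB query, rank depth candidates from most to least similar.
ranking = np.argsort(-sim, axis=1)
```

Normalizing both embeddings before the dot product keeps every score in [-1, 1], so scores are comparable across queries regardless of the original feature dimensions of each modality.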