“…Recently there has been great interest in joint vision-language tasks, e.g., captioning [63,28,9,25,2,70,62,61,73,68,13,47,55,8,32], visual question answering [3,72,41,69,56,66,59,76,77,19,64,24,60], and cross-domain retrieval [6,5,74,36]. These tasks often rely on learned image-text embeddings.…”