“…In contrast, in this work no manually annotated visual data is involved at any stage of our approach. To avoid labelling visual data, several approaches have instead leveraged audio transcripts obtained from narrated videos via automatic speech recognition (ASR) to supervise video models for object detection [3,15,54], captioning [33,69], classification [2,42,47,86], summarization [57] or retrieval [50], using large-scale narrated video datasets such as How2 [65] or HowTo100M [50]. Others [10,30] have investigated learning from narrated videos by directly using the raw speech waveform instead of generating transcriptions.…”