In recent years, an increasing number of studies have focused on learning vocabulary from audiovisual input. They have shown that learners can pick up new words incidentally when watching TV (Peters & Webb, 2018; Rodgers & Webb, 2019). Research has also shown that on‐screen text (first language or foreign language subtitles) might increase learning gains (Montero Perez, Peters, Clarebout, & Desmet, 2014; Winke, Gass, & Sydorenko, 2010). Learning is sometimes explained in terms of the beneficial role of on‐screen imagery in audiovisual input (Rodgers, 2018). However, little is known about the effect of imagery on word learning and how it interacts with L1 subtitles and captions. This study investigates the effect of imagery in three TV viewing conditions: with L1 subtitles, with captions, and without subtitles. Data were collected from 142 Dutch‐speaking learners of English as a foreign language. A pretest‐posttest design was adopted in which learners watched a 12‐minute excerpt from a documentary. The findings show that the captions group made the largest vocabulary learning gains. Moreover, imagery was positively related to word learning: words that were depicted on screen in close proximity to their aural occurrence were more likely to be learned.