This special issue is concerned with an area that has only recently attracted the attention of SLA researchers: the effects of multimodal input (image + voice + subtitles/captions; picture + voice + text) on second language (L2) learning. As stated by Ellis and Shintani (2014, p. 7), "learning can only take place when learners are exposed to input," yet compared with the bulk of studies on written input, only a few studies have investigated the potential of multimodal input. Recent research has shown that watching (subtitled) television and exposure to (and interaction with) audiovisual material enhance learners' L2 skills (e.g., Peters & Webb, 2018; Rodgers & Webb, 2017; Sockett, 2014). Findings from studies on multimodal input are generally in line with theories of multimedia learning (e.g., Mayer, 2009). However, these studies also raise many questions as to how and when learning is promoted and which individual differences influence such learning.

MULTIMODAL INPUT

Drawing on Paivio's (1986) Dual Coding Theory, Mayer's (2014) cognitive theory of multimedia learning states that learning is more effective when information is processed in spoken as well as written mode, because learners make mental connections between the aural and visual information, provided the two are presented in temporal proximity. Examples in the domain of language learning are storybooks with pictures read aloud (e.g.