The present study examined incidental vocabulary learning (IVL) in three input modes (reading, listening, and viewing captioned videos) while considering word-related factors (i.e., word occurrence frequency and word relevance) and learner-related factors (i.e., English proficiency and prior vocabulary knowledge). Participants were 123 second-year university students learning English as a foreign language (EFL) in China. They were randomly assigned to four groups: three experimental groups (reading, listening, and viewing with captions) and a control group. A YouTube video served as the source material for the three experimental groups, in which the participants encountered 48 target words. The control group took the tests without receiving the intervention. Learning outcomes were measured with two tests, one of word meaning recall and one of word meaning recognition, taking into account word occurrence frequency, word relevance, vocabulary knowledge, and proficiency. The results indicate that the caption viewing condition was the most effective for the incidental learning of meaning recall and recognition, followed by the reading and listening conditions. The findings also suggest that frequency, word relevance, proficiency, and vocabulary knowledge significantly influenced the IVL outcomes on the immediate posttest. However, their impact was less straightforward on the delayed posttest. Implications of these findings are discussed.