Working memory (WM) may be an essential component of incidental vocabulary learning and retention from captioned videos. However, how WM affects young learners' incidental vocabulary learning under different types of captions remains unclear. Using a between-subjects design, the present study examines how two types of WM (phonological short-term memory and complex WM) affect incidental vocabulary learning and retention under three types of captioning: (1) glossed captions (GCs), (2) full captions (FCs), and (3) keyword captions (KCs). A total of 125 young learners (M age = 12.17, SD = 1.06) watched four videos and completed two vocabulary tests, each administered as a pretest, a posttest, and a delayed test. After the treatment, participants completed two WM tasks: (1) an operation span test measuring complex WM, and (2) a nonword repetition test measuring phonological short-term memory. The findings reveal that (1) among the captioning types, GCs led to the best outcomes in incidental vocabulary learning and retention, and (2) phonological WM was a stronger predictor of incidental vocabulary learning and retention than complex WM. Phonological and complex WM may have different predictive effects on incidental vocabulary learning and retention under different types of captioning. Implications of these results are discussed.