Although multimodal input has the potential to produce sounder learning outcomes, it also carries the risk of causing cognitive overload, which makes it difficult to determine its exact effects on second language (L2) phrase learning. This study tests the efficacy of multimodal input on L2 phrase learning. It adopts a mixed-methods approach, drawing on both quantitative and qualitative data. The experimental design is a 2 × 3 mixed model, with group [experimental group (EG) and control group (CG)] as the between-subject factor and time (pretest, midtest, and posttest) as the within-subject factor. A total of 66 participants were divided into two groups. All materials covered three aspects of phrase knowledge (form, meaning, and use), but the CG materials were unimodal in that they were offered only on paper, whereas the EG materials were multimodal in that they included pictures, audio recordings, and video clips. After the treatment, a questionnaire and a semi-structured interview were administered to the EG learners to explore their perceptions of using multimodal materials to learn L2 phrases. The results indicate that both groups made significant gains in phrase learning, but students who received multimodal input achieved significantly better results than those who received unimodal input. Moreover, the EG students had a generally positive attitude toward the use of multimodal resources. This study validates the efficacy of multimodal input for the acquisition of English phrases and shows that cognitive overload was avoided by sequencing the information.