The current study highlights the potential of multimodal large language models (MLLMs) to transform higher education by identifying key factors influencing their acceptance and effectiveness. Aligning technology features with educational needs can enhance student engagement and learning outcomes. The study examined the role of MLLMs in enhancing performance benefits among higher education students, using the task–technology fit (T-TF) theory and the artificial intelligence device use acceptance (AIDUA) model. A structured questionnaire was used to assess the perceptions of 550 Saudi university students from various academic disciplines. The data were analyzed via structural equation modeling (SEM) using SmartPLS 3.0. The findings revealed that social influence and hedonic motivation negatively affected effort expectancy for MLLMs. Effort expectancy was, in turn, negatively associated with T-TF in the learning context. In contrast, task and technology characteristics significantly influenced T-TF, which positively impacted both performance benefits and the willingness to accept the use of MLLMs. Willingness to adopt MLLMs was strongly associated with improved performance benefits. These findings can help educators strategically promote MLLM adoption and improve learning outcomes.