COVID-19 forced students to rely on online learning using multimedia tools, and multimedia learning continues to impact education beyond the pandemic. In this study, we combined behavioral, eye-tracking, and neuroimaging paradigms to identify multimedia learning processes and outcomes. College students viewed four slide-based video lectures presented with an onscreen human instructor, an onscreen animated instructor, or no onscreen instructor. Brain activity was recorded via fMRI, visual attention was recorded via eye-tracking, and learning outcomes were assessed via post-tests. Compared with no instructor presence, the onscreen presence of an instructor resulted in superior post-test performance, less visual attention on the slides, more synchronized eye movements during learning, and higher neural synchronization in cortical networks associated with socio-emotional processing and working memory. Individual variation in cognitive and socio-emotional abilities and intersubject neural synchronization revealed different levels of cognitive and socio-emotional processing across the learning conditions. The instructor-present condition evoked increased synchronization, likely reflecting extra processing demands in attentional control, working memory engagement, and socio-emotional processing. Although human and animated instructors led to comparable learning outcomes, these effects arose from a dynamic interplay between information processing and attentional distraction. These findings reflect a benefit–cost trade-off in which multimedia learning outcomes are enhanced only when the cognitive benefits motivated by the social presence of an onscreen instructor outweigh the cognitive costs of concurrent attentional distraction unrelated to learning.