The emotional stance of the instructor in an educational video can influence the learning process. We therefore tested the first link of the cognitive-affective model of e-learning, namely, whether learners can recognize emotions that an instructor expresses through voice alone. Because many learners are not native English speakers, yet most instructional videos are produced in English, we also tested for differences in emotion recognition between native and non-native speakers. We focused on the positive emotions typically conveyed in such videos: enthusiasm and calmness. Native and non-native English speakers watched 12 short video clips about wood as a building material, narrated by an instructor in different emotional tones (five clips expressed enthusiasm, five calmness, one boredom, and one frustration). Participants rated the extent to which they thought the narrator expressed specific emotions, as well as the valence and activation level of the narration, and completed an English vocabulary test. Both native and non-native speakers recognized the intended emotions, with the exception of frustration, demonstrating the power of vocal prosody to convey emotion in a multimedia learning scenario. Native speakers rated the enthusiastic clips more positively than non-native speakers did, suggesting a subtle difference in how the two groups perceive emotions expressed through voice.