Panoramic video and virtual reality technologies create learning environments that give learners an “immersive” experience. In recent years, the design of panoramic video for immersive learning environments has become an increasingly popular topic in teacher education and educational research. However, few studies have examined how the on-screen elements of a panoramic virtual learning environment should be designed. This experimental study therefore uses eye-tracking technology to investigate how panoramic video elements guide learners in a panoramic virtual learning environment. Participants (n = 90) were randomly assigned to one of six conditions: (1) no caption + live interpretation, (2) no caption + AI interpretation, (3) 120-degree caption + live interpretation, (4) 120-degree caption + AI interpretation, (5) static follow caption + live interpretation, and (6) static follow caption + AI interpretation. The results show that, across narration methods, live interpretation is more likely than AI interpretation to attract learners’ attention and to produce better learning emotions and experiences. Across caption presentation methods, the effects on learners’ attention, learning emotions, and experiences ranked in the order no caption > 120-degree caption > static follow caption. Finally, based on these findings, the study proposes rules for optimizing the screen design of panoramic virtual learning environments, which offer new ideas for designing and developing panoramic video teaching resources.