Learning concentration, as a crucial factor influencing learning outcomes, provides the basis for learners’ self-regulation and for teachers’ instructional adjustments and intervention decisions. However, existing research on learning concentration recognition rarely integrates cognitive, emotional, and behavioral features, and the combination of interaction and vision data for recognition requires further exploration. Moreover, data collection with a head-mounted display differs from that in a traditional classroom or online learning setting. It is therefore vital to explore a recognition method for learning concentration based on multi-modal features in VR environments. This study proposes such a method, which integrates interaction data (interactive tests, text, and clickstream) with vision data (pupil, facial expression, and eye gaze) to measure learners’ concentration in VR environments across the cognitive, emotional, and behavioral dimensions. The experimental results demonstrate that the proposed method, by fusing interaction and vision data to comprehensively represent these three dimensions of learning concentration, achieves higher accuracy than single-dimension and single-data-type recognition. In addition, learners with higher concentration levels were found to achieve better learning outcomes, and learners’ perceived immersion was an important factor influencing their concentration.
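
To make the fusion idea concrete, the sketch below shows one common way to combine interaction and vision features at the feature level: concatenating per-learner feature vectors from each modality before a standard classifier. This is a minimal illustration, not the authors' pipeline; the feature dimensions, the label scheme, and the choice of classifier are all assumptions for demonstration.

```python
# Hypothetical sketch of feature-level fusion for concentration recognition.
# Assumes pre-extracted per-learner feature vectors for each modality; all
# dimensions, names, and the classifier are illustrative, not the paper's
# actual method.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120  # hypothetical number of learner samples

# Interaction features: e.g., test scores, text statistics, clickstream counts.
interaction = rng.normal(size=(n, 8))
# Vision features: e.g., pupil diameter, facial expression, eye-gaze statistics.
vision = rng.normal(size=(n, 12))
# Concentration labels, e.g., low / medium / high encoded as 0 / 1 / 2.
labels = rng.integers(0, 3, size=n)

# Feature-level fusion: concatenate the modalities into one vector per sample,
# so the classifier sees cognitive, emotional, and behavioral cues together.
fused = np.concatenate([interaction, vision], axis=1)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, fused, labels, cv=5)
print(f"cross-validated accuracy: {scores.mean():.3f}")
```

With real features in place of the random placeholders, comparing this fused model against classifiers trained on a single modality (interaction only or vision only) reproduces the kind of single-type-versus-multi-modal comparison the abstract reports.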