"Lack of supervision" is a particularly challenging problem in E-learning or distance learning environments. A wide range of research efforts and technologies have been explored to alleviate its impact by monitoring students' engagement, such as emotions or learning behaviors. However, current research still lacks multi-dimensional computational measures for analyzing learners' engagement from the interactions that occur in digital learning environments. In this paper, we propose an integrated framework to identify learning engagement from three facets: affect, behavior, and cognitive state, which are conveyed by learners' facial expressions, eye movement behaviors, and overall performance during short video learning sessions. To recognize these three states of learners, three channels of data are recorded: 1) video/image sequences captured by a camera; 2) eye movement information from a non-intrusive and cost-effective eye tracker; and 3) click-stream data from the mouse. Based on these modalities, we design a multi-channel data fusion strategy that concatenates the time-series features of the three channels within the same time segment to predict course learning performance. We also present a new method to make self-reported annotations more reliable without relying on external observers' verification. To validate the approach and methods, 46 participants were invited to attend a representative online course consisting of short videos in our designed learning environment. The results demonstrate the effectiveness of the proposed framework and methods in monitoring learning engagement. More importantly, a prototype system was developed to detect learners' emotional and eye-behavioral engagement in real time, as well as to predict their learning performance after they complete each short video course.

INDEX TERMS E-learning, engagement recognition, multi-channel data fusion, learning performance prediction.
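To make the fusion strategy concrete, the sketch below illustrates segment-wise feature concatenation across the three channels. It is a minimal sketch under stated assumptions: the function name, feature dimensions, and NumPy-based formulation are illustrative choices, not the paper's actual implementation.

```python
import numpy as np

def fuse_channels(face_feats, eye_feats, click_feats):
    """Concatenate per-segment features from three channels.

    Each argument is assumed to have shape (n_segments, d_channel),
    with segments aligned to the same time windows across channels.
    Returns an array of shape (n_segments, d_face + d_eye + d_click).
    """
    # All channels must be summarized over the same number of time segments.
    assert face_feats.shape[0] == eye_feats.shape[0] == click_feats.shape[0]
    # Fusion by concatenation along the feature axis, one row per segment.
    return np.concatenate([face_feats, eye_feats, click_feats], axis=1)

# Example with hypothetical dimensions: 10 time segments, with 16-d facial
# expression, 8-d eye movement, and 4-d click-stream features per segment.
fused = fuse_channels(np.random.rand(10, 16),
                      np.random.rand(10, 8),
                      np.random.rand(10, 4))
print(fused.shape)  # (10, 28)
```

The fused per-segment vectors would then serve as input to a downstream predictor of course learning performance, as described in the abstract.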