Deep 3-dimensional (3D) Convolutional Networks (ConvNets) have shown promising performance on video recognition tasks because of their powerful spatio-temporal information fusion ability. However, their extremely intensive requirements on memory access and computing power prohibit them from being used in resource-constrained scenarios, such as portable and edge devices. In this paper, we first propose a two-stage fully separable block that significantly compresses the model size with little accuracy loss. We then propose a feature enhancement approach, named temporal residual gradient, to improve the performance of the compressed model on video tasks; it provides higher accuracy, faster convergence, and better robustness. Moreover, to further decrease the computing workload, we propose a hybrid Fast Algorithm that substantially reduces the computational complexity of convolutions. These methods are combined to design a lightweight and efficient ConvNet for video recognition tasks. Experiments on a popular dataset report a 2.3× compression rate, a 6.8× workload reduction, and a 2% top-1 accuracy gain over the state-of-the-art SlowFast model, which is already a compactly designed model. The proposed methods also adapt well to traditional 3D ConvNets, leading to a 5× more compact model, 10× less workload, and 3% higher accuracy.