According to the World Health Organization's global status report on road safety, traffic accidents are the eighth leading cause of death worldwide, and nearly one-fifth of traffic accidents were caused by driver distraction. Inspired by the well-known two-stream convolutional neural network (CNN) model, we propose a driver behavior analysis system that uses one spatial-stream ConvNet to extract spatial features and one temporal-stream ConvNet to capture the driver’s motion information. Instead of a three-dimensional (3D) ConvNet, which would suffer from a large number of parameters and the lack of a pre-trained model, two-dimensional (2D) ConvNets are used to construct the spatial and temporal streams, and both are pre-trained on the large-scale ImageNet dataset. In addition, to integrate the two modalities, feature-level fusion is applied: a fusion network is designed to combine the spatial and temporal features for further classification. Moreover, a self-compiled dataset of 10 in-vehicle actions was established. According to the experimental results, the proposed system increases the accuracy by nearly 30% compared with the two-stream CNN model with score-level fusion.
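The difference between score-level and feature-level fusion can be illustrated with a minimal sketch. It is not the system described above: the feature dimension, the random values, and the single linear layer standing in for the fusion network are all assumptions made for illustration, and the function names are hypothetical. In score-level fusion each stream classifies on its own and the class probabilities are averaged; in feature-level fusion the stream features are concatenated first and one shared classifier sees both.

```python
import math
import random


def softmax(scores):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(x, weights, bias):
    """Apply one fully connected layer: weights is a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def score_level_fusion(spatial_scores, temporal_scores):
    """Each stream classifies separately; probabilities are averaged."""
    p_spatial = softmax(spatial_scores)
    p_temporal = softmax(temporal_scores)
    return [(a + b) / 2 for a, b in zip(p_spatial, p_temporal)]


def feature_level_fusion(spatial_feat, temporal_feat, weights, bias):
    """Stream features are concatenated, then classified jointly.

    A single linear layer is an assumed stand-in for the fusion network.
    """
    fused = spatial_feat + temporal_feat  # list concatenation
    return softmax(linear(fused, weights, bias))


NUM_CLASSES = 10  # the dataset above contains 10 actions
FEAT_DIM = 4      # assumed, tiny on purpose; real ConvNet features are larger

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Placeholder features and scores; in practice these come from the ConvNets.
spatial_feat = [rng.uniform(-1, 1) for _ in range(FEAT_DIM)]
temporal_feat = [rng.uniform(-1, 1) for _ in range(FEAT_DIM)]
spatial_scores = [rng.uniform(-1, 1) for _ in range(NUM_CLASSES)]
temporal_scores = [rng.uniform(-1, 1) for _ in range(NUM_CLASSES)]

# Untrained placeholder parameters for the fusion layer.
fusion_weights = [[rng.uniform(-0.1, 0.1) for _ in range(2 * FEAT_DIM)]
                  for _ in range(NUM_CLASSES)]
fusion_bias = [0.0] * NUM_CLASSES

score_probs = score_level_fusion(spatial_scores, temporal_scores)
feature_probs = feature_level_fusion(
    spatial_feat, temporal_feat, fusion_weights, fusion_bias)
```

In the feature-level variant the classifier can weight spatial and temporal evidence jointly per class, whereas score-level averaging treats the two streams as independent votes.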