With the rapid development of stereoscopic video technology, stereoscopic video quality assessment (SVQA) has become important for advancing stereoscopic video systems. In recent years, many SVQA methods based on convolutional neural networks (CNNs) have emerged. In this paper, we propose a multi-scale feature-guided 3D convolutional neural network for SVQA, which not only uses 3D convolution to capture spatio-temporal features but also aggregates multi-scale information through a new multi-scale unit. In addition, we employ a multi-stage growing attention mechanism in the network to learn more critical deep semantic information. The proposed method is tested on two public stereoscopic video quality datasets, and the results show that it correlates highly with human visual perception and outperforms state-of-the-art methods by a large margin.
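
Agreement with human visual perception is usually quantified by correlating predicted quality scores with subjective mean opinion scores (MOS). The abstract does not name the criteria used, so the following is only a minimal sketch under the assumption that the Pearson linear correlation coefficient (PLCC) and the Spearman rank-order correlation coefficient (SROCC) are meant. The function names and the sample scores are hypothetical and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks of the values; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): the PLCC of the rank vectors."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Hypothetical predicted scores and subjective MOS values, for illustration only.
    predicted = [1.2, 2.9, 3.1, 4.8]
    mos = [1.0, 3.0, 3.5, 5.0]
    print("PLCC :", pearson(predicted, mos))
    print("SROCC:", spearman(predicted, mos))
```

Values of both coefficients closer to 1 indicate closer agreement between predicted and subjective scores. PLCC reflects linear prediction accuracy, while SROCC depends only on the ordering of the scores.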