Recently, deep learning techniques have substantially boosted the performance of salient object detection in still images. However, the salient object detection in videos by using traditional handcrafted features or deep learning features is not fully investigated, probably due to the lack of sufficient manually labeled video data for saliency modeling, especially for the data-driven deep learning. This paper proposes a novel weakly supervised approach to salient object detection in a video, which can learn a robust saliency prediction model by using very limited manually labeled data and a large amount of weakly labeled data that could be easily generated in a supervised approach. Furthermore, we propose a spatiotemporal cascade neural network (SCNN) architecture for saliency modeling, in which two fully convolutional networks are cascaded to evaluate visual saliency from both spatial and temporal cues to lead the optimal video saliency prediction. The proposed approach is extensively evaluated on the widely used challenging datasets, and the experiments demonstrate that our proposed approach substantially outperforms the state-of-the-art salient object detection models. Index Terms-Video saliency, weakly supervised learning, spatiotemporal prior fusion, cascade fully convolutional network I. INTRODUCTION S ALIENT object detection, which aims to identify the objects or regions that are noticeable and mostly attract human attention in an image/video, has become a research focus of computer vision for decades. It is generally as a preprocessing step to support high-level computer vision tasks, such as object segmentation, object recognition, object tracking and content-based video compression. A number of approaches have been proposed to detect salient objects. The recent approaches based on deep Convolutional Neural Networks (CNNs), e.g., [1]-[3], have substantially improved