Annotations for image segmentation are expensive and time-consuming to obtain. In contrast, object detection is generally easier in terms of both acquiring labeled training data and designing training models. In this paper, we combine unsupervised learning with a pretrained object-detection network to perform image segmentation without expensive segmentation labels. Specifically, we design a pretext task based on the sparse decomposition of object instances in videos to obtain segmentation masks of the objects, exploiting both the sparsity of object instances and the inter-frame structure of videos. To identify the "right" object more accurately, we use a pretrained object-detection network to provide the locations of object instances, and propose a three-branch Object Location Segmentation (OLSeg) model with a bounding-box prior. The model is trained on videos and captures the foreground, background, and segmentation mask of a single image. The performance gain stems from the sparsity of object instances (the foreground and background in our experiments) and the provided location information (the bounding-box prior), which together produce a comprehensive and robust visual representation of the input. Experimental results demonstrate that the proposed model substantially improves performance on various image segmentation benchmarks.
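The decomposition described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration, not the paper's actual implementation: it assumes the three branches output a foreground image, a background image, and a soft mask, that the frame is reconstructed as their composite, and that the training objective combines a reconstruction term, an L1 sparsity term on the mask, and a penalty on mask mass outside the detected bounding box (the function names, loss form, and weights `lam_sparse`/`lam_box` are all assumptions for illustration).

```python
import numpy as np

def composite(fg, bg, mask):
    """Reconstruct a frame as mask * foreground + (1 - mask) * background."""
    return mask * fg + (1.0 - mask) * bg

def olseg_loss(frame, fg, bg, mask, box, lam_sparse=0.1, lam_box=1.0):
    """Hypothetical OLSeg-style objective (illustrative only):
    reconstruction error + L1 sparsity on the mask + bounding-box prior."""
    recon = composite(fg, bg, mask)
    rec_loss = np.mean((frame - recon) ** 2)      # pixel reconstruction term
    sparse_loss = np.mean(np.abs(mask))           # sparsity of the instance mask
    # Bounding-box prior: penalize mask activation outside the detected box.
    y0, x0, y1, x1 = box
    outside = mask.copy()
    outside[y0:y1, x0:x1] = 0.0                   # zero out the in-box region
    box_loss = np.mean(outside)
    return rec_loss + lam_sparse * sparse_loss + lam_box * box_loss

# Toy example: a 4x4 frame whose object lies exactly inside box (1,1)-(3,3).
mask = np.zeros((4, 4)); mask[1:3, 1:3] = 1.0
fg = 2.0 * np.ones((4, 4)); bg = np.zeros((4, 4))
frame = composite(fg, bg, mask)
loss = olseg_loss(frame, fg, bg, mask, box=(1, 1, 3, 3))
```

With a perfect reconstruction and a mask fully contained in the box, only the sparsity term contributes, so the loss reduces to `lam_sparse * mean(|mask|)`.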