Pixels covered by clouds and cloud shadows in a remote sensing image degrade image quality, hinder image interpretation, and compromise subsequent applications. In this paper, we propose a novel deep-learning-based cloud removal method that automatically reconstructs the invalid pixels using auxiliary information from multi-temporal images. The innovation of our method lies in its feature extraction and loss functions, which are built on a novel gated convolutional network (GCN) rather than a series of standard convolutions. The network takes the current cloudy image, a recent cloudless image, and the cloud mask as input, and it requires no external training samples: the clean pixels in the bi-temporal images serve as natural training samples in a self-training process. In our feature extraction, gated convolutional layers are introduced for the first time to discriminate cloudy pixels from clean pixels, compensating for a standard convolutional layer's inability to make this distinction. Our multi-level constrained joint loss function, which combines an image-level loss, a feature-level loss, and a total variation loss, enforces local and global consistency at both shallow and deep feature levels. The total variation loss is introduced into the deep-learning-based cloud removal task for the first time to suppress color and texture discontinuities along the outlines of the repaired cloud regions. On the WHU cloud dataset, which covers diverse land cover scenes and different imaging conditions, our experiments demonstrated that the proposed method consistently reconstructed cloud and cloud shadow pixels in various remote sensing images and outperformed several mainstream deep-learning-based methods and a conventional method on every indicator by a large margin.
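
To make the gating mechanism concrete, the following is a minimal PyTorch sketch of a gated convolutional layer in the common formulation (a feature branch modulated by a sigmoid-gated mask branch), which allows the network to softly down-weight responses from cloudy pixels. The layer structure follows the standard gated-convolution design; the channel counts, activation, and input layout are illustrative assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class GatedConv2d(nn.Module):
    """Gated convolution: output = act(feature(x)) * sigmoid(gate(x)).

    The learned gate lets the layer suppress features coming from
    cloudy/invalid pixels, which a plain convolution cannot do.
    Hyperparameters here are illustrative, not from the paper.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1, dilation=1):
        super().__init__()
        padding = dilation * (kernel_size - 1) // 2
        self.feature = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                                 padding, dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, kernel_size, stride,
                              padding, dilation)
        self.act = nn.LeakyReLU(0.2, inplace=True)

    def forward(self, x):
        return self.act(self.feature(x)) * torch.sigmoid(self.gate(x))

# Hypothetical input: cloudy image (3 ch), recent cloudless image (3 ch),
# and a binary cloud mask (1 ch), concatenated along the channel axis.
x = torch.randn(1, 7, 256, 256)
layer = GatedConv2d(7, 64)
y = layer(x)  # -> (1, 64, 256, 256)
```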
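
Likewise, the joint loss can be sketched under plausible assumptions: an anisotropic total variation term (the sum of absolute differences between neighboring pixels) penalizes color and texture discontinuities, and is combined with image-level and feature-level terms. The L1 form of each term and the weights below are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def total_variation_loss(img):
    """Anisotropic TV: mean absolute horizontal/vertical differences,
    penalizing abrupt transitions (e.g., along repaired cloud outlines)."""
    dh = (img[..., :, 1:] - img[..., :, :-1]).abs().mean()
    dv = (img[..., 1:, :] - img[..., :-1, :]).abs().mean()
    return dh + dv

def joint_loss(pred, target, pred_feats, target_feats,
               w_img=1.0, w_feat=0.1, w_tv=0.01):
    """Multi-level joint loss: image-level + feature-level + TV terms.
    Weights and the L1 form of each term are illustrative assumptions."""
    l_img = F.l1_loss(pred, target)
    l_feat = sum(F.l1_loss(pf, tf)
                 for pf, tf in zip(pred_feats, target_feats))
    l_tv = total_variation_loss(pred)
    return w_img * l_img + w_feat * l_feat + w_tv * l_tv
```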