In this paper, we propose a self-supervised contrastive learning method for learning video feature representations. Traditional self-supervised contrastive learning methods train the model with constraints derived from anchor, positive, and negative data pairs: different samplings of the same video are treated as positives, and clips from different videos are treated as negatives. Because spatio-temporal information is important for video representation, we make the temporal constraints stricter by introducing intra-negative samples. In addition to samples from different videos, the negative set is extended with clips from the anchor video itself whose temporal relations have been broken. With the proposed Inter-Intra Contrastive (IIC) framework, we can train spatio-temporal convolutional networks to learn feature representations from videos. Strong data augmentations, residual clips, and a projection head are used to construct an improved version, IICv2. Three kinds of intra-negative generation functions are proposed, and extensive experiments with different network backbones are conducted on benchmark datasets. Without using pre-computed optical flow data, IICv2 outperforms the previous IIC by a large margin in top-1 video retrieval accuracy: 19.4 percentage points on UCF101 (from 36.8% to 56.2%) and 5.2 percentage points on HMDB51 (from 15.5% to 20.7%). For video recognition, improvements of over 3 percentage points are also obtained on both datasets. Discussions and visualizations validate that IICv2 captures better temporal clues and indicate the potential mechanism behind this.
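
To make the idea of intra-negative samples concrete, the following is a minimal NumPy sketch and not the implementation used in the paper. The abstract does not name the three generation functions or the exact loss, so the two functions shown here (frame shuffling and frame repeating) and the InfoNCE-style loss are assumptions chosen for illustration; the names `shuffle_frames`, `repeat_frame`, and `info_nce`, as well as the temperature value, are hypothetical.

```python
import numpy as np


def shuffle_frames(clip, rng):
    """Assumed intra-negative: permute frames along time (clip is T x H x W x C)."""
    t = clip.shape[0]
    perm = rng.permutation(t)
    # Re-draw until the frame order actually changes.
    while t > 1 and np.array_equal(perm, np.arange(t)):
        perm = rng.permutation(t)
    return clip[perm]


def repeat_frame(clip, rng):
    """Assumed intra-negative: repeat one random frame T times, removing motion."""
    idx = rng.integers(clip.shape[0])
    return np.repeat(clip[idx:idx + 1], clip.shape[0], axis=0)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one anchor embedding (D,), one positive (D,),
    and K negatives (K, D). Inter- and intra-negatives are simply stacked."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    a, p, n = normalize(anchor), normalize(positive), normalize(negatives)
    # Index 0 holds the positive similarity; the rest are negative similarities.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))
```

In a training step under these assumptions, the anchor clip, a second sampling of the same video (the positive), clips from other videos (inter-negatives), and `shuffle_frames(anchor)` or `repeat_frame(anchor)` (intra-negatives) would all be passed through the same encoder, and the resulting embeddings fed to `info_nce`. Because an intra-negative shares its appearance with the anchor, the encoder can only separate the two by attending to temporal order or motion.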