Recently, object segmentation of remote sensing images has achieved great progress in many fields, such as transportation, natural resources, and ecology. Most existing works perform object segmentation in a fully supervised manner. However, training models in this manner usually requires crafting large-scale annotations, which is expensive and time-consuming. In this paper, a novel semi-supervised network for object segmentation of remote sensing images is proposed, which is trained with only a small amount of labeled data and a relatively larger amount of unlabeled data. Rather than using two networks with the same architecture as in previous semi-supervised works, we exploit two networks with different architectures, i.e., a CNN and a Transformer, as the cross-supervised models. Moreover, three types of loss functions, namely a fully-supervised loss, a cross-supervised loss, and a consistency loss, are introduced to enhance the model's robustness. The effectiveness of the proposed method is evaluated on two annotated remote sensing datasets, where it outperforms several state-of-the-art semi-supervised approaches.
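To make the training objective concrete, the following is a minimal sketch of how the three losses could be combined for a CNN/Transformer cross-supervised pair. It assumes cross-entropy supervision on labeled data, hard pseudo-label cross-supervision on unlabeled data, and an MSE consistency term between soft predictions; the model handles (`cnn`, `transformer`), weighting coefficients (`lambda_cps`, `lambda_con`), and loss choices are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: `cnn` and `transformer` are two segmentation networks
# with different backbones but identical output shape (N, C, H, W) of logits.
def semi_supervised_losses(cnn, transformer, x_l, y_l, x_u,
                           lambda_cps=1.0, lambda_con=0.1):
    # Fully-supervised loss: both branches are trained on the labeled batch.
    p_l_c, p_l_t = cnn(x_l), transformer(x_l)
    loss_sup = F.cross_entropy(p_l_c, y_l) + F.cross_entropy(p_l_t, y_l)

    # Cross-supervised loss: each branch learns from the other branch's
    # hard pseudo-labels on the unlabeled batch (gradients are blocked on
    # the pseudo-label side via detach + argmax).
    p_u_c, p_u_t = cnn(x_u), transformer(x_u)
    pseudo_c = p_u_c.detach().argmax(dim=1)
    pseudo_t = p_u_t.detach().argmax(dim=1)
    loss_cps = F.cross_entropy(p_u_c, pseudo_t) + F.cross_entropy(p_u_t, pseudo_c)

    # Consistency loss: encourage the two branches to agree on soft predictions.
    loss_con = F.mse_loss(F.softmax(p_u_c, dim=1), F.softmax(p_u_t, dim=1))

    return loss_sup + lambda_cps * loss_cps + lambda_con * loss_con
```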