In this article, we propose a multimodal co-learning framework for building change detection. The framework jointly trains a Siamese bitemporal image network and a height difference map (HDiff) network using labeled source data and unlabeled target data pairs. Within this framework, three co-learning combinations (vanilla co-learning, fusion co-learning, and detached fusion co-learning) are proposed and investigated with two types of co-learning loss functions. Our experimental results demonstrate that the proposed methods can exploit unlabeled target data pairs and thereby improve the performance of single-modal neural networks on the target data. In addition, our synthetic-to-real experiments demonstrate that the recently published synthetic dataset SMARS can be used in real change detection scenarios, where the best result reaches an F1 score of 79.29%.
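The joint training idea summarized above can be illustrated with a minimal sketch: both networks receive a supervised loss on labeled source pairs, while a consistency term ties their predictions together on unlabeled target pairs. The specific terms used here (binary cross-entropy and a mean squared consistency term), the function names, and the `weight` parameter are assumptions chosen for illustration; they are not necessarily the co-learning loss functions evaluated in this article.

```python
import math


def bce(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between change probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def consistency(probs_a, probs_b):
    """Mean squared difference between two networks' predictions.

    Assumed stand-in for a co-learning term on unlabeled data.
    """
    return sum((a - b) ** 2 for a, b in zip(probs_a, probs_b)) / len(probs_a)


def co_learning_loss(img_src, hdiff_src, labels_src, img_tgt, hdiff_tgt, weight=1.0):
    """Hypothetical joint objective for the image and HDiff networks.

    img_src / hdiff_src: per-pixel change probabilities on labeled source data.
    img_tgt / hdiff_tgt: per-pixel change probabilities on unlabeled target data.
    """
    # Supervised part: each network is trained against the source labels.
    supervised = bce(img_src, labels_src) + bce(hdiff_src, labels_src)
    # Unsupervised part: the two modalities should agree on the target data.
    unsupervised = consistency(img_tgt, hdiff_tgt)
    return supervised + weight * unsupervised
```

In this sketch the target labels never appear in the objective, so the only signal from the target domain is the agreement between the two modalities, which is what allows unlabeled target pairs to contribute to training.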