The fusion of optical and synthetic aperture radar (SAR) images is a promising approach to accurately extracting urban impervious surface (IS). Previous studies have shown that feature-level fusion of optical and SAR images can significantly improve IS extraction. However, they generally fuse features by simple layer stacking, which ignores the interaction between optical and SAR images. In addition, most of the features they use are manually extracted shallow features, such as texture and geometric features, so the high-level semantic features of the images go unused. The lack of publicly available IS datasets is also considered an obstacle to the wider use of deep learning models in IS extraction. Therefore, this study first creates an open and accurate IS dataset based on optical and SAR images, and then proposes CroFuseNet, a semantic segmentation network for IS extraction based on the cross fusion of optical and SAR image features. In CroFuseNet, we design a cross fusion module (CFM) that fuses optical and SAR image features to achieve better complementarity between the two types of images. We also propose a multimodal feature aggregation (MFA) module to aggregate specific high-level features from optical and SAR images. To validate CroFuseNet, we compare it with two classical machine learning algorithms and four state-of-the-art deep learning models. The proposed model achieves the highest accuracy, with an overall accuracy (OA), mean intersection over union (MIoU), and F1-score of 97.77%, 0.9495, and 0.9770, respectively. The quantitative and qualitative experimental results demonstrate that the proposed model outperforms these comparison methods.
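For readers unfamiliar with the three reported metrics, the following is a minimal sketch of how OA, MIoU, and F1-score are commonly computed from a pixel-wise confusion matrix for binary IS versus non-IS segmentation. It is an illustration under stated assumptions, not the authors' evaluation code: the function names `confusion_matrix` and `segmentation_metrics` are hypothetical, class 1 is assumed to denote impervious surface, and F1 is assumed to be computed for the IS class only (the paper may average it differently).

```python
from typing import List, Tuple


def confusion_matrix(y_true: List[int], y_pred: List[int],
                     n_classes: int = 2) -> List[List[int]]:
    """Build an n_classes x n_classes matrix; rows = truth, columns = prediction."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def segmentation_metrics(cm: List[List[int]],
                         positive: int = 1) -> Tuple[float, float, float]:
    """Return (OA, MIoU, F1) as fractions in [0, 1].

    Assumption: `positive` is the impervious-surface class, and F1 is
    reported for that class only.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)

    # Overall accuracy: correctly labelled pixels over all pixels.
    oa = sum(cm[i][i] for i in range(n)) / total

    # Per-class IoU = TP / (TP + FP + FN), averaged over classes that occur.
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp
        fp = sum(cm[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    miou = sum(ious) / len(ious)

    # F1 for the positive class = 2TP / (2TP + FP + FN).
    tp = cm[positive][positive]
    fn = sum(cm[positive]) - tp
    fp = sum(cm[r][positive] for r in range(n)) - tp
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    return oa, miou, f1


if __name__ == "__main__":
    # Toy flattened label maps (1 = impervious surface, 0 = other).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    oa, miou, f1 = segmentation_metrics(confusion_matrix(y_true, y_pred))
    print(f"OA={oa:.2%}  MIoU={miou:.4f}  F1={f1:.4f}")
```

Note that the abstract reports OA as a percentage and MIoU and F1-score as fractions; the sketch returns all three as fractions and formats OA as a percentage only when printing.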