Accurate recognition and extraction of rural residential land (RRL) is important for the scientific planning, utilization, and management of rural land. Very-high-resolution (VHR) unmanned aerial vehicle (UAV) images and deep learning techniques can provide the data and methodological support for this task. However, RRL is a complex land use assemblage whose features appear at different scales in VHR images, and rural scenes contain complex impervious surfaces and backgrounds such as natural surfaces and tree shadows. How to handle multi-scale features and extract accurate edges in such scenarios still requires further research. To address these problems, a novel framework named cascaded dense dilated network (CDD-Net), which combines DenseNet, ASPP, and PointRend, is proposed for RRL extraction from VHR images. The proposed framework has three advantages. First, DenseNet is used as the feature extraction network, which allows feature reuse and a better network design with fewer parameters. Second, the ASPP module handles multi-scale features more effectively. Third, PointRend is added to the model to improve segmentation accuracy at the edges. A village in a plain area of China is taken as the study area. Experimental results show that the Precision, Recall, F1 score, and Dice coefficient of our approach are 91.41%, 93.86%, 92.62%, and 0.8359, respectively, all higher than those of the other advanced models used for comparison. The approach is therefore feasible for high-precision extraction of RRL from VHR UAV images. This research could provide technical support for rural land planning and analysis and for the formulation of land management policies.
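To make the reported evaluation metrics concrete, the following minimal sketch computes Precision, Recall, F1 score, and the Dice coefficient for one predicted binary mask against one reference mask. The function names (`confusion_counts`, `segmentation_metrics`) and the toy masks are hypothetical illustrations, not taken from the study, and the sketch assumes the standard pixel-wise definitions of these metrics. Under those definitions, F1 and Dice coincide for a single pair of masks; the abstract does not state how its values were aggregated across images.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives.

    Both masks are nested lists of 0/1 with the same shape (an assumption
    of this sketch), where 1 marks a pixel labelled as RRL.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1
            elif p and not t:
                fp += 1
            elif t and not p:
                fn += 1
    return tp, fp, fn


def segmentation_metrics(pred, truth):
    """Return pixel-wise Precision, Recall, F1, and Dice as fractions in [0, 1]."""
    tp, fp, fn = confusion_counts(pred, truth)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    denom = 2 * tp + fp + fn
    dice = 2 * tp / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "dice": dice}


# Toy 2x3 masks (hypothetical data, for illustration only).
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]

metrics = segmentation_metrics(predicted, reference)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case there are two true positives, one false positive, and one false negative, so all four metrics equal 2/3.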