The utilization rate of urban land is an important indicator for evaluating urban construction, and scene recognition has significant practical value for improving it. This paper proposes a new lightweight model based on VGG16 that extracts discriminative features from remote sensing images through five convolution modules. The model uses depthwise separable convolution to reduce the number of network parameters. An adaptive pooling layer is added to remove the fixed-input-size constraint of convolutional networks, making the network insensitive to the size of the input image. A global average pooling layer aggregates the spatial information, making the model more robust to spatial transformations of the input. Training and testing are conducted on two datasets, NWPU-RESISC45 and SIRI-WHU, with 13 and 12 scene categories, respectively. Experimental results show that the proposed method outperforms the compared models in both recognition accuracy and model size.
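To illustrate why depthwise separable convolution reduces the parameter count, the following is a minimal sketch that compares weight counts for a single layer. The function names and the 256-to-512-channel, 3x3 layer are illustrative assumptions modeled on a typical VGG16-style layer, not the configuration reported in this paper, and bias terms are ignored.

```python
# Illustrative parameter-count comparison (assumed layer sizes, biases ignored).

def standard_conv_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a standard k x k convolution: every output channel
    has one k x k filter per input channel."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a depthwise separable convolution: one k x k filter
    per input channel (depthwise), then a 1 x 1 convolution that mixes
    channels (pointwise)."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 256 input channels, 512 output channels.
c_in, c_out, k = 256, 512, 3
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)

print(f"standard conv weights:            {std}")
print(f"depthwise separable conv weights: {sep}")
print(f"reduction factor:                 {std / sep:.2f}x")
```

Under these assumptions the separable layer needs roughly one ninth of the weights of the standard layer; the saving approaches a factor of k squared as the number of output channels grows.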