Remote sensing images contain important information such as airports, ports, and ships. By extracting remote sensing image features and learning the mapping between image features and textual semantic features, the content of remote sensing images can be interpreted and described, which has wide application value in military and civil fields such as national defense security, land monitoring, urban planning, and disaster mitigation. To address the complex backgrounds of remote sensing images, the limited interpretability of existing object detection models, and the difficulties of feature extraction across different network structures and layers as well as the accuracy of object classification, we propose an object detection and interpretation model based on gradient-weighted class activation mapping (Grad-CAM) and reinforcement learning. First, ResNet is used as the backbone network to extract features from remote sensing images and generate feature maps. Then, a global average pooling layer is added to obtain the weight vector corresponding to the feature maps, and the weighted feature maps are summed to produce class activation maps. Reinforcement learning is used to optimize the region proposal network, and its reward function is improved to enhance the quality of the generated proposals. Finally, network dissection analysis is applied to identify the interpretable semantic concepts learned by the model. In experiments, the average accuracy exceeds 85%. Results on a public remote sensing image description dataset show that the proposed method achieves high detection accuracy and good description performance for remote sensing images in complex environments.
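The following is a minimal sketch of the class activation mapping step described above (GAP-derived channel weights applied to ResNet feature maps). It assumes a torchvision pretrained ResNet-50 and a standard CAM formulation; the function name and the choice of backbone are illustrative, not the authors' exact implementation.

```python
# Minimal CAM sketch over ResNet feature maps (PyTorch).
# Assumption: torchvision ResNet-50 stands in for the paper's backbone.
import torch
import torch.nn.functional as F
from torchvision import models

def class_activation_map(image, class_idx):
    """Compute a class activation map for `class_idx` from the last ResNet feature maps."""
    resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()

    # Backbone: all layers up to (but excluding) global average pooling and the fc head.
    backbone = torch.nn.Sequential(*list(resnet.children())[:-2])

    with torch.no_grad():
        feats = backbone(image)                  # (1, C, H, W) feature maps
        # Global average pooling yields one weight per feature channel.
        pooled = feats.mean(dim=(2, 3))          # (1, C)
        logits = resnet.fc(pooled)               # (1, num_classes)

        # Weight each feature map by the fc weight of the target class and sum.
        w = resnet.fc.weight[class_idx]          # (C,)
        cam = torch.einsum("c,chw->hw", w, feats[0])
        cam = F.relu(cam)
        cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam, logits

# Usage: `image` is a (1, 3, H, W) tensor preprocessed with ImageNet statistics;
# the returned `cam` can be upsampled to the input size and overlaid on the image.
```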