Thangka is a distinctive form of painting in China, and automated image description can facilitate a deeper appreciation and understanding of it. Thangka images contain diverse semantic objects at varying scales with distinctive spatial distributions, and Transformer-based encoding layers can lose key image features. To address these challenges, this paper proposes a Thangka image description method based on Multi-scale and Multi-level Aggregation (MMA). In the encoding stage, we employ asymmetric convolutions to strengthen the ability of the convolutional layers to capture spatial information. We also use a pyramid pooling module to integrate multi-scale contextual information from both global and local regions of Thangka images, yielding semantically rich feature representations. In the decoding stage, a multi-level aggregation network aggregates features from different encoding layers, making better use of both the semantic information in higher-level encoding layers and the content information in lower-level encoding layers, which alleviates the information loss. Experimental results show that the proposed model achieves promising performance on the Thangka dataset. Compared with the NIC model, it improves BLEU-4 by 26.7% and METEOR by 0.9%, and it generates more accurate descriptions.
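As an illustration of the pyramid pooling idea mentioned above, the following minimal sketch average-pools a single 2D feature map into grids of several sizes, so that the coarsest grid summarizes the global context and finer grids retain local regions. It is not the implementation used in MMA: the function names `adaptive_avg_pool` and `pyramid_pool`, the bin sizes `(1, 2, 4)`, and the toy 4x4 input are all assumptions chosen for illustration, and the upsampling and concatenation steps that a full pyramid pooling module would typically apply are omitted.

```python
import math


def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D feature map (list of rows) into a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # Row range covered by the i-th bin (floor start, ceil end).
        r0, r1 = (i * h) // bins, math.ceil((i + 1) * h / bins)
        row = []
        for j in range(bins):
            # Column range covered by the j-th bin.
            c0, c1 = (j * w) // bins, math.ceil((j + 1) * w / bins)
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(fmap, bin_sizes=(1, 2, 4)):
    """Return one pooled grid per bin size, from global (1x1) to local."""
    # The bin sizes are an illustrative assumption, not taken from the paper.
    return {b: adaptive_avg_pool(fmap, b) for b in bin_sizes}


# Toy 4x4 feature map holding the values 0.0 .. 15.0 in row-major order.
fmap = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = pyramid_pool(fmap)
```

Here `pyramid[1]` holds a single global average, `pyramid[2]` holds one average per image quadrant, and `pyramid[4]` reproduces the toy input at full resolution.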