Fault diagnosis of power equipment is critical to the stability of the power grid. However, labeled data that is spatiotemporal, multi-scale, multi-domain, and low-noise is difficult to obtain, which hinders effective fault analysis and diagnosis. To address this issue, we propose hierarchical dynamic aggregation graph (HDAG) modeling, a novel approach for self-supervised fault diagnosis of power transformers using vibration data. First, HDAG models the spatial and temporal correlations within the fault vectors and then converts them into time–frequency images for visualization. Second, the methodology integrates our proposed fault diagnosis approach, which comprises the ST-sparse swin transformer and a multi-domain transformer fusion module. The ST-sparse swin transformer incorporates soft threshold modules that retain relevant information while discarding irrelevant information. The multi-domain transformer fusion module exploits intra-domain and inter-domain signal characteristics to achieve a comprehensive feature representation. Finally, we present case studies based on experimental data that demonstrate the feasibility and effectiveness of our approach. Comparative evaluations against eight state-of-the-art techniques validate the improved information representation and diagnostic capabilities of the proposed approach.
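The soft threshold operation mentioned above can be illustrated with a minimal sketch. This is the generic soft thresholding (shrinkage) function, not the ST-sparse swin transformer module itself: the abstract does not say how the threshold is set or where the module sits in the network, so the fixed threshold `tau`, the function name `soft_threshold`, and the sample values are all assumptions made for illustration.

```python
def soft_threshold(values, tau):
    """Generic soft thresholding: shrink each value toward zero by tau.

    Values whose magnitude is at most tau are set to zero (treated as
    irrelevant); larger values keep their sign and lose tau in magnitude.
    tau is a fixed, hand-picked constant here (an assumption); a learned
    or adaptive threshold would replace it in a trained model.
    """
    shrunk = []
    for v in values:
        magnitude = abs(v) - tau
        if magnitude <= 0:
            shrunk.append(0.0)         # small response: discarded
        elif v > 0:
            shrunk.append(magnitude)   # large positive response: kept, shrunk
        else:
            shrunk.append(-magnitude)  # large negative response: kept, shrunk
    return shrunk


# Hypothetical feature responses, chosen only to show the behavior.
features = [-2.0, -0.3, 0.0, 0.4, 1.5]
shrunk = soft_threshold(features, tau=0.5)
print(shrunk)
```

Entries with magnitude below the threshold are zeroed, while the remaining entries are preserved with reduced amplitude, which is the sense in which a soft threshold keeps relevant information and suppresses the rest.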