Weakly supervised semantic segmentation (WSSS) using only image-level labels is a challenging task. Most existing methods utilize class activation map (CAM) to generate pixel-level pseudo labels for supervised training. However, the gap between classification and segmentation hinders the network from obtaining more comprehensive semantic information and generating more accurate pseudo masks for segmentation. To address this issue, we propose TSD-CAM, a transformer-based self distillation (SD) method that utilizes CAM similarity. TSD-CAM uses the similarity between CAMs generated from different views as a distillation target, providing additional supervision for the network and narrowing the gap between classification and segmentation. SD supervision allows the network to acquire more semantic information and refine CAMs to generate higher precision pseudo-labels. In addition, we propose the adaptive pixel refinement module, which adaptively refines and adjusts images based on pixel variations, further improving the precision of pseudo labels. Our method is a fully end-to-end single-stage approach that achieves stateof-the-art 71.3% mIoU on PASCAL VOC 2012 and 42.9% mIoU on the MS COCO 2014 dataset, and the proposed TSD-CAM can significantly outperform other singlestage competitors and achieve comparable performance with state-of-the-art multistage methods. Meanwhile, the effectiveness of our method is demonstrated by a large number of ablation experiments, and we provide a new way of thinking to solve the problems of WSSS. Our code is available at: https://github.com/pipizhum/ TSD-CAM.