In recent years, there has been a surge of research on medical image segmentation with hybrid CNN-Transformer architectures. Most of these studies leverage the attention mechanism of the ViT to overcome the limitation of CNN architectures in capturing long-range dependencies. However, these hybrid approaches also have potential drawbacks. First, because the Transformer's attention mechanism relies heavily on global information, its computational cost grows sharply on high-resolution input images. Second, the convolutional and attention mechanisms in hybrid models differ in how they extract information and contribute to decisions, which poses a challenge for the interpretability of the convolutional component. Our proposed model, DWHA, addresses these limitations and outperforms state-of-the-art models on a range of medical image segmentation tasks, including abdominal multiorgan segmentation, automatic cardiac diagnosis, neurostructure segmentation, and skin lesion segmentation. Specifically, DWHA surpassed the previous state-of-the-art baseline by 0.57% on the abdominal multiorgan segmentation dataset, by 1.17% on the neurostructure segmentation dataset, and by 0.91% on the skin lesion segmentation dataset. These improvements suggest that DWHA may become a preferred model for medical image segmentation.
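To make the cost concern concrete, a standard complexity estimate is helpful (an illustrative sketch under common ViT assumptions, not a figure measured for DWHA). For an input of height $H$, width $W$, patch size $P$, and embedding dimension $d$, the number of tokens is $N = HW/P^2$, and global self-attention scales as
\[
\mathcal{O}\!\left(N^{2} d\right) \;=\; \mathcal{O}\!\left(\frac{H^{2} W^{2}}{P^{4}}\, d\right),
\]
so doubling both $H$ and $W$ quadruples $N$ and increases the attention cost roughly sixteenfold, whereas a $k \times k$ convolution with $C_{\mathrm{in}}$ input and $C_{\mathrm{out}}$ output channels costs $\mathcal{O}(HW\,k^{2}\,C_{\mathrm{in}}C_{\mathrm{out}})$, which is linear in the number of spatial positions.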