Skin cancer poses a significant health risk and can affect multiple layers of the skin, including the epidermis, dermis, and hypodermis. Melanoma, a severe type of skin cancer, originates from the abnormal proliferation of melanocytes in the epidermis. Current methods for skin lesion segmentation rely heavily on large annotated datasets, which are costly and time-consuming to produce and demand specialized expertise from dermatologists. To address these limitations and improve logistics in dermatology practices, we present a self-supervised strategy for accurate skin lesion segmentation in dermatological images that eliminates the need for manual annotations. Unlike traditional approaches, our method integrates a hybrid CNN/Transformer model that harnesses the complementary strengths of both architectures: the CNN encoder extracts local semantic information, while the Transformer module captures long-range contextual dependencies, enabling a comprehensive understanding of image content. To dynamically recalibrate the representation space, we introduce a contextual attention module that combines hierarchical features with pixel-level information. By incorporating both local and global dependencies among image pixels, we perform a clustering process that organizes the image content into a meaningful space. As a further contribution, we incorporate a spatial consistency loss that promotes the gradual merging of clusters with similar representations, thereby improving segmentation quality. Experimental evaluations on two publicly available skin lesion segmentation datasets show that our method outperforms existing unsupervised and self-supervised strategies and achieves state-of-the-art performance on this challenging task.
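The abstract does not state the form of the spatial consistency loss. As a rough illustration only, one common formulation in unsupervised segmentation penalizes the L1 difference between the cluster responses of adjacent pixels, so that neighboring pixels are nudged toward the same cluster. The sketch below assumes this formulation; the function names `spatial_consistency_loss` and `cluster_labels`, and the `(H, W, K)` response layout, are hypothetical and not taken from the paper.

```python
import numpy as np


def spatial_consistency_loss(responses: np.ndarray) -> float:
    """Mean L1 difference between cluster responses of adjacent pixels.

    responses: array of shape (H, W, K) holding, for each pixel, a score
    for each of K clusters (assumed layout, H >= 2 and W >= 2).
    """
    # Differences between vertically adjacent pixels: shape (H-1, W, K).
    dy = np.abs(responses[1:, :, :] - responses[:-1, :, :])
    # Differences between horizontally adjacent pixels: shape (H, W-1, K).
    dx = np.abs(responses[:, 1:, :] - responses[:, :-1, :])
    return float(dy.mean() + dx.mean())


def cluster_labels(responses: np.ndarray) -> np.ndarray:
    """Assign each pixel to the cluster with the highest response."""
    return responses.argmax(axis=-1)


# A perfectly uniform response map incurs no penalty ...
uniform = np.zeros((4, 4, 3))
uniform[..., 0] = 1.0

# ... while a single pixel assigned to a different cluster is penalized.
noisy = uniform.copy()
noisy[2, 2] = [0.0, 1.0, 0.0]

print(spatial_consistency_loss(uniform))
print(spatial_consistency_loss(noisy))
```

Under this assumed formulation, minimizing the term alongside a clustering objective discourages isolated pixels from keeping their own cluster, which is one way the gradual merging of similar clusters described above could arise.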