Modality-specific Adaptive Scaling Method for Cross-modal Retrieval

Chen, Baitao; Ke, Xiao

doi:10.1109/icicml57342.2022.10009863

Cited by 1 publication

(1 citation statement)

References 13 publications

“…Various steps are taken to enhance the effectiveness and accuracy of the ViT algorithm, including adjustment of the model architecture, initialization of weights, and tuning of hyperparameters after various iterations of model training. Adaptive scaling [29] was introduced in the ViT architecture to enhance model performance through the process of image serialization [30]. Another novelty of this work is the introduction of contrastive learning and adaptive scaling in the ViT model.…”

Section: Methods (mentioning)

confidence: 99%

Vision Transformer for Skin Cancer Identification Based on Contrastive Learning and Adaptive-Scale Fragmentation

Naeem, Yang, Sharif et al. 2024

Preprint

Image processing and deep learning have proven to be a breakthrough in medical image diagnosis, for example in dermoscopic image analysis for skin cancer recognition and classification. Skin cancer cases are increasing every year and pose a significant threat to health. In recent studies, convolutional neural networks (CNNs) have achieved remarkable success in classifying skin cancer images. However, CNNs are limited to extracting features from minor objects in the input dermoscopic image and fail to pinpoint significant regions. Consequently, the researchers of this study have used vision transformers (ViT), known for their robust performance in conventional classification tasks. The self-attention mechanism (SAM) aims to enhance the significance of pivotal characteristics while moderating the influence of noise-inducing features. Specifically, an enhanced transformer network architecture is introduced, and several enhancements are applied to the model to assess its effectiveness. First, a ViT network is implemented to evaluate its efficacy in identifying skin cancer. Next, adaptive-scale image fragmentation is used to process the image sequentially, emphasizing adaptive-scale features through patch embedding. Furthermore, contrastive learning is employed so that similar skin cancer data are encoded similarly while different data receive distinct encodings. The study uses the ISIC 2019 skin cancer dataset, accessible on Kaggle's official website. This dataset consists of dermoscopic images of several types of skin lesion: dermatofibroma, melanoma, actinic keratosis, basal cell carcinoma, nevus, vascular lesion, and pigmented benign keratosis. The ViT model achieved 99.66% accuracy, 94.85% precision, 93.74% recall, and a 94.52% F1-score. Three deep learning models, Inception V3, MobileNet, and ResNet-50, were also trained with a transfer learning approach for comparison with the proposed ViT model; they reached accuracies of 72%, 94.3%, and 89%, respectively. Transformer networks have shown remarkable success in natural language processing and in image analysis. These achievements establish a solid groundwork for classifying skin cancer using multimodal data. This paper is expected to interest medical researchers, computer engineers, dermatologists, and scholars across various related disciplines. Its insights promise to offer enhanced convenience for patients.
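The image serialization and patch embedding steps mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain fixed-size patches, as in a standard ViT, and does not attempt the adaptive-scale fragmentation the paper proposes. The function names `image_to_patches` and `embed_patches`, the 224×224 input, the patch size of 16, and the embedding width of 64 are all hypothetical choices made for illustration only.

```python
import numpy as np


def image_to_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Serialize an (H, W, C) image into a sequence of flattened square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image height and width must be divisible by patch_size")
    rows, cols = h // patch_size, w // patch_size
    # (rows, p, cols, p, C) -> (rows, cols, p, p, C) -> (rows * cols, p * p * C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Project each flattened patch to an embedding vector with a linear map."""
    return patches @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))            # stand-in for a dermoscopic image
tokens = image_to_patches(image, 16)         # 14 x 14 = 196 patches of 16 * 16 * 3 = 768 values
weight = rng.normal(scale=0.02, size=(768, 64))  # untrained, randomly initialized projection
embeddings = embed_patches(tokens, weight, np.zeros(64))
print(tokens.shape, embeddings.shape)
```

In a trained model the projection weights are learned and positional information is added to the patch embeddings before the transformer layers; both are omitted here to keep the sketch short.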
