Background: Cutaneous melanoma, the most serious skin cancer, constitutes a considerable health burden in fair-skinned populations. Its increasing incidence highlights the need for automated computer-aided approaches that support dermatologists in melanoma detection. Recent advances in artificial intelligence (AI), together with the availability of public dermoscopy image datasets, have enabled data-driven models with high predictive performance, thus contributing to dermatology research. Despite these advances, most image datasets suffer from class imbalance, where a few classes have numerous samples while others are under-represented, which degrades the performance of AI-based models.
Methods: We propose combining ensemble feature selection (FS) methods with data augmentation based on a conditional tabular generative adversarial network (CTGAN) to enhance melanoma identification in imbalanced datasets. We used dermoscopy images from two public datasets, PH2 and Derm7pt, which contain melanoma and not-melanoma lesions. To capture intrinsic information from skin lesions, we applied two feature extraction (FE) approaches: handcrafted features and embedding features. For the former, we extracted color, geometric, and first-, second-, and higher-order texture features; for the latter, we obtained embeddings using ResNet-based models. To mitigate the high dimensionality of the extracted features, ensemble FS with filter methods was applied and evaluated. For data augmentation, we progressively varied the imbalance ratio (IR), which is related to the number of synthetic samples created, and evaluated its impact on the predictive results. To interpret the predictive models, we used SHAP, bootstrap resampling statistical tests, and UMAP visualizations.
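To illustrate how a target IR can translate into a number of synthetic samples, the following minimal Python sketch may help. It assumes that IR is defined as the ratio of minority to majority samples (so IR=1.0 means balanced classes), which the abstract does not state explicitly. The function names `synthetic_samples_needed` and `progressive_schedule` and the class counts (40 minority, 160 majority) are hypothetical and chosen for illustration only; they are not taken from the study's code or data splits.

```python
def synthetic_samples_needed(n_minority: int, n_majority: int, target_ir: float) -> int:
    """Synthetic minority samples required to reach target_ir.

    Assumption: IR = n_minority / n_majority, so 1.0 means balanced classes.
    """
    if not 0.0 < target_ir <= 1.0:
        raise ValueError("target_ir must be in (0, 1]")
    # Minority class size that yields the requested ratio.
    target_minority = round(target_ir * n_majority)
    # Never return a negative count if the data already exceed the target.
    return max(0, target_minority - n_minority)


def progressive_schedule(n_minority: int, n_majority: int, steps: int = 10) -> dict:
    """Map each target IR (1/steps, 2/steps, ..., 1.0) above the current IR
    to the number of synthetic minority samples it requires."""
    current_ir = n_minority / n_majority
    schedule = {}
    for i in range(1, steps + 1):
        target_ir = i / steps
        if target_ir > current_ir:
            schedule[target_ir] = synthetic_samples_needed(
                n_minority, n_majority, target_ir
            )
    return schedule


# Illustrative counts only (hypothetical, not the study's actual splits).
schedule = progressive_schedule(n_minority=40, n_majority=160)
for ir, n_synthetic in schedule.items():
    print(f"IR={ir:.1f}: generate {n_synthetic} synthetic minority samples")
```

Under these assumptions, each entry of the schedule would give the number of minority-class rows to request from the fitted generative model before retraining and re-evaluating the classifier at that IR.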
Results: Experimental results showed that combining CTGAN with linear models yielded the best predictive performance, with AUCROC values of 86% (LASSO, IR=0.8) and 71% (support vector machine, IR=0.9) for PH2 and Derm7pt, respectively. We also found that ‘melanoma’ lesions were mainly characterized by color-related features, while ‘not melanoma’ lesions were characterized by texture features.
Conclusions: Our results demonstrate the effectiveness of synthetic data in building generalizable models for melanoma prediction. Our work contributes to skin lesion research by supporting melanoma identification and the interpretation of the main characteristics associated with melanoma.