Classifying skin lesions from dermatoscopy images is challenging in dermatology because lesion types vary widely in visual appearance and the class distribution in the dataset is imbalanced. This study proposes a Hybrid Vision Transformer–ConvNeXt architecture that combines the global attention capability of the Vision Transformer (ViT) with the spatial feature representation of ConvNeXt to improve the classification of skin lesion images on the HAM10000 dataset. To address the class imbalance, the study applies Multi-Task Focal Loss, an auxiliary classifier, and a Weighted Random Sampler. In addition, Medical Test-Time Augmentation (TTA) is used at inference to improve the stability of predictions. The model is trained with a two-stage strategy (head training followed by full fine-tuning) and optimized with AdamW and Cosine Annealing Warm Restarts. The proposed model achieves a validation F1-Score of 0.8723, which rises to 0.90 after TTA, surpassing the standalone ViT and ConvNeXt baselines. These findings indicate that integrating ViT–ConvNeXt with the loss strategy and medical TTA significantly improves skin lesion classification performance, and that the approach has the potential to be applied as a clinical diagnosis support system.
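As a rough illustration of the imbalance-handling ideas named above, the following minimal sketch shows a standard single-sample multi-class focal loss and the inverse-frequency sample weights typically fed to a weighted sampler. It is not the study's Multi-Task Focal Loss, whose exact form is not given here; the function names, the focusing parameter `gamma=2.0`, and the optional per-class weights `alpha` are assumptions chosen for illustration.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over a list of raw class scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0, alpha=None):
    """Standard focal loss for one sample (illustrative, not the paper's variant).

    gamma (assumed 2.0) down-weights well-classified samples;
    alpha is an optional, hypothetical list of per-class weights.
    With gamma=0 and alpha=None this reduces to cross-entropy.
    """
    p_t = softmax(logits)[target]
    p_t = max(p_t, 1e-12)  # guard against log(0)
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


def inverse_frequency_weights(labels):
    """Per-sample weights 1 / count(class), so each class carries equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]
```

In a PyTorch pipeline, per-sample weights of this kind could be passed to `torch.utils.data.WeightedRandomSampler` so that minority lesion classes are drawn more often during training; whether the study computes its weights this way is not stated.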
Copyrights © 2025