Melanoma, the deadliest type of skin cancer, requires early and accurate detection for successful treatment. Traditional diagnostic techniques, which rely on visual inspection and dermoscopy, are frequently subjective and prone to human error. Automated melanoma detection exemplifies the integration of multimedia, a truly interdisciplinary field that melds visual data processing, human-computer interaction, and digital technologies. This study presents two multi-modal architectures, a multi-modal transformer network (MMTN) and a convolutional attention mechanism multi-modal (CAMM) model, that combine clinical data with dermoscopy images to enhance melanoma detection. The models outperform other approaches by leveraging a transformer-based encoder for image processing, dense layers for clinical data, and, in the second proposed architecture, spatial attention. We evaluate the models on the full ISIC 2019 dataset, showing significant improvements in accuracy and AUC; both architectures achieve high accuracy and AUC while running on CPU. Our findings highlight the potential of multi-modal learning architectures to enhance clinical decision-making and diagnostic accuracy in dermatology. To our knowledge, this is the first implementation combining MobileNet, transformer encoder attention, and clinical data fusion on the ISIC 2019 dataset, representing a significant advance in the automated categorization of skin malignancies.
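The fusion idea described above can be sketched numerically. The snippet below is a minimal, illustrative NumPy mock-up, not the paper's implementation: it assumes a 7x7x32 feature map standing in for a MobileNet-style image encoder's output, an 8-dimensional clinical vector, and randomly initialized weights. It shows the CAMM-style pipeline of spatial attention over the image features, a dense layer on the clinical branch, and late fusion by concatenation before a classification head.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical inputs: a 7x7x32 feature map standing in for an image
# encoder's output, and 8 clinical features (e.g. age, sex, lesion site).
feat = rng.standard_normal((7, 7, 32))
clinical = rng.standard_normal(8)

# Spatial attention: pool across channels (mean and max), project to a
# single-channel map, squash to (0, 1), and reweight each spatial location.
pooled = np.concatenate(
    [feat.mean(axis=-1, keepdims=True), feat.max(axis=-1, keepdims=True)],
    axis=-1,
)                                                # (7, 7, 2)
w_attn = rng.standard_normal(2) * 0.1
attn = sigmoid(pooled @ w_attn)[..., None]       # (7, 7, 1) attention map
attended = feat * attn                           # reweighted features

# Image branch: global average pooling over the attended map.
img_vec = attended.mean(axis=(0, 1))             # (32,)

# Clinical branch: one dense layer with ReLU.
W_clin = rng.standard_normal((8, 16)) * 0.1
clin_vec = np.maximum(clinical @ W_clin, 0.0)    # (16,)

# Late fusion: concatenate both branches, then a linear head with softmax
# over two classes (melanoma vs. other).
fused = np.concatenate([img_vec, clin_vec])      # (48,)
W_out = rng.standard_normal((48, 2)) * 0.1
logits = fused @ W_out
probs = np.exp(logits) / np.exp(logits).sum()
```

All layer sizes here are placeholders; in practice the image branch would be a pretrained CNN or transformer encoder and the weights would be learned end-to-end, but the data flow (attention, per-branch encoding, concatenation, classification) matches the fusion scheme the abstract describes.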
Copyright © 2026