This study evaluates and compares Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for high-resolution image classification. The dataset consists of high-resolution images that are preprocessed, augmented, and split into training, validation, and test sets. ResNet50 and EfficientNet serve as the CNN baselines, while a Vision Transformer, which relies on a self-attention mechanism, serves as the comparative model. Performance is evaluated using accuracy, precision, recall, and F1-score, together with training and inference time. The results indicate that the Vision Transformer achieves better classification performance than the CNNs, with an accuracy of up to 93.85%. The CNNs, however, are more computationally efficient, with lower training and inference times. Increasing the image resolution improves the performance of both model families, but at a higher computational cost, particularly for the Vision Transformer. The study thus highlights a trade-off between accuracy and efficiency, suggesting that model selection should be aligned with the requirements of the specific application.
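The stronger effect of resolution on the Vision Transformer can be illustrated by counting the tokens that self-attention has to compare. The sketch below is a rough illustration rather than the configuration used in this study. It assumes a patch size of 16 pixels and one extra class token, as in the common ViT-B/16 setup. The function names and the resolutions 224, 384, and 512 are hypothetical choices made for the example.

```python
def num_patches(height: int, width: int, patch_size: int = 16) -> int:
    """Number of non-overlapping patches a ViT cuts from one image.

    patch_size=16 is an assumption (ViT-B/16 style), not a value from the study.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    return (height // patch_size) * (width // patch_size)


def attention_pairs(height: int, width: int, patch_size: int = 16) -> int:
    """Token pairs scored by one self-attention head in one layer.

    Self-attention compares every token with every token, so the count
    grows with the square of the sequence length.
    """
    tokens = num_patches(height, width, patch_size) + 1  # +1: assumed class token
    return tokens * tokens


if __name__ == "__main__":
    base = attention_pairs(224, 224)
    for side in (224, 384, 512):  # illustrative resolutions only
        pairs = attention_pairs(side, side)
        print(f"{side}x{side}: {num_patches(side, side)} patches, "
              f"{pairs} token pairs ({pairs / base:.1f}x the 224 case)")
```

Under these assumptions, raising the side length from 224 to 384 pixels increases the pixel count by a factor of about 2.9 but the number of attention pairs by a factor of about 8.6. The cost of a convolutional layer, by contrast, grows roughly in proportion to the pixel count.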
Copyright © 2025