Diabetic Retinopathy (DR) is a complication of diabetes that can cause blindness if not detected early. Convolutional neural networks (CNNs) struggle to capture scattered lesions because of their limited receptive fields, while standard Vision Transformers incur quadratic self-attention cost and are therefore computationally expensive. The objective of this study is to develop an approach that captures long-range spatial dependencies while remaining computationally efficient for resource-limited clinical settings. A Swin Transformer-Tiny with a shifted window-based hierarchical self-attention mechanism was implemented on the APTOS 2019 dataset (3,663 retinal images), with pre-processing (CLAHE, gamma correction, Gaussian filtering) and data augmentation. The model was trained using SGD with a CosineAnnealingLR schedule and evaluated on accuracy, precision, recall, and F1-score, with a focus on minimizing false negatives. Swin Transformer-Tiny achieved an accuracy of 84.99%, precision of 84.89%, and recall of 84.99%, surpassing EfficientNet-B0 by 1.32% and ResNet50 by 5.60% in F1-score. The shifted-window attention mechanism reduced false negatives by 1.28% compared to the conventional CNN baselines while maintaining linear computational complexity. These results show that the hierarchical self-attention of the Swin Transformer improves DR detection sensitivity by overcoming the receptive-field limitations of CNNs, while retaining the computational efficiency needed for clinical deployment.
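A minimal sketch of the pre-processing pipeline named above (CLAHE, gamma correction, Gaussian filtering), assuming OpenCV. Applying CLAHE to the LAB luminance channel, and the clip limit, gamma value, and kernel size, are illustrative assumptions, not values taken from the paper.

```python
import cv2
import numpy as np

def preprocess_fundus(image_bgr, clip_limit=2.0, gamma=1.2, kernel=(5, 5)):
    # CLAHE on the luminance channel to enhance local lesion contrast
    lab = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=(8, 8))
    enhanced = cv2.cvtColor(cv2.merge((clahe.apply(l), a, b)), cv2.COLOR_LAB2BGR)
    # Gamma correction via a lookup table (gamma > 1 brightens mid-tones here)
    table = np.array([(i / 255.0) ** (1.0 / gamma) * 255
                      for i in range(256)]).astype("uint8")
    corrected = cv2.LUT(enhanced, table)
    # Gaussian filtering to suppress acquisition noise
    return cv2.GaussianBlur(corrected, kernel, 0)
```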
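The training setup stated in the abstract (Swin Transformer-Tiny, SGD, CosineAnnealingLR) maps directly onto standard PyTorch/torchvision components. The sketch below assumes torchvision's pretrained `swin_t`; the learning rate, momentum, and `T_max` are illustrative assumptions, not the paper's hyperparameters. The five output classes correspond to the APTOS 2019 severity grades (0 = no DR through 4 = proliferative DR).

```python
import torch
import torch.nn as nn
from torchvision.models import swin_t, Swin_T_Weights

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Swin Transformer-Tiny with its classifier head resized to the
# five APTOS 2019 severity grades.
model = swin_t(weights=Swin_T_Weights.IMAGENET1K_V1)
model.head = nn.Linear(model.head.in_features, 5)
model = model.to(device)

# SGD with cosine annealing, as stated in the abstract; these
# hyperparameter values are assumptions for illustration only.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = nn.CrossEntropyLoss()

def train_one_epoch(loader):
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # anneal the learning rate once per epoch
```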
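For the reported metrics, a scikit-learn evaluation sketch is shown below. Weighted averaging and the per-class false-negative count (row sum of the confusion matrix minus its diagonal) are assumptions; the abstract does not state which averaging scheme produced its percentages.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="weighted", zero_division=0)
    # Per-class false negatives, reflecting the study's emphasis
    # on not missing DR cases.
    cm = confusion_matrix(y_true, y_pred)
    false_negatives = cm.sum(axis=1) - cm.diagonal()
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "false_negatives": false_negatives}
```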