Facial recognition technologies are increasingly applied across many fields to automate tasks that support human activity. However, existing frameworks often rely on multi-stage model pipelines, which increases computational complexity. This study compares two deep learning architectures, Inception-ResNet-V1 and Swin Transformer, each implemented as an end-to-end classifier on 100 identity classes from the CASIA-WebFace dataset. Faces are first detected with a Multi-task Cascaded Convolutional Network (MTCNN). The Swin Transformer achieves a precision of 97.16%, compared with 96.35% for Inception-ResNet-V1. The corresponding F1-scores of 96.7% and 95.79% indicate that both models balance precision and recall well across a large number of classes. Beyond accuracy, both models show low inference latency on GPU: 13.91 ms for the Swin Transformer and 15.04 ms for Inception-ResNet-V1. The practical contribution is a simpler biometric identification pipeline that removes the need for separate feature extraction and distance matching modules. These results suggest that the end-to-end approach is promising for everyday applications, including high-security authentication and large-scale automated surveillance, where robustness and computational efficiency are critical. Nevertheless, further optimization remains necessary for such demanding environments.
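As an illustration of how the reported precision and F1-score relate for a multi-class classifier, the following is a minimal sketch of per-class and macro-averaged metrics. It assumes macro averaging, which the abstract does not specify; the function name `macro_scores` and its arguments are hypothetical and not taken from the study.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1).

    Assumption: metrics are averaged with equal weight per class
    (macro averaging). The study may have used a different scheme.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()

    # Count true positives, false positives, and false negatives per class.
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but the sample was not p
            fn[t] += 1  # sample of class t was missed

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Under macro averaging, the F1-score is the mean of the per-class F1 values, so it is not in general the harmonic mean of the macro precision and macro recall. An F1-score close to the precision, as reported here, indicates that recall is at a similar level across classes.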
Copyright © 2026