Detecting malicious or adversarial images, for example in security and surveillance systems, is an important problem in computer vision. Convolutional neural networks (CNNs) have been the go-to architecture for image classification and object detection for many years because they exploit the spatial hierarchies present in images, but they now face stiff competition from Vision Transformers (ViTs). Using benchmark datasets containing a combination of adversarial and clean images, this study compares the two models on (i) their ability to detect hostile images, (ii) their generalization to unseen datasets, and (iii) their overall computational efficiency. Although ViTs are more computationally expensive, we demonstrate that they generalize exceptionally well and outperform CNNs on pattern recognition tasks that demand robustness, especially under adversarial perturbations. In contrast, CNNs are faster at inference and less likely to overfit on small datasets. These results highlight the effectiveness of ViTs relative to CNNs when confronting hostile images. The findings clarify the trade-offs between the two architectures and point to hybrid approaches and future enhancements in adversarial defense for hostile image detection.
Copyrights © 2025