The past few years have seen an explosive and profound revolution in the field of digital image processing, in which Transformer-based architectures have come to dominate a wide range of tasks, replacing their long-standing convolutional counterparts. The self-attention mechanism in Transformer models, which originated in natural language processing, captures long-range spatial relationships in images much more effectively than the inherently limited receptive fields of Convolutional Neural Networks (CNNs). In this paper, we present a systematic review of Transformer architectures for digital image processing from 2020 to 2026, covering key foundational models such as Vision Transformer (ViT), Swin Transformer, DeiT, and BEiT, together with their numerous variants. We trace the development of these models from simple image classification to complex tasks, including object detection, semantic and instance segmentation, image restoration, medical imaging, and generative image synthesis, and we identify four major trends in architectural design: purely Transformer-based vision models, CNN-Transformer hybrid architectures, hierarchical windowed-attention networks, and diffusion-Transformer fusion models. We also provide a structured comparative analysis of 42 influential methods on 18 benchmark datasets, covering their performance trajectories, computational and memory trade-offs, and emerging best practices in model design.
Finally, we discuss open challenges, such as the quadratic computational cost of standard attention, the need for large-scale pre-training data, and limited domain generalization, and we summarize future directions, e.g., more efficient attention mechanisms, tighter integration of multi-modal information, and lightweight Transformer designs for edge and resource-constrained devices. This review is intended as a rigorous and timely reference for researchers and practitioners interested in improving visual intelligence with Transformer-based methods.
Copyright © 2026