Transformer models have significantly advanced deep learning by introducing parallel processing and enabling the modeling of long-range dependencies. Despite their performance gains, their high computational and memory demands hinder deployment in resource-constrained environments such as edge devices and real-time systems. This review analyzes and compares Transformer architectures by categorizing them into encoder-only, decoder-only, and encoder-decoder variants and by examining their applications in natural language processing (NLP), computer vision (CV), and multimodal tasks. Representative models, including BERT, GPT, T5, ViT, and MobileViT, are selected for their architectural diversity and relevance across domains. Core components, including self-attention mechanisms, positional encoding schemes, and feed-forward networks, are dissected using a systematic review methodology, supported by a visual framework that improves clarity and reproducibility. Performance comparisons are discussed using standard evaluation metrics such as accuracy, F1-score, and Intersection over Union (IoU), with particular attention to trade-offs between computational cost and model effectiveness. Lightweight models such as DistilBERT and MobileViT are analyzed for their deployment feasibility. Major challenges, including quadratic attention complexity, hardware constraints, and limited generalization, are examined alongside solutions such as sparse attention mechanisms, model distillation, and hardware accelerators. In addition, ethical aspects, including fairness, interpretability, and sustainability, are critically reviewed in relation to Transformer adoption in sensitive domains. This study offers a domain-spanning overview and proposes practical directions for future research aimed at building scalable, efficient, and ethically aligned Transformer-based systems suited to mobile, embedded, and healthcare applications.