Background: Recent developments in deep learning have made it possible to generate visually convincing deepfake images, raising serious concerns about the reliability and security of digital media. The central challenge is detecting these sophisticated manipulations while handling the imbalanced datasets common in deepfake detection research. This research designs a robust deepfake image classification model based on the Vision Transformer (ViT) architecture to differentiate between authentic and manipulated images. The main objectives are to: (1) adapt and fine-tune a pre-trained Vision Transformer for binary classification, (2) evaluate the effectiveness of Random Oversampling in addressing class imbalance while preventing data leakage, and (3) assess model performance using comprehensive metrics.

Methods: A pre-trained Vision Transformer model (Deep-Fake-Detector-v2-Model) was adapted and fine-tuned on a dataset of 190,335 images. The dataset was first divided into training and testing subsets in an 80:20 ratio; to overcome class imbalance, a Random Oversampling strategy was then applied exclusively to the training set, so that duplicated samples could not leak into the test set. During the training phase, data augmentation techniques such as image rotation, sharpness variation, and pixel normalization were employed. The model was trained for four epochs with a learning rate of 1×10⁻⁶ and a batch size of 32.

Results: Experimental evaluation shows that the proposed model achieves a classification accuracy of 94.46% on the test set. The model attains a precision of 97.56% for fake images and 91.74% for real images, with corresponding recall rates of 91.21% and 97.72% respectively. The F1-score reaches 94.46% for both classes, indicating balanced performance.
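The split-then-oversample order described in the Methods is what prevents data leakage: the test set is carved out before any sample duplication. A minimal stdlib-only sketch of that pipeline, where toy integer ids stand in for images and all function names are illustrative rather than the paper's actual code:

```python
import random

def split_train_test(samples, test_frac=0.2, seed=0):
    """Shuffle and split BEFORE any oversampling (80:20 by default)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(samples) * (1 - test_frac))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for item in samples:                      # item is (sample_id, label)
        by_class.setdefault(item[1], []).append(item)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced dataset: 80 "real" vs 20 "fake" samples (ids stand in for images).
data = [(i, "real") for i in range(80)] + [(100 + i, "fake") for i in range(20)]
train, test = split_train_test(data)   # 1) split first
train = random_oversample(train)       # 2) oversample the training set only
```

Because oversampling only duplicates items already inside the training split, every test image remains unseen during training, which is the leakage-prevention property the study relies on.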
Novelty: This research presents a novel application of Vision Transformer architecture for deepfake detection, combining efficient transfer learning with strategic oversampling to handle imbalanced datasets while preventing data leakage. The study demonstrates that ViT-based models can effectively capture subtle manipulation artifacts in deepfake images, achieving superior performance compared to traditional convolutional neural network approaches.
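The balanced per-class figures reported in the Results can be sanity-checked against the standard precision/recall/F1 definitions. A small sketch with illustrative confusion-matrix counts (NOT the paper's actual counts), treating "fake" as the positive class:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration: 1000 fake and 1000 real test images,
# with 912 fakes correctly flagged, 23 reals misflagged as fake, 88 fakes missed.
tp, fp, fn = 912, 23, 88
p, r, f1 = precision_recall_f1(tp, fp, fn)
```

With these toy counts, precision for the fake class is 912/935 ≈ 0.975 and recall is 0.912, mirroring the pattern in the abstract where high fake-class precision pairs with somewhat lower fake-class recall; the F1-score is their harmonic mean.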
Copyright © 2026