Background: Recent developments in deep learning have made it possible to generate visually convincing deepfake images, raising serious concerns about the reliability and security of digital media. The central challenge is detecting these sophisticated manipulations while handling the imbalanced datasets common in deepfake detection research. This research designs a robust deepfake image classification model based on the Vision Transformer (ViT) architecture to differentiate between authentic and manipulated images. The main objectives are to: (1) adapt and fine-tune a pre-trained Vision Transformer for binary classification, (2) evaluate the effectiveness of Random Oversampling in addressing class imbalance while preventing data leakage, and (3) assess model performance using comprehensive metrics.

Methods: A pre-trained Vision Transformer model (Deep-Fake-Detector-v2-Model) was adapted and fine-tuned on a dataset of 190,335 images. The dataset was first divided into training and testing subsets in an 80:20 ratio; to overcome class imbalance, a Random Oversampling strategy was then applied exclusively to the training set, so that duplicated samples could not leak into the test set. During the training phase, data augmentation techniques such as image rotation, sharpness variation, and pixel normalization were employed. The model was trained for four epochs with a learning rate of 1×10⁻⁶ and a batch size of 32.

Results: Experimental evaluation shows that the proposed model achieves a classification accuracy of 94.46% on the test set. The model attains a precision of 97.56% for fake images and 91.74% for real images, with corresponding recall rates of 91.21% and 97.72% respectively. The F1-score reaches 94.46% for both classes, indicating balanced performance.
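The split-then-oversample order described in the Methods is what prevents data leakage: the test set is carved out before any sample duplication. A minimal stdlib-only sketch of that pipeline, where toy integer ids stand in for images and all function names are illustrative rather than the paper's actual code:

```python
import random

def split_train_test(samples, test_frac=0.2, seed=0):
    """Shuffle and split BEFORE any oversampling (80:20 by default)."""
    rng = random.Random(seed)
    idx = list(range(len(samples)))
    rng.shuffle(idx)
    cut = int(len(samples) * (1 - test_frac))
    return [samples[i] for i in idx[:cut]], [samples[i] for i in idx[cut:]]

def random_oversample(samples, seed=0):
    """Duplicate minority-class samples until every class matches the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for item in samples:                      # item is (sample_id, label)
        by_class.setdefault(item[1], []).append(item)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced dataset: 80 "real" vs 20 "fake" samples (ids stand in for images).
data = [(i, "real") for i in range(80)] + [(100 + i, "fake") for i in range(20)]
train, test = split_train_test(data)   # 1) split first
train = random_oversample(train)       # 2) oversample the training set only
```

Because oversampling only duplicates items already inside the training split, every test image remains unseen during training, which is the leakage-prevention property the study relies on.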
Novelty: This research presents a novel application of Vision Transformer architecture for deepfake detection, combining efficient transfer learning with strategic oversampling to handle imbalanced datasets while preventing data leakage. The study demonstrates that ViT-based models can effectively capture subtle manipulation artifacts in deepfake images, achieving superior performance compared to traditional convolutional neural network approaches.
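The balanced per-class figures reported in the Results can be sanity-checked against the standard precision/recall/F1 definitions. A small sketch with illustrative confusion-matrix counts (NOT the paper's actual counts), treating "fake" as the positive class:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts for illustration: 1000 fake and 1000 real test images,
# with 912 fakes correctly flagged, 23 reals misflagged as fake, 88 fakes missed.
tp, fp, fn = 912, 23, 88
p, r, f1 = precision_recall_f1(tp, fp, fn)
```

With these toy counts, precision for the fake class is 912/935 ≈ 0.975 and recall is 0.912, mirroring the pattern in the abstract where high fake-class precision pairs with somewhat lower fake-class recall; the F1-score is their harmonic mean.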
Copyright © 2026