The rise of social media platforms such as TikTok has introduced new challenges in content moderation, particularly the spread of offensive language and hate speech. One promising approach to this problem is automatic detection using deep learning. This study applies the Vision Transformer (ViT) to detect offensive language on TikTok from visual data in the form of comment screenshots. The dataset consists of 1,401 labeled images in two classes: offensive and non-offensive. Training ran for 50 epochs without a validation split, and evaluation used accuracy, precision, recall, and F1-score. The model performed strongly, reaching an accuracy of 99.93%, a precision of 0.9979, a recall of 1.000, and an F1-score of 1.000 at the 40th epoch, and it remained stable through the end of training. These findings indicate that ViT can extract discriminative visual features from image-based comments even without access to the raw text. The approach is particularly relevant to TikTok, where comments often circulate in visual formats such as thumbnails, screenshots, or reaction videos. This research opens opportunities for image-based offensive language detection systems that strengthen content moderation across diverse visual formats. Further work should use a larger dataset and a more systematic train/validation/test split to assess the model's generalization capability.
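To make the described pipeline concrete, the following is a minimal sketch of fine-tuning a pretrained ViT for binary classification of comment screenshots. The paper does not disclose its framework or hyperparameters, so the use of torchvision's ViT-B/16, the `screenshots/` directory layout, the batch size, and the learning rate are all assumptions for illustration, not the authors' actual setup.

```python
# Illustrative sketch (not the paper's code): fine-tune a pretrained ViT
# to classify comment screenshots as offensive vs. non-offensive.
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms
from sklearn.metrics import precision_recall_fscore_support

device = "cuda" if torch.cuda.is_available() else "cpu"

# ViT-B/16 expects 224x224 RGB inputs normalized with ImageNet statistics.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical layout: screenshots/offensive/*.png, screenshots/non_offensive/*.png
dataset = datasets.ImageFolder("screenshots", transform=preprocess)
loader = DataLoader(dataset, batch_size=32, shuffle=True)

# Load a pretrained ViT-B/16 and replace its classification head
# with a two-class output layer.
model = models.vit_b_16(weights=models.ViT_B_16_Weights.DEFAULT)
model.heads.head = nn.Linear(model.heads.head.in_features, 2)
model = model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed lr
criterion = nn.CrossEntropyLoss()

for epoch in range(50):  # the paper trains for 50 epochs
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

# Compute the paper's metrics; with no validation split, evaluation
# here runs on the training data, mirroring the setup it describes.
model.eval()
preds, truths = [], []
with torch.no_grad():
    for images, labels in loader:
        logits = model(images.to(device))
        preds += logits.argmax(dim=1).cpu().tolist()
        truths += labels.tolist()
prec, rec, f1, _ = precision_recall_fscore_support(
    truths, preds, average="binary")
acc = sum(p == t for p, t in zip(preds, truths)) / len(truths)
print(f"acc={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```

Because evaluation without a held-out split measures fit rather than generalization, the larger dataset and systematic splitting recommended above would let the same loop report metrics on unseen data instead.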