bit-Tech
Vol. 8 No. 2 (2025): bit-Tech

Comparative Analysis of IndoBERT, IndoBERTweet, and XLM-RoBERTa for Detecting Online Gambling Comments on YouTube

Kevin Iansyah (Universitas Pembangunan Nasional "Veteran" Jawa Timur)
Afina Lina Nurlaili (Universitas Pembangunan Nasional "Veteran" Jawa Timur)
Muhammad Muharrom Al Haromainy (Universitas Pembangunan Nasional "Veteran" Jawa Timur)



Article Info

Publish Date
10 Dec 2025

Abstract

The proliferation of online gambling promotions in YouTube comment sections poses significant challenges for content moderation on Indonesian digital platforms. Although transformer models have proven effective for various Indonesian-language NLP tasks, no systematic comparative evaluation exists for detecting online gambling promotions on YouTube, nor has prior research explored model sensitivity to hyperparameters in this context. This study identifies the optimal transformer model and configuration for detecting Indonesian-language online gambling promotion comments on YouTube. A total of 26,455 YouTube comments were collected from February to July 2025 and stratified into a balanced training set (18,926 comments), a balanced validation set (3,340 comments), and an imbalanced testing set (4,189 comments; 28.05% promotions, 71.95% non-promotions) reflecting realistic platform conditions. Nine fine-tuning experiments were conducted with three transformer models (IndoBERT, IndoBERTweet, XLM-RoBERTa) at three learning rates (1e-5, 2e-5, 3e-5). Evaluation employed accuracy, precision, recall, F1-score, and AUC-ROC. IndoBERT with learning rate 1e-5 achieved the best performance (F1-score 99.57%, recall 99.49%), outperforming IndoBERTweet (F1-score 98.58%) and XLM-RoBERTa (F1-score 99.28%). Interestingly, the model pre-trained on formal corpora (IndoBERT) proved more effective than the social-media model (IndoBERTweet), indicating that gambling promotion language tends to be structured despite appearing in informal contexts. IndoBERT demonstrated the greatest stability across learning rate variations (standard deviation 0.0011), while XLM-RoBERTa offered the fastest inference time (2.48 ms) and the best performance-efficiency balance. These findings yield practical recommendations for automated content moderation on Indonesian social media platforms: IndoBERT for maximum-accuracy scenarios and XLM-RoBERTa for large-scale real-time deployment.
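The binary classification metrics reported in the abstract (accuracy, precision, recall, F1-score) can be sketched in pure Python as below. The example labels are hypothetical illustrations only, not drawn from the paper's dataset; label 1 stands for a gambling-promotion comment, 0 for a non-promotion comment.

```python
# Minimal sketch of the abstract's evaluation metrics for a binary
# promotion/non-promotion task. The y_true/y_pred labels below are
# hypothetical and chosen only to illustrate the computation.

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical predictions on a small imbalanced sample, mirroring the
# promotion-minority class distribution of the paper's test set.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # 1 = promotion, 0 = non-promotion
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
# → {'accuracy': 0.8, 'precision': 0.666..., 'recall': 0.666..., 'f1': 0.666...}
```

F1 is the harmonic mean of precision and recall, which is why the paper reports it alongside recall for the imbalanced test set: accuracy alone would be dominated by the 71.95% non-promotion majority class.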

Copyright © 2025






Journal Info

Abbrev

bt

Publisher

Subject

Computer Science & IT

Description

The bit-Tech journal was developed to accommodate the scientific work of lecturers and students, covering both research articles and literature studies. It is hoped that this journal will increase knowledge and the exchange of scientific ...