The rise of user-generated content on social media and live-streaming platforms has intensified the spread of spam, particularly online gambling (Judi Online) promotions, which remain prevalent in Indonesian comment sections. This study investigates the effectiveness of various machine learning (ML) and deep learning (DL) approaches to classifying such spam content in Bahasa Indonesia. We compare five models: Support Vector Machine (SVM), Random Forest (RF), a CNN-based model, IndoBERT, and a custom lightweight transformer named Wordformer. While IndoBERT achieves the highest performance across all metrics, it comes with high computational demands. Wordformer, in contrast, offers a strong balance between accuracy and efficiency: it achieves an accuracy and macro F1-score of 0.9975, surpassing SVM (0.9578) and Random Forest (0.9729), while requiring a much smaller model size and far fewer multiply-add operations than IndoBERT. An extensive ablation study further explores the architectural and training design choices that influence Wordformer’s performance. The findings suggest that lightweight transformer models can offer practical, scalable solutions for spam detection in low-resource language settings without the need for large pretrained backbones.