The effectiveness of cyberbullying detection depends on the availability of sufficient, diverse, and contextually rich training data, which is often limited in low-resource languages such as Indonesian. To address dataset limitations, researchers have extensively explored data augmentation (DA) as a promising approach to improving model performance. DA generates new data instances by applying transformations to existing data, thereby increasing both dataset size and variability. Prior studies have demonstrated that applying Easy Data Augmentation (EDA) with Support Vector Machine (SVM) classification improved cyberbullying detection performance, despite EDA's limited ability to capture semantic and contextual nuances. In this paper, we investigated DA methods for Indonesian text using the Transformer-based GPT-2 model. The augmented sentences were evaluated and filtered for context, semantics, diversity, and novelty, with similarity measures such as Euclidean Distance (ED), Cosine Similarity (CS), Jaccard Similarity (JS), and BLEU Score (BLS) used to ensure augmentation quality. Furthermore, we compared text classification performance using both SVM and the Transformer-based ALBERT model. Experimental results revealed that incorporating similarity measures and GPT-2 as a DA method failed to improve cyberbullying detection performance, potentially due to the semantic drift introduced by GPT-2 and the inadequacy of similarity measures in capturing nuanced contextual information. However, we found that ALBERT outperformed SVM as a classification model, achieving average F1-scores of 91.77% and 91.72%, respectively. This study contributes to the informatics field by exploring the potential of Transformer-based augmentation and similarity evaluation for enhancing low-resource text classification, while acknowledging the limitations in data quality and model adaptation.
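The similarity-based filtering described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it uses simple bag-of-words vectors over whitespace tokens, and the function names and threshold values (`cs_min`, `js_max`, `ed_max`) are hypothetical placeholders for whatever representation and cutoffs a real implementation would use.

```python
import math
from collections import Counter

def bow_vectors(a: str, b: str):
    """Build aligned bag-of-words count vectors (and token sets) for two sentences."""
    ta, tb = a.lower().split(), b.lower().split()
    vocab = sorted(set(ta) | set(tb))
    ca, cb = Counter(ta), Counter(tb)
    return [ca[w] for w in vocab], [cb[w] for w in vocab], set(ta), set(tb)

def cosine_similarity(va, vb):
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0

def euclidean_distance(va, vb):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))

def jaccard_similarity(sa, sb):
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def keep_augmented(original: str, augmented: str,
                   cs_min=0.5, js_max=0.9, ed_max=5.0) -> bool:
    """Hypothetical filter: keep a generated paraphrase only if it stays
    on-topic (cosine similarity high enough) yet is not a near-copy
    (Jaccard similarity below a cap, Euclidean distance bounded)."""
    va, vb, sa, sb = bow_vectors(original, augmented)
    return (cosine_similarity(va, vb) >= cs_min
            and jaccard_similarity(sa, sb) <= js_max
            and euclidean_distance(va, vb) <= ed_max)
```

Under this sketch, an exact duplicate is rejected (its Jaccard similarity of 1.0 exceeds the cap), while a paraphrase sharing most content words passes; a production pipeline would typically swap the bag-of-words vectors for sentence embeddings before applying CS and ED.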
Copyright © 2026