Clickbait uses sensational or misleading headlines to attract readers, which can degrade the quality of information in online news. This study presents a comparative evaluation of BERT and DistilBERT for detecting clickbait headlines in Indonesian using the CLICK-ID dataset. The approach examines how class imbalance influences performance by training each model on multiple dataset variants created through oversampling, undersampling, and data augmentation. Inputs are tokenized with model-specific tokenizers, and models are evaluated with accuracy, precision, recall, and F1-score. Confusion matrices are used to interpret error patterns across classes. Experimental results show that DistilBERT trained on the oversampled dataset achieves 94% accuracy, precision, recall, and F1-score, while BERT in the same oversampled setting reaches 93%. Models trained on the imbalanced data yield the lowest recall and F1-score for the clickbait class, confirming the adverse effect of skewed class distributions. The augmented and undersampled variants produce slightly lower but competitive results, in the 92% to 93% range. Error analysis shows that DistilBERT reduces the number of missed clickbait headlines (false negatives) while maintaining a similar level of false positives, producing more balanced behavior across classes. These results exceed those reported in prior CLICK-ID studies and highlight the advantage of transformer architectures combined with effective class balancing for Indonesian clickbait detection.
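As an illustration of the class-balancing and evaluation steps summarized above, the following is a minimal sketch in plain Python, not the study's implementation. It assumes binary labels (1 for clickbait, 0 for non-clickbait), uses simple random duplication as the oversampling method, and applies it to a toy training set. The function names `random_oversample`, `confusion_counts`, and `precision_recall_f1` and all example data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=0):
    """Duplicate randomly chosen examples of the smaller classes until
    every class has as many examples as the largest class.
    Assumption: simple random duplication, one of several possible schemes."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    pairs = []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        pairs.extend((text, label) for text in items + extra)
    rng.shuffle(pairs)
    return [t for t, _ in pairs], [l for _, l in pairs]


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the clickbait class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1  # a missed clickbait headline
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall, and F1, with 0.0 for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical toy training split: four non-clickbait, one clickbait.
    train_texts = ["headline a", "headline b", "headline c", "headline d", "headline e"]
    train_labels = [0, 0, 0, 0, 1]
    _, balanced_labels = random_oversample(train_texts, train_labels)
    print(Counter(balanced_labels))

    # Hypothetical gold labels and predictions on a held-out set.
    tp, fp, fn, tn = confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
    print(precision_recall_f1(tp, fp, fn))
```

In a sketch like this, balancing would normally be applied to the training split only, so that the held-out metrics still reflect the original class distribution. Tracking false negatives separately is what exposes the missed-clickbait errors that the confusion matrices are used to analyze.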
Copyright © 2025