Sentiment analysis is vital for understanding consumer perception, yet Indonesian sentiment classification is hampered by scarce labeled data and computational constraints. This study advances automatic labeling techniques and establishes performance benchmarks for Indonesian text. It compares two labeling approaches, the InSet Lexicon and an IndoBERT-based Hugging Face pipeline, on 8,447 Tapera-related opinions. The InSet Lexicon produced a highly skewed label distribution (89.66% neutral), whereas the IndoBERT pipeline produced a more balanced one (47.66% neutral, 38.43% positive, 13.91% negative). Among the modeling strategies evaluated, InSet Lexicon labels combined with TF-IDF features and either Naïve Bayes or Random Forest achieved scores above 85%. RNN-LSTM reached more than 90% accuracy but required significant computational resources. Notably, fine-tuning IndoBERT with optimal hyperparameters yielded the most robust performance, achieving 80–90% accuracy with a low validation loss of 0.1. The study concludes that for small datasets (fewer than 12,000 samples), the most effective strategies for Indonesian sentiment analysis are either InSet Lexicon labeling paired with traditional machine learning or automatic labeling with a pre-trained model followed by rigorous fine-tuning.
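
As a rough illustration of lexicon-based labeling, the sketch below sums per-word polarity weights and maps the total to a label. It is a minimal sketch under stated assumptions, not the study's procedure: the names `TOY_LEXICON`, `label_by_lexicon`, and `label_distribution` are hypothetical, the words and weights are invented for the example rather than taken from the InSet Lexicon, and whitespace tokenization with a zero threshold is an assumed scoring rule.

```python
from collections import Counter

# Hypothetical toy lexicon: words and weights are illustrative only,
# not entries copied from the InSet Lexicon.
TOY_LEXICON = {
    "bagus": 3,      # "good"
    "setuju": 2,     # "agree"
    "membantu": 2,   # "helps"
    "buruk": -3,     # "bad"
    "rugi": -4,      # "loss"
    "menolak": -2,   # "reject"
}


def label_by_lexicon(text, lexicon=TOY_LEXICON):
    """Label a text by summing the weights of its tokens (assumed rule)."""
    tokens = text.lower().split()  # naive whitespace tokenization
    score = sum(lexicon.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # also covers texts with no lexicon matches


def label_distribution(texts, lexicon=TOY_LEXICON):
    """Return the percentage of texts assigned to each label."""
    counts = Counter(label_by_lexicon(text, lexicon) for text in texts)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    samples = [
        "program ini bagus",
        "kebijakan ini rugi",
        "tapera adalah program pemerintah",
        "iuran dipotong dari gaji",
    ]
    print(label_distribution(samples))
```

Under a rule like this, any text without lexicon matches scores zero and is labeled neutral, which is one way a distribution skewed toward neutral can arise; the actual scoring used in the study may differ.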
Copyrights © 2026