This study presents a systematic evaluation of Indonesian BERT models across multiple natural language processing (NLP) tasks, including named entity recognition (NER), sentiment analysis (SA), emotion classification (EmoT), and hate speech detection (HS). Unlike prior studies that primarily focus on effectiveness metrics, this work incorporates both effectiveness (F1-Macro and accuracy) and efficiency (training time and memory usage) to provide a more comprehensive benchmark. Experimental results show that IndoRoBERTa achieves the highest overall F1-Macro (0.826), indicating strong generalization across tasks, while IndoNLU attains the highest accuracy (0.833), suggesting better performance on dominant classes. IndoLEM demonstrates superior efficiency with the lowest training time (988.68 seconds) and minimal GPU memory usage (4.00 GB), making it suitable for resource-constrained environments. In contrast, the multilingual mBERT model exhibits higher computational cost with comparatively lower efficiency. The findings highlight a trade-off between performance and computational efficiency, where monolingual Indonesian models consistently outperform multilingual models in both effectiveness and resource utilization. These results provide practical insights for selecting appropriate pretrained language models based on task requirements and computational constraints in Indonesian NLP applications. Keywords - BERT; Indonesian NLP; model efficiency; multi-task evaluation
Copyrights © 2026