This study asks whether a monolingual encoder can outperform multilingual and larger transformer models on Indonesian automatic question generation (AQG) when all models share the same training budget. We compare Indonesian bidirectional encoder representations from transformers (IndoBERT), multilingual BERT (mBERT), and BERT-large under a single fine-tuning pipeline with answer highlighting, applied to an Indonesian version of TyDiQA-GoldP and a 20,000-example translated subset of SQuAD 2.0. Generation quality is measured with the bilingual evaluation understudy metric over n-grams up to length 4 (BLEU-4), the metric for evaluation of translation with explicit ordering (METEOR), and the recall-oriented understudy for gisting evaluation with longest common subsequence (ROUGE-L). IndoBERT consistently achieves the best scores on both datasets (e.g., BLEU-4 of 19.69 on TyDiQA-GoldP and 3.79 on the SQuAD 2.0 subset) while requiring less computation than mBERT and BERT-large. These results show that language-specific pretraining offers clear advantages for Indonesian AQG, yielding higher accuracy at lower computational cost than multilingual or larger encoders. The work closes a gap in Indonesian AQG benchmarking by providing the first head-to-head comparison of IndoBERT, mBERT, and BERT-large under a shared fine-tuning and evaluation protocol. For educational assessment, the findings offer a practical recipe for building deployable AQG systems on mid-range GPUs that generate higher-quality questions without prohibitive training or inference budgets.
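As a minimal illustration of the evaluation protocol named above, the sketch below scores one generated question against one reference using standard open-source metric implementations (NLTK for BLEU-4 and METEOR, Google's rouge-score package for ROUGE-L). The whitespace tokenization, the smoothing choice, and the helper `score_question` are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of BLEU-4 / METEOR / ROUGE-L scoring, assuming
# whitespace tokenization; the paper's exact settings are not specified.
# Requires: pip install nltk rouge-score, plus nltk.download("wordnet")
# the first time METEOR is run.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer


def score_question(reference: str, hypothesis: str) -> dict:
    """Score one generated question against one reference question."""
    ref_tokens = reference.split()   # assumption: whitespace tokenization
    hyp_tokens = hypothesis.split()

    # BLEU-4: uniform weights over 1- to 4-grams; smoothing keeps short
    # questions with no matching 4-grams from collapsing to zero.
    bleu4 = sentence_bleu(
        [ref_tokens], hyp_tokens,
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )

    # METEOR takes pre-tokenized references and hypothesis in recent NLTK.
    meteor = meteor_score([ref_tokens], hyp_tokens)

    # ROUGE-L: F-measure over the longest common subsequence.
    rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(
        reference, hypothesis
    )["rougeL"].fmeasure

    return {"BLEU-4": bleu4, "METEOR": meteor, "ROUGE-L": rouge_l}


print(score_question(
    "siapa presiden pertama indonesia ?",
    "siapa presiden pertama republik indonesia ?",
))
```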