Preserving regional languages is a strategic step toward safeguarding cultural heritage while expanding access to knowledge across generations. One approach that can support this effort is the application of machine translation technology to digitize and model local-language texts. This study compares two tokenization strategies, word-based and character-based, in a Kawi–Indonesian translation model built on the FLAN-T5-Small Transformer architecture. The dataset consists of 4,987 preprocessed sentence pairs, and the model was trained for 10 epochs with a batch size of 8. Statistical analysis shows that Kawi sentences average 39.6 characters (5.4 words), while Indonesian sentences average 54.9 characters (7.5 words). These findings suggest that Kawi sentences tend to be lexically dense, with low word repetition and high morphological variation, which can increase the model's learning complexity. Evaluation with the BLEU and METEOR metrics shows that the word-based model achieved a BLEU score of 0.45 and a METEOR score of 0.05, while the character-based model achieved a BLEU score of 0.24 and a METEOR score of 0.04. Although the dataset is larger than in previous studies, these results indicate that the additional data is not sufficient to overcome the limitations of the model's semantic representation of Kawi. This study therefore serves as an initial baseline that can be developed further through subword tokenization approaches, dataset expansion, and training-strategy optimization to improve the quality of local-language translation in the future.
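The two tokenization strategies compared above can be sketched as follows. This is a minimal illustration of the general idea, not the study's actual preprocessing pipeline, and the example sentence is a hypothetical Kawi-like string:

```python
def word_tokenize(sentence: str) -> list[str]:
    # Word-based: split on whitespace; each word becomes one token.
    return sentence.split()

def char_tokenize(sentence: str) -> list[str]:
    # Character-based: every character (including spaces) becomes one token.
    return list(sentence)

# Hypothetical Kawi-like sentence for illustration only.
sentence = "atha sira mangkat"
print(word_tokenize(sentence))  # ['atha', 'sira', 'mangkat']
print(char_tokenize(sentence))  # ['a', 't', 'h', 'a', ' ', 's', ...]
```

Word-based tokenization yields a short sequence but a large, sparse vocabulary (costly for a morphologically rich language like Kawi), while character-based tokenization yields a tiny vocabulary but much longer sequences, which the results above suggest weakens semantic representation.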
Copyright © 2025