Andika Dwiyanto, Felix
Unknown Affiliation

Published: 1 document
Articles

Multilingual Parallel Corpus for Indonesian Low-Resource Languages
Sulistyo, Danang Arbian; Wibawa, Aji Prasetya; Prasetya, Didik Dwi; Ahda, Fadhli Almu’iini; Arya Astawa, I Nyoman Gede; Andika Dwiyanto, Felix
JOIV : International Journal on Informatics Visualization Vol 9, No 5 (2025)
Publisher : Society of Visual Informatics

DOI: 10.62527/joiv.9.5.3412

Abstract

Indonesia has an extraordinary number of languages, with more than 700 regional languages, including Javanese, Madurese, Balinese, Sundanese, and Bugis. Despite this linguistic wealth, digital resources for these languages remain scarce, which makes their digital preservation and accessibility a significant challenge. This research addresses the gap by building a multilingual parallel corpus of more than 150,000 phrase pairs extracted from Bible translations in five regional languages of Indonesia. Rigorous preprocessing, normalization, and Unicode tokenization were applied to improve data quality and consistency. Development of the neural machine translation (NMT) model focused on an encoder-decoder architecture. Evaluation covered both the forward and backward translation directions, measured with BLEU scores. The results show that forward translation consistently outperforms backward translation. The Indonesian-Javanese model achieved 0.9939 for BLEU-1 and 0.9844 for BLEU-4, indicating high translation quality. In contrast, reverse translation tasks, such as translating from Sundanese to Indonesian, proved significantly harder, with BLEU-4 scores as low as 0.3173. This illustrates the complexity of translation between Indonesian and local languages. This work provides a dataset that future research can use, for example with transformer-based models and additional linguistic parameters, to improve the accuracy of natural language processing (NLP) models for Indonesia's underrepresented regional languages.
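For readers unfamiliar with the metrics named in the abstract, the following is a minimal sketch of Unicode normalization, tokenization, and sentence-level BLEU-1 and BLEU-4 scoring. It is not the authors' pipeline: the abstract does not specify the normalization form, the tokenizer, or the BLEU variant, so the choice of NFKC, the regular expression, the unsmoothed single-reference BLEU, and the function names `normalize`, `tokenize`, `ngrams`, and `bleu` are all assumptions made for illustration. The example sentences are invented and are not taken from the corpus.

```python
import math
import re
import unicodedata
from collections import Counter


def normalize(text):
    """Apply Unicode NFKC normalization, lowercase, and collapse whitespace.

    Assumption: the paper does not state which normalization form it used.
    """
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split normalized text into Unicode word tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", normalize(text))


def ngrams(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, hypothesis, max_n=4):
    """Sentence-level BLEU against one reference, uniform weights, no smoothing."""
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not hyp:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        # Clipped counts: each hypothesis n-gram is credited at most as
        # often as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "Aku arep menyang pasar."   # invented example sentence
    hypothesis = "Aku arep menyang omah."   # differs from it by one word
    print("BLEU-1:", bleu(reference, hypothesis, max_n=1))
    print("BLEU-4:", bleu(reference, hypothesis, max_n=4))
```

Because this sketch applies no smoothing, a single missing 4-gram match drives BLEU-4 to zero for a short sentence. Corpus-level BLEU, the more common way to report results, aggregates counts over all sentences and behaves less abruptly.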