Khazanah Informatika: Jurnal Ilmu Komputer dan Informatika
Vol. 9 No. 2 October 2023

Automatic Language Identification for Indonesian-Malaysian Language Using Machine Learning

Abdiansah Abdiansah (Universitas Sriwijaya)
Muhammad Qurhanul Rizqie (Universitas Sriwijaya)

Article Info

Publish Date
29 Oct 2023

Abstract

Language identification (LID) aims to determine which language a given text or speech sample is in. The task tends to be easier for languages with distinct characteristics (e.g., Indonesian and English) than for closely related ones (e.g., Indonesian and Malaysian). Similar languages introduce ambiguity that can bias machine learning models. This research uses a Support Vector Machine (SVM) to classify text as Indonesian or Malaysian. The training and testing data are taken from the Leipzig Corpora Collection and a Twitter dataset. Features are represented with TF-IDF, and Multinomial Naive Bayes serves as the baseline. We used two evaluation schemes: a train/test split (20:80) and 10-fold cross-validation. The experimental results show that the baseline and the SVM differ only slightly in accuracy: both achieve around 90% or higher. The results indicate that Indonesian and Malaysian can be identified with relatively high accuracy even with simple techniques.
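
The pipeline summarized in the abstract (TF-IDF features, a Multinomial Naive Bayes baseline, an SVM, and both a hold-out split and 10-fold cross-validation) can be illustrated with a minimal sketch in scikit-learn. This is not the authors' implementation. The toy sentences, the labels `id` and `ms`, the character n-gram settings, the choice of `LinearSVC`, the reading of the split as a 20% hold-out set, and the helper names `build_models` and `evaluate` are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy sentences, NOT taken from the Leipzig or Twitter data.
indonesian = [
    "saya pergi ke kantor naik mobil",
    "dia tidak bisa datang karena sakit",
    "kami membeli sepeda baru kemarin sore",
    "apakah kamu mau makan siang sekarang",
    "rumah sakit itu dekat dengan kantor polisi",
    "mereka sedang menonton televisi di rumah",
    "tolong kirim uang itu lewat bank",
    "mobil itu rusak karena banjir",
    "kamu bisa menelepon saya besok sore",
    "pemerintah akan membangun bandara baru",
]
malaysian = [
    "saya pergi ke pejabat naik kereta",
    "dia tidak boleh datang kerana sakit",
    "kami membeli basikal baharu semalam petang",
    "adakah awak mahu makan tengah hari sekarang",
    "hospital itu dekat dengan balai polis",
    "mereka sedang menonton televisyen di rumah",
    "tolong hantar wang itu melalui bank",
    "kereta itu rosak kerana banjir",
    "awak boleh telefon saya esok petang",
    "kerajaan akan membina lapangan terbang baharu",
]


def build_models():
    """Return a baseline and an SVM pipeline, each with its own TF-IDF step."""
    def tfidf():
        # Character n-grams are an assumption; the abstract only says "TF-IDF".
        return TfidfVectorizer(analyzer="char_wb", ngram_range=(1, 3))

    return {
        "naive_bayes": make_pipeline(tfidf(), MultinomialNB()),
        "svm": make_pipeline(tfidf(), LinearSVC()),
    }


def evaluate(texts, labels, seed=0):
    """Score each model on a stratified hold-out split and with 10-fold CV."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=0.2, stratify=labels, random_state=seed
    )
    results = {}
    for name, model in build_models().items():
        holdout = model.fit(x_train, y_train).score(x_test, y_test)
        # cross_val_score clones the pipeline, so each fold is fit from scratch.
        folds = cross_val_score(model, texts, labels, cv=10)
        results[name] = {
            "holdout": holdout,
            "cv_mean": folds.mean(),
            "cv_folds": len(folds),
        }
    return results


texts = indonesian + malaysian
labels = ["id"] * len(indonesian) + ["ms"] * len(malaysian)
results = evaluate(texts, labels)
for name, scores in results.items():
    print(f"{name}: holdout={scores['holdout']:.2f} cv_mean={scores['cv_mean']:.2f}")
```

Fitting TF-IDF inside each pipeline keeps the vocabulary and IDF weights from leaking across the train/test boundary in either evaluation scheme. Accuracy on twenty toy sentences says nothing about the figures reported in the abstract.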

Copyright © 2023

Journal Info

Abbrev

khif

Subject

Computer Science & IT

Description

Khazanah Informatika: Jurnal Ilmu Komputer dan Informatika, an Indonesian national journal, publishes high-quality research papers in the broad field of Informatics and Computer Science, which encompasses software engineering, information system development, computer systems, computer network, ...