Indonesia, an archipelagic country, is home to 718 regional languages, many of which face declining usage and even extinction. Technological developments have made it possible to analyze the patterns and distinguishing characteristics of regional languages through n-gram analysis with the Naive Bayes and k-nearest neighbor (k-NN) algorithms. This study analyzes the similarity of three regional languages, Central Javanese, Sundanese, and Pontianak Malay, as part of an effort to support the preservation of regional languages in Indonesia. Similarity between languages was calculated from the misclassification errors in the confusion matrix, and the performance of the algorithms was evaluated using accuracy and F1-score. Naive Bayes with combined unigram and bigram features performed best, with accuracy and F1-score both at 0.921. The highest similarity was found for the Javanese-Malay pair, at only 3.82%, and the lowest for the Malay-Sundanese pair, at 1.66%. These similarity values reflect the dominant characters that appear in each language, such as ‘e’ in Malay and ‘a’ and ‘u’ in Sundanese. The results indicate that Javanese, Sundanese, and Malay share little similarity.
Copyrights © 2025