Balinese is a local language widely spoken by the Balinese people, including on social media platforms. However, the nuances of its politeness levels are often lost in informal digital communication, and computational models that classify these levels automatically remain scarce, as is common for low-resource languages such as Balinese. This study evaluates the performance of the Multinomial Naive Bayes method combined with Term Frequency–Inverse Document Frequency (TF-IDF) feature extraction, Chi-square feature selection, and the Synthetic Minority Oversampling Technique (SMOTE) in classifying Balinese language levels. The dataset consists of 1,314 annotated social media posts and comments, primarily sourced from Instagram. A Balinese language expert annotated the texts, assigning each to one of six levels that represent varying degrees of politeness and formality: alus singgih (polite, used for respecting others), alus sor (polite, used for self-humbling), alus mider (polite, used for both respecting others and self-humbling), alus madia (an intermediate level of politeness), basa andap (casual, commonly used in everyday life), and basa kasar (impolite, often used during arguments or toward animals). The experimental results show that the model achieves 96.53% accuracy on the training data and 61.45% accuracy on the test data. In addition, hyperparameter tuning shows that the Multinomial Naive Bayes model with 2,720 selected features and SMOTE oversampling achieves 91.78% accuracy, substantially outperforming the baseline model without feature selection or oversampling, which achieves only 64.93% accuracy.
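The four stages named above (TF-IDF weighting, Chi-square feature selection, SMOTE oversampling, Multinomial Naive Bayes) can be illustrated with a minimal standard-library sketch. It is not the study's implementation: the tokens (`p1`, `q1`, `r1`, ...) are placeholders rather than Balinese words, only three of the six levels appear, every function name is hypothetical, and the order of the stages (selection before oversampling), the value `k=8`, the single-neighbor SMOTE, and the unnormalized TF-IDF weights are all assumptions made for brevity.

```python
import math
import random
from collections import Counter

def fit_tfidf(docs):
    """Learn a sorted vocabulary and smoothed IDF weights from whitespace tokens."""
    token_sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*token_sets))
    df = Counter(t for s in token_sets for t in s)
    idf = [math.log((1 + len(docs)) / (1 + df[t])) + 1 for t in vocab]
    return vocab, idf

def transform(docs, vocab, idf):
    """Map each document to a dense vector of raw term count times IDF."""
    rows = []
    for d in docs:
        tf = Counter(d.lower().split())  # out-of-vocabulary tokens are ignored
        rows.append([tf[t] * w for t, w in zip(vocab, idf)])
    return rows

def chi2_scores(rows, labels):
    """Chi-square statistic per feature: observed vs. expected class totals."""
    n_feat = len(rows[0])
    total = [sum(r[j] for r in rows) for j in range(n_feat)]
    scores = [0.0] * n_feat
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        prior = len(members) / len(rows)
        for j in range(n_feat):
            expected = prior * total[j]
            if expected > 0:
                observed = sum(r[j] for r in members)
                scores[j] += (observed - expected) ** 2 / expected
    return scores

def top_k(scores, k):
    """Indices of the k highest-scoring features, in vocabulary order."""
    return sorted(sorted(range(len(scores)), key=lambda j: -scores[j])[:k])

def project(rows, keep):
    """Keep only the selected feature columns."""
    return [[r[j] for j in keep] for r in rows]

def smote(rows, labels, k=1, seed=0):
    """Oversample each minority class by interpolating toward a near neighbor."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for c, cnt in counts.items():
        members = [r for r, y in zip(rows, labels) if y == c]
        if len(members) < 2:
            continue  # interpolation needs at least one same-class neighbor
        for _ in range(target - cnt):
            base = rng.choice(members)
            neighbors = sorted((m for m in members if m is not base),
                               key=lambda m: math.dist(base, m))[:k]
            nb = rng.choice(neighbors)
            gap = rng.random()
            out_rows.append([b + gap * (v - b) for b, v in zip(base, nb)])
            out_labels.append(c)
    return out_rows, out_labels

def fit_mnb(rows, labels, alpha=1.0):
    """Multinomial Naive Bayes with Laplace smoothing: log priors and likelihoods."""
    model = {}
    for c in sorted(set(labels)):
        members = [r for r, y in zip(rows, labels) if y == c]
        sums = [sum(r[j] for r in members) + alpha for j in range(len(rows[0]))]
        denom = sum(sums)
        model[c] = (math.log(len(members) / len(rows)),
                    [math.log(s / denom) for s in sums])
    return model

def predict_mnb(model, row):
    """Return the class with the highest log posterior score."""
    return max(model, key=lambda c: model[c][0]
               + sum(x * w for x, w in zip(row, model[c][1])))

# Placeholder corpus: an imbalanced toy set, not data from the study.
docs = ["p1 p2 p3", "p1 p3 p4", "p2 p3 p1", "p4 p1 p2",
        "q1 q2 p1", "q2 q3", "r1 r2", "r2 r3 p2"]
labels = ["basa andap"] * 4 + ["alus singgih"] * 2 + ["basa kasar"] * 2

vocab, idf = fit_tfidf(docs)
rows = transform(docs, vocab, idf)
keep = top_k(chi2_scores(rows, labels), k=8)
bal_rows, bal_labels = smote(project(rows, keep), labels)
model = fit_mnb(bal_rows, bal_labels)

def classify(text):
    """Run one new text through the fitted TF-IDF, selection, and classifier."""
    return predict_mnb(model, project(transform([text], vocab, idf), keep)[0])

print(classify("q1 q3"))
```

In practice, oversampling would be applied to the training split only, so that synthetic samples never leak into the test data. Library implementations of each stage (for example in scikit-learn and imbalanced-learn) would normally replace these hand-written functions.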
Copyright © 2026