This study evaluates the quality of digital tax services by analyzing the sentiment expressed in user reviews of the M-Pajak app. A dataset of 6,829 reviews was labeled as negative, neutral, or positive, and the performance of the Random Forest and XGBoost algorithms was compared. Although the models reached high accuracy, 85.21% under hold-out validation and 90.14% under stratified k-fold cross-validation, a closer evaluation using confusion matrices revealed a strong bias toward the majority class. The key finding is that these accuracy figures are misleading: both models almost entirely failed to identify the neutral class, with F1-scores of only 0.00–0.11. This indicates that the primary issue lies not in algorithm selection but in the extreme class imbalance and the ambiguity of rating-based labeling. The scientific contribution of this research is to demonstrate that the evaluation of sentiment classification systems must go beyond conventional accuracy metrics. Prioritizing stable performance across every class is expected to yield fairer and more objective evaluations of public data.
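The gap between overall accuracy and per-class F1 described above can be illustrated with a minimal sketch. The label counts and predictions below are hypothetical and chosen only to mimic an imbalanced three-class setting; they are not the study's data, and the helper names (`accuracy`, `per_class_f1`) are illustrative rather than taken from the authors' code.

```python
from collections import Counter

LABELS = ["negative", "neutral", "positive"]


def accuracy(y_true, y_pred):
    """Share of reviews whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_f1(y_true, y_pred, labels=LABELS):
    """F1 per class, computed from confusion-matrix counts."""
    counts = Counter(zip(y_true, y_pred))  # (true, predicted) -> count
    scores = {}
    for label in labels:
        tp = counts[(label, label)]
        fp = sum(counts[(other, label)] for other in labels if other != label)
        fn = sum(counts[(label, other)] for other in labels if other != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical imbalanced sample: 70 negative, 10 neutral, 20 positive.
y_true = ["negative"] * 70 + ["neutral"] * 10 + ["positive"] * 20

# Hypothetical classifier that never predicts "neutral".
y_pred = (
    ["negative"] * 66 + ["positive"] * 4      # true negative reviews
    + ["negative"] * 7 + ["positive"] * 3     # true neutral reviews
    + ["positive"] * 18 + ["negative"] * 2    # true positive reviews
)

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
macro_f1 = sum(f1.values()) / len(f1)

print(f"accuracy = {acc:.2f}")
for label in LABELS:
    print(f"F1[{label}] = {f1[label]:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
```

In this toy case the classifier is right on 84 of 100 reviews, yet its neutral-class F1 is zero, which drags the macro-averaged F1 far below the accuracy. Reporting per-class and macro-averaged scores alongside accuracy makes this kind of failure visible.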
Copyrights © 2026