Media Jurnal Informatika
Vol 17, No 2 (2025): Media Jurnal Informatika

Evaluating Machine Learning Models Across Feature Extraction and Data Balancing Scenarios for Coretax Sentiment Analysis

Syah Putra, Subhan
Riminarsih, Desti



Article Info

Publish Date
31 Dec 2025

Abstract

The implementation of the Core Tax Administration System (Coretax) by the Indonesian Directorate General of Taxes has generated diverse public responses on social media, particularly on platform X, making sentiment analysis a relevant approach for assessing public perception of this policy. This study evaluates the performance of machine learning classifiers across different feature extraction and data balancing scenarios. Three classifiers, namely Multinomial Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression, were evaluated under four experimental scenarios that combine two feature extraction methods, Term Frequency–Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), with original and balanced data distributions. A dataset of more than 50,000 Coretax-related posts collected from platform X was preprocessed and automatically labeled into positive, negative, and neutral sentiment classes using a pretrained IndoBERT sentiment model. A brief manual inspection of a random subset indicated moderate agreement between automatic and manual labels, which highlights potential label noise while supporting the use of automatic labeling for comparative analysis. The results show that performance is shaped by the combined effects of feature representation and data distribution rather than by algorithm choice alone. Logistic Regression consistently achieved the most stable and competitive performance across all scenarios, with accuracy ranging from approximately 0.80 to 0.83 and macro F1-scores of around 0.72–0.73. TF-IDF generally provided more stable performance than BoW, while data balancing improved prediction fairness for minority sentiment classes at the cost of a slight decrease in overall accuracy. These findings indicate that Logistic Regression is the most robust of the three models for Coretax sentiment analysis across the tested feature extraction and data balancing conditions, and they provide practical insight into how data representation and distribution influence sentiment classification performance.
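The gap the abstract reports between accuracy (about 0.80 to 0.83) and macro F1 (about 0.72 to 0.73), and the trade-off introduced by balancing, can be illustrated with a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the abstract does not say which balancing technique was used, so random oversampling stands in here, and the toy labels, the function names (`accuracy`, `macro_f1`, `random_oversample`), and the class names are hypothetical rather than taken from the study.

```python
import random
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never matched scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def random_oversample(texts, labels, seed=0):
    """Assumed balancing step: duplicate minority-class items at random
    until every class is as large as the largest one."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    target = max(len(items) for items in by_class.values())
    out_texts, out_labels = [], []
    for label, items in by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy imbalanced labels (hypothetical): 8 negative, 1 neutral, 1 positive.
y_true = ["neg"] * 8 + ["neu", "pos"]
# A degenerate classifier that always predicts the majority class.
y_pred = ["neg"] * 10

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)

# Balance a toy corpus with the same label distribution.
balanced_texts, balanced_labels = random_oversample(
    [f"post {i}" for i in range(10)], y_true
)
class_counts = Counter(balanced_labels)

print(f"accuracy={acc:.2f} macro_f1={f1:.2f} counts={dict(class_counts)}")
```

In this toy case the majority-only classifier looks acceptable on accuracy but poor on macro F1, because the two minority classes each contribute an F1 of zero. That is the kind of imbalance effect that balancing is meant to reduce, and in a real pipeline the balancing step would normally be applied to the training split only.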

Copyrights © 2025






Journal Info

Abbrev

mjinformatika

Publisher

Subject

Computer Science & IT

Description

Media Jurnal Informatika is a journal published by the Informatics Engineering Study Program of Universitas Suryakancana Cianjur, issued every six months in June and December. Media Jurnal Informatika was first published in print in 2009, appearing once a year, ...