Media Jurnal Informatika

Vol 17, No 2 (2025): Media Jurnal Informatika

Syah Putra, Subhan
Riminarsih, Desti

Publish Date
31 Dec 2025

The implementation of the Core Tax Administration System (Coretax) by the Indonesian Directorate General of Taxes has generated diverse public responses on social media, particularly on platform X, making sentiment analysis a relevant approach to assess public perception of this policy. This study aims to evaluate the performance of machine learning classifiers across different feature extraction and data balancing scenarios. Three machine learning classifiers, namely Multinomial Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression, were evaluated under four experimental scenarios combining two feature extraction methods, namely Term Frequency–Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), with original and balanced data distributions. A dataset of more than 50,000 Coretax-related posts collected from platform X was preprocessed and automatically labeled into positive, negative, and neutral sentiment classes using a pretrained IndoBERT sentiment model. A brief manual inspection of a random subset indicates moderate agreement between automatic and manual labels, highlighting potential noise while supporting the use of automatic labeling for comparative analysis. The results show that performance is shaped by the combined effects of representation and data distribution rather than algorithm choice alone. Logistic Regression consistently achieved the most stable and competitive performance across all scenarios, with accuracy values ranging from approximately 0.80 to 0.83 and macro F1-scores around 0.72–0.73. TF-IDF generally provided more stable performance, while data balancing improved prediction fairness for minority sentiment classes despite a slight decrease in overall accuracy.
These findings demonstrate that Logistic Regression is the most robust model for Coretax sentiment analysis across varying feature extraction and data balancing conditions and provide practical insights into the influence of data representation and distribution on sentiment classification performance.
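
The trade-off the abstract reports between overall accuracy and fairness toward minority sentiment classes can be illustrated with a small sketch. The labels below are made up for illustration and are not data or results from the study; the helper names `accuracy` and `macro_f1` and both toy prediction lists are assumptions introduced here.

```python
LABELS = ("positive", "negative", "neutral")


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=LABELS):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy, imbalanced reference labels (hypothetical): 8 negative, 1 positive, 1 neutral.
y_true = ["negative"] * 8 + ["positive", "neutral"]

# Hypothetical model A: always predicts the majority class.
majority = ["negative"] * 10

# Hypothetical model B: recovers both minority posts but mislabels 3 negatives.
minority_aware = (
    ["negative"] * 5
    + ["positive", "positive", "neutral"]
    + ["positive", "neutral"]
)

for name, y_pred in (("majority", majority), ("minority_aware", minority_aware)):
    print(name, round(accuracy(y_true, y_pred), 3), round(macro_f1(y_true, y_pred), 3))
```

In this toy case the majority-only predictor has the higher accuracy, while the minority-aware predictor has the clearly higher macro F1. This is the same kind of pattern the abstract describes for balanced data, and it is why the two metrics are reported together.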

Journal Info

Media Jurnal Informatika

Abbrev

mjinformatika

Publisher

Universitas Suryakancana

Subject

Computer Science & IT

Description

Media Jurnal Informatika is a journal published by the Informatics Engineering Study Program of Universitas Suryakancana, Cianjur, issued every six months, in June and December. Media Jurnal Informatika was first published in print in 2009, appearing once a year, ...


Evaluating Machine Learning Models Across Feature Extraction and Data Balancing Scenarios for Coretax Sentiment Analysis
