Sentiment analysis plays a crucial role in assessing public perception, particularly of healthcare services such as BPJS Kesehatan, Indonesia’s national health insurance program. However, sentiment classification is complicated by class imbalance: negative feedback far outnumbers positive feedback. This study investigates whether sentiment classification should prioritize traditional evaluation metrics or preserve the original sentiment distribution so that predictions remain representative of real-world data. Two feature extraction methods, Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), were evaluated with Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression classifiers, with the maximum number of features varied from 100 to 300 to examine the impact of feature dimensionality. Model performance was measured with traditional metrics such as accuracy, precision, and recall, while sentiment distribution fidelity was assessed by comparing the predicted class proportions with those in the dataset. The results show that TF-IDF achieves higher precision and recall but fails to capture positive sentiment, which skews the representation of real-world trends, whereas BoW yields a more balanced distribution at the cost of slightly lower accuracy. Paired t-tests and Wilcoxon signed-rank tests confirmed that the differences in accuracy and recall are statistically significant, whereas the differences in precision and sentiment distribution are not. These findings highlight a trade-off between predictive performance and sentiment diversity that matters in healthcare services and other fields with imbalanced datasets, and they underscore the need to align evaluation metrics with real-world objectives. Future research should investigate advanced models, such as deep learning and transformer-based approaches, to enhance both accuracy and fairness when analyzing imbalanced data.
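
The idea of sentiment distribution fidelity can be illustrated with a minimal sketch. The abstract does not state which measure was used to compare predicted and actual proportions, so the total variation distance below is an assumption, as are the function names `class_proportions` and `distribution_gap` and the toy labels. The sketch only shows how a classifier that never predicts the minority class can look acceptable on accuracy while distorting the distribution.

```python
from collections import Counter


def class_proportions(labels, classes):
    """Return the share of each class in `labels` (hypothetical helper)."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    return {c: counts.get(c, 0) / len(labels) for c in classes}


def distribution_gap(true_labels, predicted_labels):
    """Total variation distance between true and predicted class shares.

    Assumed measure, not necessarily the one used in the study:
    0.0 means identical proportions, 1.0 means completely disjoint.
    """
    classes = sorted(set(true_labels) | set(predicted_labels))
    p_true = class_proportions(true_labels, classes)
    p_pred = class_proportions(predicted_labels, classes)
    return 0.5 * sum(abs(p_true[c] - p_pred[c]) for c in classes)


# Toy data (invented for illustration): 80% negative, 20% positive.
true_labels = ["negative"] * 8 + ["positive"] * 2

# A classifier that never predicts "positive".
all_negative = ["negative"] * 10

# A classifier that predicts a mix of both classes.
mixed = ["negative"] * 7 + ["positive"] * 3

gap_all_negative = distribution_gap(true_labels, all_negative)
gap_mixed = distribution_gap(true_labels, mixed)

print(f"all-negative predictions: gap = {gap_all_negative:.2f}")
print(f"mixed predictions:        gap = {gap_mixed:.2f}")
```

In this toy setting the all-negative predictions match 8 of the 10 true labels, yet their gap from the true distribution is larger than that of the mixed predictions, which is the kind of tension between performance and representation the study examines.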