Multi-class sentiment analysis on highly imbalanced datasets makes accurate and equitable classification difficult, particularly when the Neutral class is heavily underrepresented. This study evaluates four fine-tuned transformer models, namely Bidirectional Encoder Representations from Transformers (BERT), DistilBERT, RoBERTa, and DeBERTa, on a real-world Amazon review dataset comprising over 20,000 user-generated texts. Sentiment labels were derived from star ratings through a standardized mapping scheme. BERT achieved the highest overall accuracy (93%), but its performance on the minority Neutral class remained limited (F1-score: 0.36). DeBERTa raised Neutral recall to 0.59 at a slightly lower overall accuracy of 91%. To address this imbalance, two ensemble strategies were explored: a fixed-weight soft voting scheme and an optimized-weight ensemble combining RoBERTa and DeBERTa. The optimized RoBERTa–DeBERTa ensemble yielded the most balanced performance, achieving a Neutral-class F1-score of 0.57 while maintaining 91% overall accuracy. ROC and PR curve analyses further indicate a better sensitivity–precision balance for this ensemble. These findings indicate that adaptive ensemble weighting can improve minority-class detection under severe imbalance, and they offer practical guidance for building more balanced and reliable sentiment classification systems.
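The optimized-weight soft voting described above can be illustrated with a minimal sketch. The abstract does not state how the weights were optimized, so the following assumes a simple grid search over a single mixing weight, scored by Neutral-class F1 on a held-out validation set. The label order (0 = Negative, 1 = Neutral, 2 = Positive) and the names `soft_vote`, `f1_for_class`, and `best_weight` are hypothetical and chosen only for illustration; the inputs stand in for the per-class probabilities that two fine-tuned models (for example RoBERTa and DeBERTa) would produce.

```python
import numpy as np

# Assumed label order (not specified in the source): 0=Negative, 1=Neutral, 2=Positive.
NEUTRAL = 1


def soft_vote(probs_a, probs_b, w):
    """Weighted average of two (n_samples, n_classes) probability arrays.

    w is the weight on model A; model B receives 1 - w.
    """
    return w * probs_a + (1.0 - w) * probs_b


def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1 for a single class, returning 0.0 when undefined."""
    tp = np.sum((y_pred == cls) & (y_true == cls))
    fp = np.sum((y_pred == cls) & (y_true != cls))
    fn = np.sum((y_pred != cls) & (y_true == cls))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_weight(probs_a, probs_b, y_true, cls=NEUTRAL, steps=21):
    """Grid-search the mixing weight that maximizes F1 for one class.

    Assumption: the search objective is the target-class F1 on validation
    data; ties keep the smallest weight encountered first.
    """
    best_w, best_f1 = 0.5, -1.0
    for w in np.linspace(0.0, 1.0, steps):
        y_pred = soft_vote(probs_a, probs_b, w).argmax(axis=1)
        f1 = f1_for_class(y_true, y_pred, cls)
        if f1 > best_f1:
            best_w, best_f1 = float(w), f1
    return best_w, best_f1
```

A fixed-weight scheme corresponds to calling `soft_vote` with a preset `w` (for example 0.5), whereas the optimized variant selects `w` with `best_weight` on validation data and then applies it unchanged to the test set. Optimizing for a single class can trade away overall accuracy, so in practice the chosen weight should be checked against both metrics.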
Copyright © 2026