Sentiment classification is usually performed manually by humans, but manual sentiment labeling is inefficient, so automated labeling using machine learning is essential. Building an automated labeling model is challenging when labeled data is scarce, which can reduce model accuracy. This study proposes a semi-supervised learning (SSL) framework for sentiment analysis with limited labeled data. The framework integrates self-learning with enhanced co-training. The co-training model combines three machine learning methods: Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR), with TF-IDF and FastText used for feature extraction. The co-training model generates pseudo-labels; the pseudo-labels from the three models (SVM, RF, LR) are then compared and the most confident predictions are selected, a step referred to as self-learning. The framework is applied to English and Indonesian datasets, and each dataset is run five times. The performance difference between the baseline model (without pseudo-labels) and SSL (with pseudo-labels) is not significant; the Wilcoxon Signed-Rank Test confirms this, yielding a p-value < 0.05. Results show that SSL produces pseudo-labels on unlabeled data whose quality is close to that of the original labels. Although SSL performs well on four datasets, it has not yet surpassed the performance of supervised classification (the baseline). Labeling with SSL proves far more efficient than manual labeling, taking only around 10-20 minutes to label thousands to tens of thousands of samples. In conclusion, self-learning in SSL with co-training can effectively label unlabeled data in multilingual and limited datasets, but its performance has not yet converged across datasets.
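The pipeline the abstract describes can be sketched roughly as follows. This is a hedged, minimal illustration, not the paper's actual implementation: three classifiers (SVM, RF, LR) are trained on TF-IDF features of a small labeled set, their averaged class probabilities pseudo-label the unlabeled texts, and only predictions above a confidence threshold are kept (the self-learning selection step). The toy texts, the confidence threshold, the probability-averaging rule, and the accuracy values fed to the Wilcoxon test are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Toy data standing in for a real sentiment corpus (illustrative only).
labeled_texts = [
    "great product", "love it", "really good", "excellent quality", "very happy",
    "terrible service", "awful experience", "very bad", "poor quality", "hate it",
]
labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])  # 1 = positive, 0 = negative
unlabeled_texts = ["good service", "bad product", "love this quality", "awful day"]

# TF-IDF features (the abstract also mentions FastText, omitted here for brevity).
vec = TfidfVectorizer()
X_lab = vec.fit_transform(labeled_texts)
X_unlab = vec.transform(unlabeled_texts)

# Co-training ensemble: the three learners named in the abstract.
models = [
    SVC(probability=True, random_state=0),
    RandomForestClassifier(random_state=0),
    LogisticRegression(max_iter=1000),
]
for m in models:
    m.fit(X_lab, labels)

# Self-learning step: average the three models' class probabilities and keep
# only pseudo-labels whose confidence clears an (assumed) threshold.
proba = np.mean([m.predict_proba(X_unlab) for m in models], axis=0)
confidence = proba.max(axis=1)
pseudo_labels = proba.argmax(axis=1)
THRESHOLD = 0.6  # assumed value, not taken from the paper
accepted = [(t, int(l), float(c))
            for t, l, c in zip(unlabeled_texts, pseudo_labels, confidence)
            if c >= THRESHOLD]
print(accepted)

# Paired comparison of baseline vs. SSL accuracy over five runs with the
# Wilcoxon Signed-Rank Test; these accuracy values are made up.
baseline_acc = [0.81, 0.79, 0.82, 0.80, 0.78]
ssl_acc = [0.80, 0.78, 0.81, 0.79, 0.77]
stat, pvalue = wilcoxon(baseline_acc, ssl_acc)
print(f"Wilcoxon p-value: {pvalue:.3f}")
```

Averaging probabilities is one simple way to combine the three views; the paper's enhanced co-training may instead use agreement between models or per-model thresholds, which this sketch does not attempt to reproduce.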
Copyright © 2025