Social media, especially Twitter (now X), is a rich source of data for sentiment analysis. However, limited training data is a major obstacle to building machine learning models that produce fast and accurate sentiment predictions. This research applies data aggregation to expand the training dataset and tests several preprocessing steps: cleaning, case folding, normalization, stemming, and lexicon-based methods. Classification uses a Stochastic Gradient Descent (SGD) classifier, with text represented by word embeddings generated from the fastText language model. Lexicon-based preprocessing, particularly the handling of emojis and emoticons, has a significant impact once data is added, because it captures emotion and context that conventional text analysis often overlooks. Experimental results show that data addition and preprocessing optimization improved the F1 score from a baseline of 40% to 52.13%, surpassing the 51.28% reached by the organizer. These findings emphasize the importance of data aggregation, preprocessing optimization, and parameter tuning with grid search for improving text sentiment classification on limited datasets.
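To make the preprocessing steps named above more concrete, the following is a minimal Python sketch of cleaning, case folding, normalization, and lexicon-based emoji and emoticon handling. It is an illustration under stated assumptions, not the pipeline used in this research: the lexicon entries, the normalization table, the regular expressions, and the function names `replace_emotion_symbols` and `preprocess` are all hypothetical. Stemming is omitted because it depends on a language-specific stemmer that the abstract does not name.

```python
import re

# Hypothetical emoji/emoticon lexicon: maps a symbol to an emotion word.
# The entries are illustrative only, not the lexicon used in the study.
EMOTION_LEXICON = {
    "😀": "happy",
    "😢": "sad",
    "😡": "angry",
    ":)": "happy",
    ":(": "sad",
}

# Hypothetical normalization table for informal spellings.
NORMALIZATION = {"u": "you", "gr8": "great", "thx": "thanks"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def replace_emotion_symbols(text: str) -> str:
    """Replace each known emoji or emoticon with its emotion word."""
    # Longest symbols first, so a longer emoticon is never shadowed
    # by a shorter one that happens to be its prefix.
    for symbol in sorted(EMOTION_LEXICON, key=len, reverse=True):
        text = text.replace(symbol, f" {EMOTION_LEXICON[symbol]} ")
    return text


def preprocess(text: str) -> str:
    """Clean, map emotion symbols, case-fold, and normalize one post."""
    text = URL_RE.sub(" ", text)          # cleaning: drop links
    text = MENTION_RE.sub(" ", text)      # cleaning: drop @mentions
    text = replace_emotion_symbols(text)  # must run before punctuation removal
    text = text.lower()                   # case folding
    text = NON_ALNUM_RE.sub(" ", text)    # cleaning: drop remaining symbols
    tokens = [NORMALIZATION.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("@user This is GR8 :) https://t.co/abc 😀"))
    # prints: this is great happy happy
```

The lexicon step runs before punctuation stripping on purpose: if symbols such as `:)` were removed first, the emotional signal they carry would be lost before it could be mapped to a word that an embedding model can represent.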
Copyrights © 2024