Text classification is a fundamental task in Natural Language Processing (NLP) that assigns predefined labels to text. This study evaluates the effectiveness of keyword-based labeling and sentiment analysis for text classification on the Quora Questions dataset. The dataset comprises 16,921 samples with an imbalanced class distribution: the opinion category dominates, while the hypothetical category is the minority class. The fact and hypothetical categories were labeled with a keyword-based approach, and the opinion category was labeled through sentiment analysis with the VADER lexicon. TF-IDF served as the feature representation, and two configurations were compared: one with the n-gram range tuned (1–3) and one without n-gram tuning. Complement Naive Bayes (ComplementNB), which is designed for imbalanced datasets, was used as the classifier with a 70:30 train-test split. The configuration without n-gram tuning achieved the highest accuracy, 93.89%, with zero variance in cross-validation. Evaluation showed that ComplementNB handles the class imbalance effectively, as indicated by high precision and recall on the minority class. The study demonstrates that a simple combination of keyword-based labeling and sentiment analysis can be applied effectively to category-based text classification, particularly on platforms like Quora. These findings are relevant to similar applications that require real-time text classification with minimal complexity.
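The classification stage described above can be sketched with scikit-learn as follows. This is a minimal illustration and not the authors' code: the toy questions and labels are hypothetical stand-ins for the Quora Questions dataset, and the helper `build_model`, the stratified split, and the fixed `random_state` are assumptions made for the example. It compares TF-IDF without n-gram tuning against an n-gram range of (1, 3), each feeding ComplementNB, on a 70:30 split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; the real study uses 16,921 labeled Quora questions.
questions = [
    "What is the best programming language to learn?",
    "Why do people love living in big cities?",
    "Which phone do you think is the worst?",
    "Is remote work better than office work?",
    "What is your favorite movie of all time?",
    "Why is classical music so beautiful?",
    "Do you think electric cars are great?",
    "Which country has the best food?",
    "Is college a terrible investment today?",
    "What is the most amazing place you have visited?",
    "How many planets are in the solar system?",
    "When did the second world war end?",
    "Who discovered penicillin?",
    "Where is the Amazon river located?",
    "How many bones are in the human body?",
    "When was the printing press invented?",
    "What if humans could live on Mars?",
    "What would happen if the sun disappeared?",
    "Suppose gravity stopped, what would happen?",
    "What if the internet had never been invented?",
]
labels = ["opinion"] * 10 + ["fact"] * 6 + ["hypothetical"] * 4

# 70:30 split; stratifying keeps the minority class in both parts (an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    questions, labels, test_size=0.30, stratify=labels, random_state=0
)


def build_model(ngram_range=(1, 1)):
    """TF-IDF features followed by Complement Naive Bayes."""
    return make_pipeline(TfidfVectorizer(ngram_range=ngram_range), ComplementNB())


results = {}
vocab_sizes = {}
for name, ngram_range in {"no_tuning": (1, 1), "ngram_1_3": (1, 3)}.items():
    model = build_model(ngram_range).fit(X_train, y_train)
    results[name] = model.score(X_test, y_test)  # accuracy on the held-out 30%
    vocab_sizes[name] = len(model.named_steps["tfidfvectorizer"].vocabulary_)

for name in results:
    print(f"{name}: accuracy={results[name]:.2f}, features={vocab_sizes[name]}")
```

On data this small the accuracy figures carry no meaning. The sketch only shows how the two TF-IDF configurations differ: the (1, 3) range adds bigram and trigram features, which enlarges the vocabulary without necessarily improving accuracy.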
Copyright © 2025