The rapid growth of scientific publications makes it difficult to group journal articles by subject area, especially when classification relies on metadata such as titles, abstracts, and keywords. Differences in feature representation and classification algorithm often lead to varying performance, so comparative studies are needed to identify the best model combination. This study compares four combinations of subject area classification models: TF-IDF + Naïve Bayes, TF-IDF + Support Vector Machine, Bag-of-Words + Support Vector Machine, and Bag-of-Words + Naïve Bayes. The research process comprised text preprocessing, feature extraction, and testing with an 80% training and 20% testing data split across five scenarios. Evaluation used confusion matrices, accuracy, precision, recall, and F1-score. The experimental results showed performance differences between models, with average F1-scores of 0.8103 for TF-IDF + Naïve Bayes, 0.8494 for TF-IDF + Support Vector Machine, 0.8297 for Bag-of-Words + Support Vector Machine, and 0.8335 for Bag-of-Words + Naïve Bayes, which is reported as the best performance. These findings indicate that a word frequency-based approach combined with Naïve Bayes is effective for classifying journal article subject areas from metadata, although challenges remain for subject areas that are semantically close.
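As a rough illustration of one of the four combinations, the sketch below implements Bag-of-Words + Naïve Bayes with an 80/20 split and an F1-score using only the Python standard library. It is not the study's implementation. The toy corpus, the two labels (`cs`, `med`), whitespace tokenization, Laplace smoothing, the random seed, and macro-averaging of F1 are all assumptions made for the example; the study's own preprocessing, averaging scheme, and five scenarios are not specified here.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy metadata (titles) and subject labels, for illustration only.
TOY_DOCS = [
    "deep neural network for image classification",
    "graph algorithm for shortest path search",
    "compiler optimization for parallel programs",
    "neural network training on large datasets",
    "database query optimization algorithm",
    "clinical trial of a new vaccine in patients",
    "patients with diabetes and insulin therapy",
    "surgical treatment of heart disease",
    "vaccine response in elderly patients",
    "therapy outcomes for cancer patients",
]
TOY_LABELS = ["cs"] * 5 + ["med"] * 5


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model on Bag-of-Words counts."""
    class_docs = Counter(labels)          # documents per class (for priors)
    word_counts = defaultdict(Counter)    # word frequencies per class
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab


def predict(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def macro_f1(y_true, y_pred):
    """Average per-class F1-scores, weighting every class equally (assumed)."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


def split_80_20(docs, labels, seed=0):
    """Shuffle with a fixed seed, then keep 80% for training, 20% for testing."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) * 4 // 5
    train, test = idx[:cut], idx[cut:]
    return ([docs[i] for i in train], [labels[i] for i in train],
            [docs[i] for i in test], [labels[i] for i in test])


if __name__ == "__main__":
    tr_docs, tr_labels, te_docs, te_labels = split_80_20(TOY_DOCS, TOY_LABELS)
    model = train_naive_bayes(tr_docs, tr_labels)
    predictions = [predict(model, d) for d in te_docs]
    print("macro F1 on held-out 20%:", macro_f1(te_labels, predictions))
```

Swapping the raw counts for TF-IDF weights, or the classifier for a Support Vector Machine, would give the other three combinations; in practice a library such as scikit-learn would typically be used for both steps.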
Copyrights © 2026