Evaluation Analysis of the Necessity of Stemming and Lemmatization in Text Classification
Saraswati, Ni Wayan Sumartini; Yanti, Christina Purnama; Muku, I Dewa Made Krishna; Dewi, Dewa Ayu Putu Rasmika
MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer Vol. 24 No. 2 (2025)
Publisher : Universitas Bumigora

DOI: 10.30812/matrik.v24i2.4833

Abstract

Stemming and lemmatization are text preprocessing methods that convert words to their root form and to their canonical (dictionary) form, respectively. Some previous studies state that stemming and lemmatization worsen the performance of text classification models, while others report a positive impact. This study analyzes the impact of stemming and lemmatization on text classification with the support vector machine (SVM) method, using English and Indonesian text datasets, and examines when these methods should be used. The research process consisted of several stages: text preprocessing using stemming and lemmatization, feature extraction with Term Frequency-Inverse Document Frequency (TF-IDF), classification using SVM, and model evaluation across 4 experiment scenarios. The experimental results show that stemming generally degrades the performance of the text classification model, especially on large and unbalanced datasets. Stemming had the best computation time, completing in 4 hours, 51 minutes, and 41.3 seconds on the largest dataset. Lemmatization positively impacts classification performance on small datasets, achieving 91.075% accuracy, but it has the worst computation time, especially on large datasets, where it takes 5 hours, 10 minutes, and 25.2 seconds. The experimental results also show that stemming on the balanced Indonesian dataset yields better text classification performance, reaching 82.080% accuracy.
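
To make the preprocessing and feature extraction stages concrete, the following is a minimal, standard-library-only sketch of how stemming and lemmatization can produce different TF-IDF features. It is not the authors' code: `toy_stem`, `toy_lemmatize`, the `LEMMAS` dictionary, the three-sentence corpus, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration, and the SVM classification stage is omitted.

```python
import math
from collections import Counter

# Toy suffix-stripping stemmer (an assumption for illustration;
# NOT Porter or any stemmer named in the study).
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Toy lemma dictionary (hypothetical entries, not a real lexicon).
LEMMAS = {"studies": "study", "studying": "study", "better": "good", "ran": "run"}

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words pass through."""
    return LEMMAS.get(word, word)

def preprocess(text, normalize):
    """Lowercase, split on whitespace, and normalize each token."""
    return [normalize(token) for token in text.lower().split()]

def tfidf(docs):
    """Return one {term: weight} dict per tokenized document,
    using the unsmoothed weight tf * log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

# Hypothetical three-document corpus.
corpus = ["she studies stemming", "he ran studying", "better models"]

stemmed = [preprocess(doc, toy_stem) for doc in corpus]
lemmatized = [preprocess(doc, toy_lemmatize) for doc in corpus]

stem_vocab = {term for doc in stemmed for term in doc}
lemma_vocab = {term for doc in lemmatized for term in doc}

stem_weights = tfidf(stemmed)
lemma_weights = tfidf(lemmatized)
```

In this toy corpus the stemmer maps "studies" and "studying" to two different strings, while the dictionary lookup merges them into one feature, so the two preprocessing choices yield different vocabularies and different TF-IDF weights. In a full pipeline these weight vectors would be the input to a classifier such as an SVM.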