Dewa Ayu Putu Rasmika Dewi
Monash University, Melbourne, Australia

Published: 1 Document
Articles

Evaluation Analysis of the Necessity of Stemming and Lemmatization in Text Classification
Ni Wayan Sumartini Saraswati; Christina Purnama Yanti; I Dewa Made Krishna Muku; Dewa Ayu Putu Rasmika Dewi
MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer Vol. 24 No. 2 (2025)
Publisher : LPPM Universitas Bumigora

DOI: 10.30812/matrik.v24i2.4833

Abstract

Stemming and lemmatization are text preprocessing methods that reduce words to their root form and to their canonical or dictionary form, respectively. Some previous studies report that stemming and lemmatization worsen the performance of text classification models, while others report that they have a positive impact on classification performance. This study analyzes the impact of stemming and lemmatization on text classification using the support vector machine (SVM) method, applied to English and Indonesian text datasets, and examines when these methods should be used. The research process consisted of several stages: text preprocessing using stemming and lemmatization, feature extraction with Term Frequency-Inverse Document Frequency (TF-IDF), classification using SVM, and model evaluation across four experiment scenarios. The experimental results show that stemming generally degrades the performance of the text classification model, especially on large and unbalanced datasets. Stemming achieved the best computation time, completing in 4 hours, 51 minutes, and 41.3 seconds on the largest dataset. Lemmatization positively impacts classification performance on small datasets, achieving 91.075% accuracy, but results in the worst computation time, especially on large datasets, taking 5 hours, 10 minutes, and 25.2 seconds. The experimental results also show that stemming on the balanced Indonesian dataset yields better text classification performance, reaching 82.080% accuracy.
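
The abstract describes a pipeline of preprocessing (stemming or lemmatization), TF-IDF feature extraction, SVM classification, and evaluation. The sketch below is a minimal Python illustration of such a pipeline, assuming NLTK for stemming/lemmatization and scikit-learn for TF-IDF and SVM; the toy corpus, the linear kernel, and the whitespace tokenization are assumptions for illustration, not the authors' actual datasets, scenarios, or configuration.

```python
import nltk
nltk.download("wordnet", quiet=True)  # corpus needed by WordNetLemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, mode="stem"):
    """Lowercase the text and reduce each token to its root form (stem)
    or its dictionary form (lemma)."""
    tokens = text.lower().split()  # simple whitespace tokenization (assumption)
    if mode == "stem":
        tokens = [stemmer.stem(t) for t in tokens]
    elif mode == "lemma":
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)

# Placeholder corpus and labels; the study used English and Indonesian datasets.
texts = [
    "the cats are running quickly",
    "dogs barked loudly at the runners",
    "stocks were rising on the markets",
    "prices fell sharply during trading",
]
labels = [0, 0, 1, 1]

docs = [preprocess(t, mode="stem") for t in texts]  # or mode="lemma"

# TF-IDF feature extraction followed by SVM classification and evaluation.
X = TfidfVectorizer().fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=42
)
clf = SVC(kernel="linear")  # kernel and parameters are illustrative assumptions
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Switching `mode` between `"stem"` and `"lemma"` (or skipping preprocessing entirely) mirrors the kind of scenario comparison the study performs, with accuracy and runtime compared across the variants.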