Dewa Ayu Putu Rasmika Dewi
Monash University, Melbourne, Australia

Published: 1 Document
Articles

Evaluation Analysis of the Necessity of Stemming and Lemmatization in Text Classification
Ni Wayan Sumartini Saraswati; Christina Purnama Yanti; I Dewa Made Krishna Muku; Dewa Ayu Putu Rasmika Dewi
MATRIK : Jurnal Manajemen, Teknik Informatika dan Rekayasa Komputer Vol. 24 No. 2 (2025)
Publisher : LPPM Universitas Bumigora

DOI: 10.30812/matrik.v24i2.4833

Abstract

Stemming and lemmatization are text preprocessing methods that reduce words to their root form and to their canonical or dictionary form, respectively. Some previous studies report that stemming and lemmatization worsen the performance of text classification models, while others report that they have a positive impact on classification performance. This study analyzes the impact of stemming and lemmatization on text classification using the support vector machine (SVM) method, applied to English and Indonesian text datasets, and examines when these methods should be used. The research process consisted of several stages: text preprocessing using stemming and lemmatization, feature extraction with Term Frequency-Inverse Document Frequency (TF-IDF), classification using SVM, and model evaluation across four experiment scenarios. The experimental results show that stemming generally degrades the performance of the text classification model, especially on large and unbalanced datasets. Stemming achieved the best computation time, completing in 4 hours, 51 minutes, and 41.3 seconds on the largest dataset. Lemmatization positively impacts classification performance on small datasets, achieving 91.075% accuracy, but results in the worst computation time, especially on large datasets, taking 5 hours, 10 minutes, and 25.2 seconds. The experimental results also show that stemming on the balanced Indonesian dataset yields better text classification performance, reaching 82.080% accuracy.
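
The abstract describes a pipeline of preprocessing (stemming or lemmatization), TF-IDF feature extraction, SVM classification, and evaluation. The sketch below is a minimal Python illustration of such a pipeline, assuming NLTK for stemming/lemmatization and scikit-learn for TF-IDF and SVM; the toy corpus, the linear kernel, and the whitespace tokenization are assumptions for illustration, not the authors' actual datasets, scenarios, or configuration.

```python
import nltk
nltk.download("wordnet", quiet=True)  # corpus needed by WordNetLemmatizer
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess(text, mode="stem"):
    """Lowercase the text and reduce each token to its root form (stem)
    or its dictionary form (lemma)."""
    tokens = text.lower().split()  # simple whitespace tokenization (assumption)
    if mode == "stem":
        tokens = [stemmer.stem(t) for t in tokens]
    elif mode == "lemma":
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return " ".join(tokens)

# Placeholder corpus and labels; the study used English and Indonesian datasets.
texts = [
    "the cats are running quickly",
    "dogs barked loudly at the runners",
    "stocks were rising on the markets",
    "prices fell sharply during trading",
]
labels = [0, 0, 1, 1]

docs = [preprocess(t, mode="stem") for t in texts]  # or mode="lemma"

# TF-IDF feature extraction followed by SVM classification and evaluation.
X = TfidfVectorizer().fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=42
)
clf = SVC(kernel="linear")  # kernel and parameters are illustrative assumptions
clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```

Switching `mode` between `"stem"` and `"lemma"` (or skipping preprocessing entirely) mirrors the kind of scenario comparison the study performs, with accuracy and runtime compared across the variants.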