Spam email remains a significant problem in digital communication. Indonesian-language email is particularly hard to filter because of linguistic complexity, informal writing styles, and the similarity between spam and legitimate (ham) messages, all of which reduce the effectiveness of traditional spam filtering techniques. This study evaluates the Support Vector Machine (SVM) algorithm for classifying Indonesian spam emails using Term Frequency–Inverse Document Frequency (TF-IDF) weighting over N-gram features. The proposed approach first applies a text preprocessing pipeline (case folding, text cleaning, tokenization, stopword removal, and stemming) to reduce noise and improve feature representation. The text is then transformed into numerical vectors using TF-IDF with unigram and bigram configurations, which capture both individual terms and the contextual phrase patterns commonly found in spam emails. A linear-kernel SVM serves as the classification model, and its performance is evaluated with 10-fold cross-validation to obtain a more robust estimate and reduce evaluation bias. The model is assessed using accuracy, precision, recall, and F1-score. Experiments are conducted on the Indonesian Email Spam Dataset, which consists of 2,636 emails: 1,368 spam messages and 1,268 non-spam (ham) messages. Averaged over the 10 folds, the proposed model achieves an accuracy of 98.71%, precision of 98.34%, recall of 99.20%, and F1-score of 98.76%. This study contributes an efficient and lightweight spam detection model for Indonesian-language emails and provides empirical evidence that SVM combined with TF-IDF and N-gram features remains a reliable alternative to more complex deep learning approaches for medium-sized text datasets.
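The feature-extraction steps described above can be illustrated with a minimal, standard-library-only sketch. It is not the study's implementation: the stopword list, the cleaning rule, the smoothed IDF formula, and the function names (preprocess, ngrams, tfidf) are all assumptions made for illustration, and stemming is omitted because it needs an Indonesian stemmer that the abstract does not name. In practice the resulting vectors would be passed to a linear-kernel SVM and scored with K-fold cross-validation using a machine learning library.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stopword list for illustration only;
# the study's actual stopword list is not given in the abstract.
STOPWORDS = {"dan", "yang", "di", "ke", "untuk", "ini", "itu"}


def preprocess(text):
    """Case folding, cleaning, tokenization, stopword removal (no stemming)."""
    text = text.lower()                     # case folding
    text = re.sub(r"[^a-z\s]", " ", text)   # cleaning: keep letters and whitespace (assumed rule)
    tokens = text.split()                   # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]


def ngrams(tokens, n_max=2):
    """Return unigrams up to n_max-grams as space-joined strings."""
    out = []
    for n in range(1, n_max + 1):
        out.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return out


def tfidf(docs, n_max=2):
    """Return one L2-normalised TF-IDF vector (a dict) per document.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is an
    assumption; the abstract does not state the exact weighting variant.
    """
    term_lists = [ngrams(preprocess(d), n_max) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for terms in term_lists:
        tf = Counter(terms)
        vec = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


# Made-up example messages, not taken from the dataset.
docs = [
    "Selamat! Anda menang hadiah uang tunai",
    "Rapat tim dijadwalkan besok pagi",
    "Klaim hadiah uang tunai Anda sekarang",
]
vectors = tfidf(docs)
```

In this toy corpus the bigram "uang tunai" appears only in the two spam-like messages, which illustrates the kind of phrase-level signal that bigram features add on top of single terms.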
Copyrights © 2026