Spam email remains a substantial problem, burdening users with unsolicited messages. This study describes the use of a Support Vector Machine (SVM) combined with a TF-IDF vectorizer to classify emails as spam or non-spam. The model was developed using two publicly available, pre-processed datasets: the TREC 2007 Public Spam Corpus and the Enron-Spam Dataset. TF-IDF assigns greater weight to terms that are frequent within a message but rare across the corpus, and SVMs are well established for text classification. Combining the two, the model achieves an accuracy of 99.04%, a precision of 98.57%, and a recall of 99.62%. These results indicate that the model detects spam reliably while keeping false positives low, which is critical in real-world applications where legitimate emails must not be misclassified as spam. The study also discusses the rationale for choosing TF-IDF and SVM for spam classification and reports evaluation results that are consistent with existing literature, in which the combination of SVM and TF-IDF has shown strong performance in spam detection.
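As an illustration of the two quantitative ideas in this abstract, the following minimal sketch computes TF-IDF weights and the three reported metrics in plain Python. It is not the study's implementation: the function names `tf_idf` and `classification_metrics`, the toy messages, and the labels are all hypothetical, and the smoothed IDF formula is one common variant assumed here because the exact weighting used in the study is not stated. The SVM classifier itself is omitted.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: raw term frequency multiplied by a smoothed
    inverse document frequency, idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return weights


def classification_metrics(y_true, y_pred, positive="spam"):
    """Compute accuracy, precision, and recall for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Precision: share of messages flagged as spam that really are spam.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Recall: share of actual spam messages that were flagged.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical toy corpus, already tokenized.
toy_docs = [
    ["free", "prize", "click", "now"],
    ["meeting", "agenda", "now"],
    ["free", "meeting", "now"],
]
toy_weights = tf_idf(toy_docs)
print(toy_weights[0])

# Hypothetical true labels and predictions.
toy_true = ["spam", "spam", "ham", "ham", "spam"]
toy_pred = ["spam", "ham", "ham", "spam", "spam"]
print(classification_metrics(toy_true, toy_pred))
```

In the toy corpus, "now" appears in every message and so receives the lowest weight, while "prize" appears in only one message and receives the highest. High precision corresponds to few legitimate emails being flagged as spam, which is the false-positive concern raised above.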
Copyright © 2025