Analysis of the Impact of Data Oversampling on the Support Vector Machine Method for Stroke Disease Classification
Luh Ayu Martini; Gede Angga Pradipta; Roy Rudolf Huizen
Journal of Electronics, Electromedical Engineering, and Medical Informatics Vol 7 No 2 (2025): April
Publisher: Department of Electromedical Engineering, POLTEKKES KEMENKES SURABAYA

DOI: 10.35882/jeeemi.v7i2.698

Abstract

Data imbalance is a critical challenge in the classification of medical data, particularly in predicting stroke, a life-threatening condition requiring immediate intervention. The imbalance arises because non-stroke cases greatly outnumber stroke cases, which can bias models toward the majority class. Consequently, a model may struggle to identify stroke cases correctly, resulting in lower recall and an increased risk of misdiagnosis. This study evaluates the impact of several oversampling techniques, including Synthetic Minority Over-sampling Technique (SMOTE), Borderline-SMOTE, SMOTE-Edited Nearest Neighbor (SMOTE-ENN), and SMOTE-Instance Prototypes Filtering (SMOTE-IPF), along with feature selection using Information Gain and Chi-Square, to assess their influence on model performance. Oversampling addresses class imbalance by generating synthetic samples, thereby improving the representation of the minority class. Feature selection eliminates irrelevant or redundant features, enhancing both interpretability and computational efficiency. The dataset, obtained from Kaggle, consists of 5,110 records and 12 features. Support Vector Machine (SVM) is used as the classification algorithm, with evaluations conducted on Linear, Radial Basis Function (RBF), and Polynomial kernels. Experimental results indicate that the highest performance is achieved by the combination of Borderline-SMOTE and the RBF kernel, yielding an accuracy of 96.86%, precision of 98.65%, recall of 94.99%, and an F1-score of 96.79%. This model outperforms the others in stroke disease classification, demonstrating that integrating oversampling techniques can effectively enhance prediction accuracy. Future research could focus on implementing deep learning-based models to further optimize stroke classification on imbalanced data. These advancements are expected to enhance model performance, leading to a more effective and efficient approach for medical datasets.
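
To make the idea of "generating synthetic samples" concrete, the following is a minimal sketch of the basic SMOTE interpolation step, written with the Python standard library only. It is not the authors' implementation and does not cover the Borderline-SMOTE, SMOTE-ENN, or SMOTE-IPF variants; the function name `smote_sketch`, its parameters, and the toy data are assumptions made purely for illustration.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority-class
    samples and their nearest minority-class neighbours (basic SMOTE idea).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        # Choose one of the k nearest neighbours at random.
        neighbour = rng.choice(others[:k])
        # Place the new point at a random position on the segment
        # between the base sample and the chosen neighbour.
        gap = rng.random()
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy, made-up data: four minority-class points with two numeric features.
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority class gains new, nearby examples rather than exact duplicates. In practice, maintained implementations such as the `SMOTE` and `BorderlineSMOTE` classes in the imbalanced-learn library would likely be preferable to a hand-written version like this one.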