Indonesian Journal of Electronics, Electromedical Engineering, and Medical Informatics
Vol. 8 No. 2 (2026): May

A Two-Stage Hybrid Oversampling and Ensemble Learning Framework for Improved Type 2 Diabetes Mellitus Classification

Permatasari, Siti Fatimah Nurdiah
Ermatita, Ermatita



Article Info

Publish Date
02 May 2026

Abstract

Type 2 Diabetes Mellitus (T2DM) screening on clinical tabular data commonly suffers from class imbalance: non-diabetic records dominate diabetic cases, which biases models toward the majority class and yields poor detection of the positive (diabetic) class. This study aims to improve T2DM classification on an imbalanced dataset by increasing minority-class detection while maintaining acceptable overall performance. The main contribution is a leakage-safe framework that integrates two-stage hybrid oversampling (RandomOverSampler followed by Borderline-SMOTE) with soft-voting ensemble learning to obtain more balanced predictions. Experiments were conducted on the Diabetes Bangladesh (DiaBD) dataset, which contains 5,288 clinical records with a binary target, diabetic (Yes/No, mapped to 1/0). The data were split with stratification into train_full/test sets (80/20), and train_full was further split into train/validation sets (80/20). Features were normalized using a MinMaxScaler fitted only on the training set and then applied to the validation and test sets to prevent data leakage. Class imbalance handling was applied only to the training set using the proposed two-stage oversampling (ROS followed by Borderline-SMOTE; borderline-1, k_neighbors=3). Classification models included SVM (RBF), Random Forest, and Gradient Boosting, as well as soft-voting ensembles of two and three models. Results show that the baseline setting without oversampling (No OS) can achieve high accuracy but low minority detection; for instance, SVM (No OS) reached an accuracy of 0.9374 with a Recall_pos of 0.0909 and an F1_pos of 0.1587. After oversampling, SVM (OS) improved minority recall to 0.7273 with an F1_pos of 0.4188, although accuracy decreased to 0.8688 due to increased false positives. The best-balanced performance was achieved by the SVM + RandomForest soft-voting ensemble (OS), with an accuracy of 0.9125, a Recall_pos of 0.6545, and the highest F1_pos of 0.4932.
Overall, the proposed two-stage hybrid oversampling combined with soft-voting ensembles improves T2DM detection on imbalanced tabular data, and the findings highlight that model selection should prioritize Recall_pos and F1_pos rather than accuracy alone.
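To illustrate why accuracy alone can mislead on imbalanced data, the short sketch below computes accuracy, Recall_pos, and F1_pos from confusion-matrix counts. The counts are illustrative assumptions, not figures taken from the paper (the abstract does not report a confusion matrix); they were chosen only so that the rounded metrics line up with the SVM (No OS) values quoted in the abstract. The function name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, positive-class precision, recall, and F1 from counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision_pos": precision,
            "recall_pos": recall, "f1_pos": f1}


# Illustrative counts (assumption, NOT from the paper): 55 positives, 791 negatives.
weak_model = binary_metrics(tp=5, fp=3, fn=50, tn=788)

# A trivial classifier that predicts "non-diabetic" for every record.
all_negative = binary_metrics(tp=0, fp=0, fn=55, tn=791)

for name, m in [("weak model", weak_model), ("all-negative", all_negative)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
# weak model   -> accuracy 0.9374, recall_pos 0.0909, f1_pos 0.1587
# all-negative -> accuracy 0.935,  recall_pos 0.0,    f1_pos 0.0
```

Under these assumed counts, a model that finds only 5 of 55 positive cases is barely more accurate than one that finds none, while Recall_pos and F1_pos expose the difference immediately.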

Copyrights © 2026






Journal Info

Abbrev

ijeeemi

Subject

Computer Science & IT; Control & Systems Engineering; Decision Sciences, Operations Research & Management; Electrical & Electronics Engineering; Health Professions; Materials Science & Nanotechnology

Description

Indonesian Journal of Electronics, Electromedical Engineering, and Medical Informatics (IJEEEMI) publishes peer-reviewed, original research and review articles in an open-access format. Accepted articles span the full extent of the fields of Electronics, Biomedical, and Medical Informatics. IJEEEMI seeks to ...