Non-communicable diseases, especially cardiovascular and chronic respiratory conditions, account for a large share of Indonesia’s healthcare burden and BPJS expenditure. Health claim data often suffer from class imbalance, multicollinearity, and outliers, all of which impair model accuracy. This study evaluates how data exploration and preprocessing techniques, namely winsorizing, correlation and variance inflation factor (VIF) analysis, variable selection, and the Synthetic Minority Over-sampling Technique (SMOTE), affect the performance of tree-based classifiers. The dataset comprises 497,439 BPJS health insurance claims from 2022 with 27 predictors (14 numerical and 13 categorical). Two data pipelines were compared: one without preprocessing and one incorporating systematic data exploration. Five tree-based models were tested: Decision Tree, Extra Trees, Random Forest, XGBoost, and LightGBM. Model performance was assessed using F1-score, balanced accuracy, and G-mean across 20 stratified cross-validations. The results show that preprocessing substantially improves classification fairness and accuracy. Bagging models, particularly Random Forest, showed the largest improvement, with balanced accuracy and G-mean increasing from around 0.93 to 0.99. Boosting models showed modest gains. These findings indicate that rigorous data exploration enhances ensemble classifier performance, enabling more reliable disease classification and supporting fairer, data-driven decision-making in BPJS health management.
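To make the two imbalance-aware metrics concrete, the following is a minimal sketch, not the study's own code, of how balanced accuracy and G-mean can be computed from per-class recall. It assumes the common definitions (balanced accuracy as the arithmetic mean of per-class recalls, G-mean as their geometric mean); the function names and the toy labels are hypothetical and are not taken from the BPJS data.

```python
import math
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Return {label: recall} for every label that appears in y_true."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    return {label: hits[label] / support[label] for label in support}


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of per-class recalls (assumed definition)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return sum(recalls) / len(recalls)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls (assumed definition)."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1 / len(recalls))


# Hypothetical toy labels: 8 majority-class claims, 2 minority-class claims.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [1, 0]  # one minority-class claim is misclassified

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
print(g_mean(y_true, y_pred))             # about 0.707
```

In this toy case plain accuracy is 0.9 even though half of the minority class is missed, while balanced accuracy (0.75) and G-mean (about 0.707) both penalize the missed minority cases, which is why such metrics are preferred for imbalanced claim data.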
Copyrights © 2025