Class imbalance is a common challenge in medical classification, including the diagnosis of tuberculosis (TB), where positive cases are far fewer than negative cases. This imbalance can degrade model performance, particularly in detecting the minority class. This study evaluates the performance of the Random Forest method in classifying imbalanced TB data when combined with the ADASYN and Tomek Links resampling techniques. The dataset was obtained from the Cisarua Public Health Center (Puskesmas) in Bogor and consists of 1,069 patient records with 15 features and one target label. The research process comprised data preprocessing, one-hot encoding, data splitting, ADASYN oversampling to generate synthetic minority-class samples, and Tomek Links to remove ambiguous samples in regions where the classes overlap. Performance was evaluated with accuracy, precision, recall, and F1-score under both hold-out and k-fold cross-validation schemes. The combination of ADASYN and Tomek Links improved the positive-class F1-score from 0.67 to 0.71 in the hold-out evaluation and reached 0.9129 in the cross-validation evaluation. These findings indicate that the proposed approach is effective in addressing class imbalance and could be integrated into clinical decision-support systems at community health centers (Puskesmas) to aid early detection of TB cases.
Copyright © 2025
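To make the Tomek Links cleaning step mentioned in the abstract concrete, the following is a minimal, standard-library sketch, not the study's actual implementation. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor. The function names, the toy data, and the choice to drop both members of each link are assumptions made for illustration; the abstract does not state which variant was used, and in practice a library such as imbalanced-learn would typically be applied after ADASYN oversampling.

```python
from math import dist  # Euclidean distance, Python 3.8+


def nearest_neighbor(i, points):
    """Index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def find_tomek_links(points, labels):
    """Return pairs (i, j) with i < j that are mutual nearest neighbors
    and carry different class labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_links(points, labels):
    """Drop both members of every Tomek link (assumed variant; other
    variants drop only the majority-class member)."""
    drop = {k for pair in find_tomek_links(points, labels) for k in pair}
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical toy data: label 0 = negative, label 1 = positive.
points = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0), (3.1, 3.0)]
labels = [0, 0, 0, 1, 1, 1]

links = find_tomek_links(points, labels)
clean_points, clean_labels = remove_tomek_links(points, labels)
```

In this toy example, the samples at indices 2 and 3 sit very close together but belong to different classes, so they form the only Tomek link and are removed as ambiguous. The brute-force neighbor search is quadratic in the number of samples and requires at least two points; it is intended only to show the idea.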