Thyroid disease is a major health threat, and early detection enables more effective medical intervention. This study develops a classification model using the XGBoost algorithm to categorize patient clinical data, obtained from the Kaggle platform, into three levels of thyroid cancer risk: low, moderate, and high. Data processing follows the stages of the SEMMA (Sample, Explore, Modify, Model, Assess) methodology, applying label encoding, stratified 5-fold cross-validation, and hyperparameter tuning. Performance was evaluated using accuracy, F1-score, and AUC-ROC. The results show that the model performs very well in detecting low-risk cases (AUC = 1.00) but still struggles to classify the moderate- and high-risk categories. After hyperparameter tuning, the validation accuracy increased to 96.24%, although the final accuracy on the test data remained at 69.85%. These findings suggest that XGBoost is a promising approach for the early assessment of thyroid disease risk, particularly for detecting low-risk cases. However, further model development is needed to improve generalization across risk levels and to support informed clinical decision-making.
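As an illustration of two of the steps named above, the following minimal sketch shows label encoding of the three risk levels, a stratified 5-fold split, and per-class versus macro-averaged F1. It is not the study's implementation: it uses only the Python standard library, omits the XGBoost model, and all names (`RISK_CODES`, `stratified_kfold`, `per_class_f1`) and the toy labels are hypothetical, not taken from the Kaggle data.

```python
import random
from collections import defaultdict

# Label encoding: map each risk level to an integer code (assumed mapping).
RISK_CODES = {"low": 0, "moderate": 1, "high": 2}


def stratified_kfold(labels, k=5, seed=0):
    """Split sample indices into k folds that keep roughly the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def per_class_f1(y_true, y_pred, classes):
    """Return {class: F1}, using F1 = 2*TP / (2*TP + FP + FN)."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred, classes)
    return sum(scores.values()) / len(scores)


# Toy, made-up labels: an imbalanced three-class problem.
raw_labels = ["low"] * 60 + ["moderate"] * 25 + ["high"] * 15
encoded = [RISK_CODES[name] for name in raw_labels]
folds = stratified_kfold(encoded, k=5)

# Toy predictions: the low class is perfect, the other two are confused.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 2, 1, 2]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_by_class = per_class_f1(y_true, y_pred, classes=[0, 1, 2])
```

In this toy case the overall accuracy is higher than the macro-averaged F1 because the majority class is predicted perfectly while the two minority classes are confused with each other, which is why per-class metrics are worth reporting alongside accuracy in an imbalanced multi-class setting.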
Copyrights © 2025