This study aims to improve the accuracy of lung cancer classification by applying a machine learning approach with features engineered from risk factor interactions. The data come from the Lung Cancer Risk Dataset on Kaggle, which contains 50,000 patient records with demographic, lifestyle, and medical condition variables. Preprocessing includes normalization, one-hot encoding, and the construction of interaction features that represent the nonlinear relationships among smoking habits, environmental exposure, and medical history. Two Random Forest models were compared: a baseline model without interaction features and an expanded model with them. The baseline model achieved an accuracy of 0.6973, while the expanded model achieved a marginally lower 0.6949 but offered better interpretability. Confusion matrices, feature importance plots, and SHAP analysis showed how the engineered features contribute to the interpretability of the model. These results indicate that interaction-based feature engineering can enrich model transparency and provide deeper clinical insights even without a gain in accuracy, and that it has the potential to be applied in clinical decision support systems and precision-based prediction models.
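The abstract does not say how the interaction features were formed. As a minimal sketch, assuming each interaction is the pairwise product of two risk-factor values, the construction could look like the following. The column names (`pack_years`, `asbestos_exposure`, `copd_diagnosis`), the helper `add_interaction_features`, and the sample record are all hypothetical and are not taken from the dataset.

```python
from itertools import combinations

# Hypothetical risk-factor columns; the dataset's real column names may differ.
RISK_FACTORS = ("pack_years", "asbestos_exposure", "copd_diagnosis")


def add_interaction_features(record, factors=RISK_FACTORS):
    """Return a copy of `record` with one product term per pair of factors.

    Assumption: an interaction is modeled as the product of two numeric
    (already encoded) risk-factor values. The study may use another form.
    """
    expanded = dict(record)  # leave the caller's record untouched
    for a, b in combinations(factors, 2):
        expanded[f"{a}_x_{b}"] = record[a] * record[b]
    return expanded


# Illustrative patient record (made-up values, not from the dataset).
patient = {
    "age": 61,
    "pack_years": 30.0,
    "asbestos_exposure": 1,
    "copd_diagnosis": 0,
}

expanded = add_interaction_features(patient)
for name in sorted(k for k in expanded if "_x_" in k):
    print(name, expanded[name])
```

Under this assumption, the baseline model would be trained on the original columns and the expanded model on the output of the helper, for example with scikit-learn's `RandomForestClassifier`, so that any difference in accuracy or feature importance can be attributed to the added product terms.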
Copyrights © 2025