This study evaluates three machine learning models (Logistic Regression, Naive Bayes, and Random Forest) for predicting lung cancer using a publicly available dataset from Kaggle. The data comprise demographic information, risk factors, and diagnostic imaging features, and show a significant class imbalance between benign and malignant cases. To address this imbalance, the Synthetic Minority Over-sampling Technique (SMOTE) was applied. In addition, Principal Component Analysis (PCA) was used for dimensionality reduction and Recursive Feature Elimination (RFE) for feature selection, both with the aim of improving model performance. Random Forest, especially when combined with PCA, outperformed the other models, reaching the highest accuracy of 96.77% with an F1 score of 0.50 for the minority class. Logistic Regression also achieved high accuracy but was less effective at predicting the minority class, especially when combined with RFE. Naive Bayes showed moderate performance and was limited by its assumption of feature independence. Applying SMOTE significantly improved the models' ability to handle class imbalance, and PCA proved more effective than RFE in improving model performance. This study highlights the importance of selecting appropriate machine learning models and preprocessing techniques for lung cancer prediction. Random Forest, with its ability to model complex relationships and handle imbalanced data, emerged as the most effective model for this task. These findings underscore the potential of machine learning in medical diagnostics and provide a basis for future research.
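The gap between the reported accuracy (96.77%) and the minority-class F1 score (0.50) can be illustrated with a short arithmetic sketch. The confusion-matrix counts below are hypothetical, chosen only because they reproduce both reported figures; they are not the study's actual test results, and the helper name `binary_metrics` is likewise an illustrative assumption.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 for the positive (minority) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts (assumption, not taken from the study): 62 test cases,
# of which only 2 belong to the minority class.
metrics = binary_metrics(tp=1, fp=1, fn=1, tn=59)

print(f"accuracy    = {metrics['accuracy']:.4f}")
print(f"minority F1 = {metrics['f1']:.2f}")
```

Under these assumed counts, 60 of 62 predictions are correct, yet only one of the two minority cases is detected, so accuracy stays high while the minority-class F1 remains at 0.50. This is why accuracy alone is a weak summary on imbalanced data.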
Copyrights © 2025