Lung diseases are among the leading causes of death worldwide, and early, accurate diagnosis is needed to reduce the risk of complications. Classification models based on artificial intelligence are a promising way to support the diagnostic process, particularly for categorical data describing symptoms and risk factors such as coughing, shortness of breath, and smoking history. This study proposes a lung disease classification model that uses the K-Nearest Neighbor (K-NN) algorithm with a simple distance measure for categorical data, the Hamming distance. Because the dataset is imbalanced, the classes were balanced with random oversampling. The model was evaluated under two schemes, data splitting and 10-fold cross-validation, across multiple values of the parameter k. The best results were obtained at k = 7, with an accuracy of 94.58%, precision of 95.25%, recall of 94.39%, and an F1-score of 94.53%. These findings indicate that combining the K-NN algorithm, Hamming distance, and random oversampling can yield high and stable classification performance on categorical datasets for lung disease prediction.
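
The pipeline summarized above (random oversampling followed by K-NN with Hamming distance) can be illustrated with a minimal Python sketch. This is not the study's implementation: the function names, the three toy features (cough, shortness of breath, smoker), and the seven toy records are all hypothetical and chosen only to show how the pieces fit together.

```python
import random
from collections import Counter


def hamming_distance(a, b):
    """Count the positions at which two equal-length categorical records differ."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of features")
    return sum(x != y for x, y in zip(a, b))


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [row for row, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def knn_predict(X_train, y_train, query, k=7):
    """Majority vote among the k training records nearest to the query
    under Hamming distance."""
    order = sorted(
        range(len(X_train)),
        key=lambda i: hamming_distance(X_train[i], query),
    )
    votes = Counter(y_train[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Hypothetical toy records: (cough, shortness_of_breath, smoker).
# These are invented for illustration and are not the study's dataset.
X_toy = [
    ("yes", "yes", "yes"),
    ("yes", "yes", "no"),
    ("no", "no", "no"),
    ("no", "no", "yes"),
    ("no", "yes", "no"),
    ("yes", "no", "no"),
    ("no", "no", "no"),
]
y_toy = ["positive", "positive",
         "negative", "negative", "negative", "negative", "negative"]

# Balance the classes, then classify a new record with k = 7.
X_bal, y_bal = random_oversample(X_toy, y_toy)
prediction = knn_predict(X_bal, y_bal, ("yes", "yes", "yes"), k=7)
```

In this sketch the oversampling is applied to the whole toy set for brevity. In a cross-validation setting, oversampling is usually applied to the training folds only, so that duplicated records do not appear in both the training and test portions; the abstract does not state which order the study used.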
Copyrights © 2026