Diabetes mellitus is a chronic metabolic disorder characterized by elevated blood glucose levels, and early, accurate detection is needed to prevent long-term complications. Machine learning plays a growing role in data-driven diagnostics, and the Naive Bayes algorithm is widely used for its simplicity, transparency, and efficiency. This study evaluates the classification performance of Naive Bayes for early diabetes screening on a clinical dataset containing incomplete and heterogeneous medical records. Pre-processing consisted of data cleaning, replacing missing values with the median, labeling patients as diabetic when glycosylated hemoglobin (glyhb) was ≥6.5%, taking steps to prevent data leakage, and converting categorical variables into numerical form. The model was trained using a 70:30 split and evaluated on accuracy, precision, recall, F1 score, and AUC. The classifier achieved an accuracy of 90.81% and an AUC of 0.919, outperforming standard baseline Naive Bayes implementations, which typically report accuracies of 76-78% on similar datasets. Despite this strong overall performance, the model's sensitivity in identifying positive diabetes cases varied, largely because of class imbalance. Naive Bayes is therefore considered reliable as a preliminary screening method, but oversampling or cost-sensitive learning is recommended to improve recall and ensure more accurate patient identification in future clinical applications.
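The workflow summarized above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the records are synthetic, the field names other than glyhb (glucose, age, gender) are hypothetical, "preventing data leakage" is assumed to mean excluding glyhb from the features and computing medians on the training portion only, and the classifier is a plain Gaussian Naive Bayes written with the standard library. AUC is omitted for brevity.

```python
import math
import random
from statistics import mean, median, pvariance

THRESHOLD = 6.5  # glyhb cutoff (%) stated in the abstract

def make_records(n=200, seed=0):
    """Synthetic stand-in data; field names other than glyhb are hypothetical."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        glyhb = rng.gauss(5.6, 1.2)
        rows.append({
            "glyhb": glyhb,
            # About 5% of glucose values are missing, to exercise imputation.
            "glucose": None if rng.random() < 0.05 else 20 * glyhb + rng.gauss(0, 15),
            "age": rng.gauss(45, 15),
            "gender": rng.choice(["female", "male"]),
        })
    return rows

def label(glyhb):
    """1 = diabetic (glyhb >= 6.5%), 0 = non-diabetic."""
    return 1 if glyhb >= THRESHOLD else 0

def to_features(row):
    # Assumption: glyhb is excluded because the label is derived from it.
    return [row["glucose"], row["age"], 1.0 if row["gender"] == "male" else 0.0]

def impute_median(train_X, test_X):
    """Fill None with the per-column median computed on the training rows only."""
    for j in range(len(train_X[0])):
        med = median(x[j] for x in train_X if x[j] is not None)
        for x in train_X + test_X:
            if x[j] is None:
                x[j] = med

def fit_gaussian_nb(X, y):
    """Store the prior, feature means, and feature variances for each class."""
    model = {}
    for c in set(y):
        rows = [x for x, t in zip(X, y) if t == c]
        cols = list(zip(*rows))
        model[c] = (len(rows) / len(X),
                    [mean(col) for col in cols],
                    [pvariance(col) + 1e-9 for col in cols])  # avoid zero variance
    return model

def predict(model, x):
    """Return the class with the highest log posterior."""
    def log_posterior(c):
        prior, means, variances = model[c]
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=log_posterior)

def metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def run(seed=0):
    rows = make_records(seed=seed)
    random.Random(seed).shuffle(rows)
    cut = len(rows) * 7 // 10  # 70:30 split
    train, test = rows[:cut], rows[cut:]
    train_X = [to_features(r) for r in train]
    test_X = [to_features(r) for r in test]
    train_y = [label(r["glyhb"]) for r in train]
    test_y = [label(r["glyhb"]) for r in test]
    impute_median(train_X, test_X)
    model = fit_gaussian_nb(train_X, train_y)
    return metrics(test_y, [predict(model, x) for x in test_X])

if __name__ == "__main__":
    print(run())
```

Reporting recall alongside accuracy matters here because, with few positive cases, a classifier can score high accuracy while still missing a sizable share of diabetic patients, which is the imbalance effect the abstract describes.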
Copyrights © 2026