This research presents an AI-driven framework for multi-disease classification from natural-language symptom descriptions, built on preprocessing techniques oriented toward large language models (LLMs). The proposed system integrates three essential NLP steps (text normalization, lemmatization, and n-gram vectorization) to convert unstructured clinical symptom data into machine-readable form. A publicly available dataset comprising 8,498 samples across ten common diseases, including pneumonia, heart attack, diabetes, stroke, asthma, and depression, was used for training and evaluation. Data cleaning and balancing ensured uniform class representation, with 1,200 samples per disease category. The processed dataset was used to train supervised machine learning models (SVM, KNN, Decision Tree, Random Forest, and Extra Trees) in order to identify the most effective classifier. Experiments conducted in Google Colab showed that the ensemble models (Random Forest and Extra Trees) performed best, achieving 99% accuracy, precision, recall, and F1-score, while SVM and Decision Tree followed closely at 98% across the same metrics. Notably, the models consistently predicted pneumonia with high confidence for relevant input queries, which supports the framework's robustness. This work demonstrates the efficacy of combining LLM-compatible preprocessing with traditional ML classifiers for accurate disease detection from symptom narratives. The proposed approach serves as a foundational step toward scalable, intelligent healthcare support systems capable of real-time disease prediction and decision-making assistance.
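The preprocessing steps named above (text normalization, lemmatization, and n-gram vectorization) can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not specify. The function names, the tiny `LEMMAS` lookup table (a stand-in for a real lemmatizer), the choice of unigrams plus bigrams, and the use of raw counts are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny lemma table used only for illustration;
# a real system would rely on a full lemmatizer.
LEMMAS = {"coughing": "cough", "coughs": "cough",
          "pains": "pain", "breathing": "breathe"}

def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def lemmatize(tokens):
    """Map each token to its lemma if the table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

def ngrams(tokens, n_max=2):
    """Return all word n-grams for n = 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        grams.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return grams

def vectorize(texts, n_max=2):
    """Build a sorted n-gram vocabulary and a count matrix (one row per text)."""
    docs = [ngrams(lemmatize(normalize(t).split()), n_max) for t in texts]
    vocab = sorted({g for doc in docs for g in doc})
    index = {g: i for i, g in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for gram, count in Counter(doc).items():
            row[index[gram]] = count
        matrix.append(row)
    return vocab, matrix

# Two made-up symptom descriptions (not taken from the dataset).
texts = ["Coughing, fever and chest pains!", "Chest pain while breathing"]
vocab, X = vectorize(texts)
```

Each row of `X` is a fixed-length feature vector that a supervised classifier such as those listed above could consume. After lemmatization both example texts share the bigram `chest pain`, which is the kind of overlap n-gram features are meant to expose.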
Copyrights © 2025