Classifying drugs as Prescription (Rx) or Over-the-Counter (OTC) is a core task in pharmaceutical governance because it directly affects patient safety, drug access, and regulatory compliance. Large-scale pharmaceutical data, however, typically consists of heterogeneous categorical variables and short texts such as product names or indications, which introduces duplication, inconsistencies, and potential class imbalance. These conditions call for a modeling approach that is not only accurate but also lightweight and explainable. This study proposes a hybrid ensemble that combines three algorithms, CART, Random Forest, and LightGBM, through a weighted soft-voting mechanism, pairing the transparency of decision trees with the reliability of modern boosting. The main contribution is to show that a low-complexity, domain-based pipeline can produce accurate, efficient, and easily auditable Rx/OTC classifications for both clinical and regulatory needs. The pre-processing pipeline applies TF-IDF to short text and One-Hot Encoding to categorical features, and adds simple dosage variables. All features were combined into a single feature matrix, and the ensemble was trained with voting weights [1, 1, 8]. Evaluation covered Accuracy, Precision, Recall, F1-score, ROC-AUC, Brier score, the confusion matrix, and the ROC curve. Results on a balanced dataset of 50,000 samples showed consistent in-sample performance: Accuracy = 0.742; Precision = 0.742; Recall = 0.742; F1 = 0.742; ROC-AUC = 0.819; Brier score = 0.214. The model distinguishes the two classes stably, with False Positive and False Negative errors in balance. In conclusion, this lightweight ensemble offers competitive predictive performance together with interpretability, and therefore has potential for application in pharmacovigilance and drug classification. Future work should add cross-validation, probability calibration, and robustness tests on out-of-distribution data to strengthen the reliability of the model.
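The weighted soft-voting step and the Brier score mentioned above can be illustrated with a minimal sketch in plain Python. It is not the study's implementation: the function names and the probability values are hypothetical, and mapping the weights [1, 1, 8] to CART, Random Forest, and LightGBM in that order is an assumption, since the abstract does not state the order.

```python
from typing import List, Sequence


def weighted_soft_vote(probas: Sequence[Sequence[float]],
                       weights: Sequence[float]) -> List[float]:
    """Weighted average of each model's positive-class probability, per sample."""
    total = sum(weights)
    n_samples = len(probas[0])
    return [
        sum(w * p[i] for w, p in zip(weights, probas)) / total
        for i in range(n_samples)
    ]


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Hypothetical positive-class (e.g. Rx) probabilities for four samples.
p_cart = [1.0, 0.0, 1.0, 0.0]   # a single tree often outputs hard 0/1 values
p_rf = [0.7, 0.4, 0.6, 0.2]
p_lgbm = [0.9, 0.2, 0.3, 0.1]

# Assumed order: CART, Random Forest, LightGBM.
weights = [1, 1, 8]
y_true = [1, 0, 1, 0]           # hypothetical ground-truth labels

p_ens = weighted_soft_vote([p_cart, p_rf, p_lgbm], weights)
labels = [int(p >= 0.5) for p in p_ens]   # 0.5 decision threshold (assumed)
score = brier_score(y_true, p_ens)

print(p_ens, labels, score)
```

In the third hypothetical sample, two of the three models assign a probability above 0.5, yet the ensemble probability falls below the threshold because the model with weight 8 dominates the average. This is the practical consequence of a [1, 1, 8] weighting: the ensemble largely follows its most heavily weighted member, while the other two mainly temper its probabilities.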
Copyrights © 2025