Predicting corrosion inhibition efficiency (IE, %) is often hindered by small, heterogeneous datasets. This study proposes a Gaussian mixture-based data augmentation pipeline to strengthen QSAR generalization under data scarcity. A curated set of 70 drug-like compounds with 14 physicochemical and quantum descriptors was cleaned, split 90/10 (train/test), and transformed using a Quantile Transformer followed by a Robust Scaler. A Gaussian mixture model (GMM) with 2–5 components, selected by the variational lower bound, was fitted to the transformed training features and used to generate up to 2,500 synthetic samples. Eight regressors (Gaussian Process, Decision Tree, Random Forest, Bagging, Gradient Boosting, Extra Trees, SVR, and KNN) were evaluated on the held-out test set using R² and RMSE. Augmentation improved performance across several model families. For example, Gaussian Process R² improved from −1.54 to 0.54 (RMSE from 11.71 to 5.01) and Decision Tree R² from −0.33 to 0.63 (RMSE from 8.48 to 4.44), while Bagging and Random Forest showed R² increases of 0.67 and 0.40, respectively. The optimal synthetic sample size varied by model.
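The augmentation steps summarized above can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation. The random placeholder data, the normal output distribution for the Quantile Transformer, the diagonal covariance with extra regularization, the fixed seeds, and the helper name `fit_gmm_by_lower_bound` are all assumptions made for illustration. The `lower_bound_` attribute of `GaussianMixture` is used here as a stand-in for the selection criterion named in the abstract.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import QuantileTransformer, RobustScaler

# Placeholder data (assumption): 70 compounds x 14 descriptors, IE in percent.
rng = np.random.default_rng(0)
X = rng.normal(size=(70, 14))
y = rng.uniform(50.0, 100.0, size=70)

# 90/10 split; an integer test size keeps exactly 7 compounds held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=7, random_state=0
)

# Quantile Transformer followed by Robust Scaler, fitted on training data only.
scaler = make_pipeline(
    QuantileTransformer(
        n_quantiles=len(X_train),      # cannot exceed the number of samples
        output_distribution="normal",  # assumption: not stated in the abstract
        random_state=0,
    ),
    RobustScaler(),
)
Z_train = scaler.fit_transform(X_train)


def fit_gmm_by_lower_bound(Z, candidates=range(2, 6), seed=0):
    """Hypothetical helper: fit one GMM per component count and keep the
    one with the highest lower bound on the log-likelihood."""
    best = None
    for k in candidates:
        gmm = GaussianMixture(
            n_components=k,
            covariance_type="diag",  # assumption: stable for 63 samples in 14-D
            reg_covar=1e-3,          # assumption: guards against tiny variances
            max_iter=500,
            random_state=seed,
        ).fit(Z)
        if best is None or gmm.lower_bound_ > best.lower_bound_:
            best = gmm
    return best


gmm = fit_gmm_by_lower_bound(Z_train)

# Draw synthetic feature vectors in the transformed space (up to 2,500).
Z_syn, component_ids = gmm.sample(2500)

# Map the synthetic features back to the original descriptor units.
X_syn = scaler.inverse_transform(Z_syn)

print(gmm.n_components, Z_syn.shape, X_syn.shape)
```

This sketch generates synthetic descriptor vectors only. The abstract does not state how IE values are assigned to synthetic samples, so that step is left out rather than guessed. Fitting the transformers and the GMM on the training split alone keeps the held-out compounds out of the augmentation.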
Copyright © 2025