Classification of imbalanced multiclass datasets remains a major challenge in machine learning across many application domains, including medical diagnostics, fraud detection, and image classification, where the minority classes are the most crucial yet under-represented. Classical classification algorithms designed for balanced data tend to be biased toward the majority classes, misclassifying a large share of minority-class instances and thereby compromising the model's performance. This review covers the main state-of-the-art techniques for the class-imbalance problem, including under-sampling and over-sampling, ensemble approaches, cost-sensitive learning, and synthetic data generation via SMOTE (the synthetic minority over-sampling technique). More recently, GANs (Generative Adversarial Networks) have also been employed to generate synthetic data, which is especially valuable for complex datasets where realistic data augmentation is needed. Each technique is analyzed in terms of its ability to handle imbalanced data, evaluated both with conventional metrics such as accuracy and with metrics designed for imbalanced datasets such as the F1-score and the G-mean. Recent advances, such as hybrid approaches and deep-learning-based methods, are also discussed as viable solutions to the challenges posed by high-dimensional, large-scale data and the models built on it. This comparative analysis should facilitate the construction of more robust models for complex data in modern applications.
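As a concrete illustration of two ideas surveyed above, the sketch below implements the core SMOTE interpolation step and the G-mean metric in plain Python. The function names, parameters, and toy data are illustrative assumptions for this sketch, not taken from any specific work in the review; production use would typically rely on a library such as imbalanced-learn.

```python
import math
import random

def smote_sample(minority, k=2, n_new=4, seed=0):
    """SMOTE's core step: create synthetic minority points by linear
    interpolation between a minority sample and one of its k nearest
    minority-class neighbours (Euclidean distance)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x within the minority class, excluding x
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity and specificity (binary case):
    unlike accuracy, it stays low when either class is poorly recalled."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)
```

For example, a classifier with 90% sensitivity but only 50% specificity scores a G-mean of about 0.67, exposing the weak minority-class performance that plain accuracy would mask.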
Copyright © 2025