Manual resume screening is inefficient and prone to bias, yet comprehensive benchmarks of machine learning models on imbalanced, real-world recruitment data remain scarce. This study addresses the gap by benchmarking seven models from the classical, ensemble, and deep learning paradigms for automated resume classification. Using a private dataset of 2,483 resumes across 24 job categories, it evaluates the models with separate TF-IDF and BERT embedding feature pipelines and an adaptive strategy for handling class imbalance (class weights, SMOTE, SMOTEENN). XGBoost achieved the highest performance (weighted F1-score of 0.779), followed by the competitive BERT (F1 0.728) and Random Forest (F1 0.711) models. Despite the imbalance-handling techniques, all models struggled with extreme minority classes, which points to data scarcity as a primary limitation. The study provides a benchmark and an evidence-based framework for HR practitioners, highlighting the trade-off between predictive performance (XGBoost), interpretability (Random Forest), and semantic capability (BERT). The findings indicate that the primary challenge is data representation, steering future work towards data augmentation and fairness audits.
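The headline metric above is the weighted F1-score, which averages per-class F1 in proportion to each class's number of true examples. The sketch below is an illustration only, not the study's evaluation code: the function names and the two-class toy labels are assumptions introduced here. It shows how a weighted F1 can stay high while a minority class is missed entirely, which is consistent with the reported difficulty on extreme minority classes.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denom = 2 * tp + fp + fn
    # Convention assumed here: F1 is 0 when the class is never hit.
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Average per-class F1, weighted by each class's share of true labels."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(
        per_class_f1(y_true, y_pred, label) * count / total
        for label, count in support.items()
    )


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; every class counts equally."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical toy data: nine majority-class resumes, one minority-class resume,
# and a classifier that always predicts the majority class.
toy_true = ["majority"] * 9 + ["minority"]
toy_pred = ["majority"] * 10

print(round(weighted_f1(toy_true, toy_pred), 3))
print(round(macro_f1(toy_true, toy_pred), 3))
```

In this toy case the weighted score is dominated by the majority class, while the macro score is pulled down by the missed minority class. Reporting both is one way to make minority-class failures visible alongside a weighted headline number.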
Copyrights © 2025