Recruiter screening increasingly relies on large language model (LLM)-assisted workflows, but high-stakes applications require reproducible matching, calibrated probabilities, and reliable handling of uncertain cases. This study evaluates a screening framework that combines matching, calibration, and selective refusal using two public datasets: resume-job-description-fit for supervised pairwise learning and Resume-Screening-Dataset for benchmarking and external generalization. After deterministic preprocessing, we compared cosine similarity, alignment features, TF-IDF pairwise models, and hybrid models integrating text, alignment, and title information. The strongest probabilistic models were calibrated with Platt scaling and isotonic regression and then evaluated under confidence-based refusal. On the resume-job-description-fit test set, the best three-class model achieved a macro-F1 of 0.450. For binary shortlist-versus-reject screening, the title-augmented hybrid model obtained 0.654 balanced accuracy, 0.647 F1, and 0.699 AUROC. Platt scaling improved the probability estimates, reducing the Brier score from 0.232 to 0.226 and the negative log-likelihood from 0.772 to 0.675. Selective refusal further improved in-domain accuracy, while cross-dataset transfer remained weak, with AUROC near chance level (0.47–0.51). These results indicate that matching, calibration, and selective refusal support more trustworthy within-domain screening, although human review remains essential under distribution shift.
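For readers unfamiliar with the calibration and refusal steps named above, the following is a minimal illustrative sketch and not the study's implementation. It fits Platt scaling (a logistic map from raw score to probability) by plain gradient descent, computes the Brier score and negative log-likelihood, and applies confidence-based refusal. The function names, the toy scores and labels, the learning rate, and the 0.7 confidence threshold are all assumptions made for illustration; in practice the calibrator would be fitted on a held-out validation split rather than on the data it is evaluated on.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log loss."""
    a, b = 1.0, 0.0  # start from the uncalibrated mapping
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def brier(probs, labels):
    """Mean squared error between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def nll(probs, labels, eps=1e-12):
    """Mean negative log-likelihood, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def selective_predict(probs, threshold=0.7):
    """Predict 1 or 0 when confidence max(p, 1 - p) reaches the threshold,
    otherwise return None to refuse and defer the case to human review."""
    decisions = []
    for p in probs:
        confidence = max(p, 1.0 - p)
        if confidence >= threshold:
            decisions.append(1 if p >= 0.5 else 0)
        else:
            decisions.append(None)
    return decisions


if __name__ == "__main__":
    # Hypothetical raw matcher scores (logits) and shortlist labels.
    scores = [4.0, 3.0, 2.5, 2.0, 1.0, -1.0, -2.0, -2.5, -3.0, -4.0]
    labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

    a, b = fit_platt(scores, labels)
    raw = [sigmoid(s) for s in scores]
    calibrated = [sigmoid(a * s + b) for s in scores]

    print("Brier raw / calibrated:", brier(raw, labels), brier(calibrated, labels))
    print("NLL raw / calibrated:", nll(raw, labels), nll(calibrated, labels))
    print("Decisions:", selective_predict(calibrated))
```

Raising the threshold trades coverage for accuracy: more cases are refused, and the decisions that remain are those the calibrated model is most confident about.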
Copyrights © 2025