Fraud detection on payment transactions is an extremely imbalanced, high-stakes classification task in which deployment decisions depend not only on ranking quality but also on reliable probability estimates. We study credit card fraud detection on a standard real-transaction benchmark (284,807 transactions; 492 frauds) and target two deployment requirements: cost-sensitive thresholding under asymmetric error costs and reliability calibration so model outputs can be interpreted as stable risk scores. We benchmark logistic regression and XGBoost and propose a focal-proxy reweighting scheme for boosted trees via iterative weight updates inspired by focal loss. Probabilities are calibrated on validation using Platt scaling, temperature scaling, and isotonic-style monotone calibration; the best calibrator is selected by minimum validation Brier score. For decision-making, we choose the operating threshold that minimizes expected cost, Cost(t) = 10 · FN(t) + 1 · FP(t), on validation, then evaluate on a held-out test set. On the benchmark split (train 199,364; validation 42,721; test 42,722), the calibrated XGBoost baseline achieves AUROC 0.973, AUPRC 0.812, fraud-class F1 0.767, and expected cost 154 with very low calibration error (ECE = 1.1 × 10⁻⁴). Overall, calibration reduces ECE and improves or maintains the Brier score, while cost-aware thresholding makes the FN/FP trade-off explicit via decision curves.
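The cost-aware threshold selection described above can be illustrated with a minimal sketch. It assumes binary labels (1 = fraud) and calibrated scores from the validation set. The function names `expected_cost` and `best_threshold`, the candidate grid (every distinct score plus a "flag nothing" threshold), and the tie-breaking rule (lowest threshold wins) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical helpers: a sketch of picking the threshold t that minimizes
# Cost(t) = 10 * FN(t) + 1 * FP(t) on validation data.

def expected_cost(y_true, probs, t, c_fn=10.0, c_fp=1.0):
    """Cost at threshold t, where a transaction is flagged when p >= t."""
    fn = sum(1 for y, p in zip(y_true, probs) if y == 1 and p < t)   # missed frauds
    fp = sum(1 for y, p in zip(y_true, probs) if y == 0 and p >= t)  # false alarms
    return c_fn * fn + c_fp * fp


def best_threshold(y_true, probs, c_fn=10.0, c_fp=1.0):
    """Return (threshold, cost) with the lowest cost among candidate thresholds.

    Assumption: candidates are the distinct scores plus +inf (flag nothing);
    ties are broken in favor of the lowest threshold.
    """
    candidates = sorted(set(probs)) + [float("inf")]
    scored = ((t, expected_cost(y_true, probs, t, c_fn, c_fp)) for t in candidates)
    return min(scored, key=lambda tc: tc[1])


# Toy validation data (made up for illustration only).
y_val = [0, 0, 1, 1, 0]
p_val = [0.10, 0.40, 0.35, 0.80, 0.05]
t_star, cost_star = best_threshold(y_val, p_val)  # (0.35, 1.0)
```

In this toy case, flagging everything scored 0.35 or higher catches both frauds at the price of one false positive, which is cheaper than any threshold that misses a fraud. The threshold chosen this way on validation would then be applied unchanged to the held-out test set.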
Copyright © 2025