Dynamic bidding and ranking systems must improve revenue or engagement while avoiding harmful regressions during deployment. This paper presents an end-to-end offline policy evaluation (OPE) and conservative policy-selection workflow for slot-level contextual bandit approximations of ranking decisions. Using the small Open Bandit Dataset (OBD-small) from ZOZOTOWN (ZOZO, Inc.), each logged row is treated as a context-dependent choice among discrete actions (items), with a binary click reward and a logged propensity. This formulation is suitable at the slot level but does not capture full listwise ranking or multi-step offline reinforcement learning. Empirically, highly deterministic evaluation policies exhibit extreme variance under sparse clicks, while the logistic reward model remains weak (ROC-AUC ≈ 0.5), limiting the interpretability of the direct method (DM) and doubly robust (DR) estimators. Clipped-DR mixing yields only limited certified improvements: in the women's campaign, gains appear only at moderate confidence (δ = 0.10) and for clipping caps up to M = 5, whereas stricter or looser settings revert to the baseline; in the men's campaign, certification is largely absent. These findings demonstrate that OPE diagnostics and conservative mixing enable reproducible offline selection under uncertainty, but they do not indicate deployment-ready improvements.
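For concreteness, the sketch below shows a weight-clipped DR value estimate of the kind described above. It is a minimal illustration under stated assumptions, not the paper's actual implementation; all names (`clipped_dr_value`, `q_hat_eval`, `clip_cap`) are hypothetical.

```python
import numpy as np

def clipped_dr_value(rewards, logged_propensities, eval_propensities,
                     q_hat_logged, q_hat_eval, clip_cap=5.0):
    """Clipped doubly robust (DR) off-policy value estimate (illustrative sketch).

    rewards             : (n,) binary click rewards r_i
    logged_propensities : (n,) behavior-policy propensities pi_0(a_i | x_i)
    eval_propensities   : (n,) evaluation-policy propensities pi_e(a_i | x_i)
    q_hat_logged        : (n,) reward-model predictions q_hat(x_i, a_i)
    q_hat_eval          : (n,) model-based values E_{a ~ pi_e}[q_hat(x_i, a)]
    clip_cap            : importance-weight cap M
    """
    # Importance weights, clipped at M to trade variance for bias.
    w = np.minimum(eval_propensities / logged_propensities, clip_cap)
    # DR = model-based baseline + clipped importance-weighted correction.
    return np.mean(q_hat_eval + w * (rewards - q_hat_logged))

# Synthetic usage (illustrative only; not OBD-small data):
rng = np.random.default_rng(0)
n = 1000
r = rng.binomial(1, 0.01, n)        # sparse binary clicks
p0 = rng.uniform(0.05, 0.5, n)      # logged propensities
pe = rng.uniform(0.0, 1.0, n)       # eval-policy propensities
q_a = np.full(n, 0.01)              # weak reward model (ROC-AUC ~ 0.5)
q_e = np.full(n, 0.01)
print(clipped_dr_value(r, p0, pe, q_a, q_e, clip_cap=5.0))
```

A conservative selection rule in the spirit of the workflow would then certify an improvement only if a lower confidence bound on this estimate at level δ (e.g., via bootstrap) exceeds the logged baseline value.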