Mu, Jinyi
Unknown Affiliation

Published: 2 Documents
Articles

Found 2 Documents

Offline Counterfactual Evaluation for Advertising and Recommendation Slot Policies: A Reproducible Study on the Open Bandit Dataset (Small) Mu, Jinyi; Ye, Tong; Patel, Priya
Journal of Technology Informatics and Engineering Vol. 4 No. 3 (2025): DECEMBER | JTIE : Journal of Technology Informatics and Engineering
Publisher : University of Science and Computer Technology

DOI: 10.51903/jtie.v4i3.500

Abstract

Offline (counterfactual) evaluation is a critical capability for iterating on advertising and recommender ranking strategies when online A/B testing is slow, expensive, or risky. Off-policy evaluation (OPE) estimates the expected reward of a candidate policy using logged interaction data collected under a different behavior policy. However, OPE can suffer from high variance under poor overlap, and it can be misleading when the operational objective is to choose among candidate policies rather than to minimize point-estimation bias alone. This paper presents a fully reproducible empirical study of the inverse propensity scoring (IPS), self-normalized IPS (SNIPS), doubly robust (DR), and Switch-DR estimators on the Open Bandit Dataset (OBD) small release. Using the Men and Women campaigns (10,000 logged item-impressions per campaign and behavior policy), collected under uniform random and Bernoulli Thompson Sampling (BTS) policies, we construct a held-out oracle for stationary slot-wise policies from the random-traffic split and evaluate both value estimation and policy-ranking consistency on random-logged and BTS-logged test sets. Across 1,000 nonparametric bootstrap replications, IPS and SNIPS are accurate on randomly logged data, whereas BTS-logged data exhibit extreme importance weights and very small effective sample sizes (ESS), making IPS-based ranking unreliable under weak support. Switch-DR is most useful in moderate-overlap regimes, where it truncates high-variance corrections. It nevertheless introduces bias that depends on the switching threshold and must therefore be stress-tested rather than treated as a universally superior estimator. Finally, we provide a structured reporting template, based on oracle decomposition, overlap diagnostics, and estimator components, for explaining why a policy appears better and how reliable that conclusion is.
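
For readers unfamiliar with the estimators named in the abstract, the following is a minimal sketch of their standard textbook forms, not the authors' code. The function name, argument names, and the default threshold `tau` are assumptions made for illustration; the abstract gives no implementation details. Switch-DR is written in its common form, where the importance-weighted correction is kept only when the weight is at most `tau`, and ESS uses the usual (sum of weights)^2 / (sum of squared weights) formula.

```python
import numpy as np

def ope_estimates(reward, pi_b, pi_e, q_logged, q_pi_e, tau=10.0):
    """Illustrative OPE estimators for logged bandit data (hypothetical helper).

    reward   : observed reward for each logged row
    pi_b     : behavior-policy propensity of the logged action
    pi_e     : evaluation-policy probability of the logged action
    q_logged : reward-model prediction for the logged action
    q_pi_e   : reward-model value under pi_e, i.e. sum_a pi_e(a|x) * q(x, a)
    tau      : Switch-DR threshold on the importance weight (assumed value)
    """
    reward = np.asarray(reward, dtype=float)
    q_logged = np.asarray(q_logged, dtype=float)
    q_pi_e = np.asarray(q_pi_e, dtype=float)
    w = np.asarray(pi_e, dtype=float) / np.asarray(pi_b, dtype=float)

    ips = np.mean(w * reward)
    snips = np.sum(w * reward) / np.sum(w)
    dr = np.mean(q_pi_e + w * (reward - q_logged))
    # Switch-DR: drop the correction term wherever the weight exceeds tau.
    w_switch = np.where(w <= tau, w, 0.0)
    switch_dr = np.mean(q_pi_e + w_switch * (reward - q_logged))
    # Effective sample size: small values signal poor overlap.
    ess = np.sum(w) ** 2 / np.sum(w ** 2)
    return {"ips": ips, "snips": snips, "dr": dr,
            "switch_dr": switch_dr, "ess": ess}
```

With weights that are all equal to 1 (evaluation policy identical to the behavior policy), IPS and SNIPS reduce to the mean logged reward and ESS equals the number of rows; concentrated weights shrink ESS, which is the overlap diagnostic the abstract refers to.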
Off-Policy Evaluation and Conservative Policy Selection for Slot-Level Dynamic Bidding and Ranking on the Open Bandit Dataset (Small) Ye, Tong; Mu, Jinyi; Hunter, James
Journal of Technology Informatics and Engineering Vol. 5 No. 1 (2026): APRIL | JTIE : Journal of Technology Informatics and Engineering
Publisher : University of Science and Computer Technology

DOI: 10.51903/jtie.v5i1.503

Abstract

Dynamic bidding and ranking systems must improve revenue or engagement while avoiding harmful regressions during deployment. This paper presents an end-to-end workflow for off-policy evaluation (OPE) and conservative policy selection, applied to slot-level contextual bandit approximations of ranking decisions. Using the small Open Bandit Dataset (OBD-small) from ZOZOTOWN (ZOZO, Inc.), each logged row is treated as a context-dependent choice among discrete actions (items), with binary click rewards and logged propensities. This formulation is suitable at the slot level but does not capture full listwise ranking or multi-step offline reinforcement learning. Empirically, highly deterministic evaluation policies exhibit extreme variance under sparse clicks, while the logistic reward model remains weak (ROC-AUC ≈ 0.5), limiting the interpretability of direct-method (DM) and doubly robust (DR) estimates. Clipped-DR mixing yields only limited certified improvements: in the women’s campaign, gains appear only at moderate confidence (δ=0.10) and for caps up to M=5, whereas stricter or looser settings revert to baseline; in the men’s campaign, certification is largely absent. These findings demonstrate that OPE diagnostics and conservative mixing enable reproducible offline selection under uncertainty, but do not indicate deployment-ready improvements.
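
The abstract does not state how certification is computed, so the sketch below is only one plausible reading of "conservative mixing", under loudly stated assumptions. It mixes the behavior and evaluation policies with weight `alpha`, caps the resulting importance weight at `cap` (playing the role of M), and certifies `alpha` only if a Hoeffding lower bound at confidence `delta` exceeds the logged baseline. It substitutes clipped IPS for the paper's clipped DR, and the function name, the `alphas` grid, and the choice of Hoeffding's inequality are all hypothetical.

```python
import math
import numpy as np

def certified_mixing_weight(reward, pi_b, pi_e, delta=0.10, cap=5.0,
                            alphas=(0.25, 0.5, 0.75, 1.0)):
    """Largest mixing weight whose lower bound beats the baseline, else 0.0.

    Assumes rewards in [0, 1], so each capped term lies in [0, cap] and
    Hoeffding's inequality applies. Clipped IPS (not clipped DR) is used
    here purely for illustration.
    """
    reward = np.asarray(reward, dtype=float)
    w = np.asarray(pi_e, dtype=float) / np.asarray(pi_b, dtype=float)
    n = reward.size
    baseline = reward.mean()  # on-policy value of the behavior policy
    # One-sided Hoeffding slack for terms bounded in [0, cap].
    slack = cap * math.sqrt(math.log(1.0 / delta) / (2.0 * n))

    best = 0.0  # alpha = 0 means "revert to baseline"
    for alpha in alphas:
        # Importance weight of the mixture (1 - alpha) * pi_b + alpha * pi_e.
        w_mix = np.minimum((1.0 - alpha) + alpha * w, cap)
        lower_bound = np.mean(w_mix * reward) - slack
        if lower_bound > baseline:
            best = max(best, alpha)
    return best
```

Because capping can only shrink nonnegative terms, the bound errs on the conservative side; a small `cap` or a strict `delta` makes certification harder, and an uncertified result falls back to `alpha = 0.0`, which corresponds to keeping the baseline policy.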