Mu, Jinyi
Unknown Affiliation

Published: 1 Document

Articles

Offline Counterfactual Evaluation for Advertising and Recommendation Slot Policies: A Reproducible Study on the Open Bandit Dataset (Small)
Mu, Jinyi; Ye, Tong; Patel, Priya
JTIE: Journal of Technology Informatics and Engineering, Vol. 4 No. 3 (2025): December
Publisher : University of Science and Computer Technology

DOI: 10.51903/jtie.v4i3.500

Abstract

Offline or counterfactual evaluation is a critical capability for iterating on advertising and recommender ranking strategies when online A/B testing is slow, expensive, or risky. Off-policy evaluation (OPE) estimates the expected reward of a candidate policy using logged interaction data from a different behavior policy. However, it can suffer from high variance under poor overlap and can be misleading when the operational objective is choosing among candidate policies rather than minimizing point-estimation bias alone. This paper presents a fully reproducible empirical study of inverse propensity scoring (IPS), self-normalized IPS (SNIPS), doubly robust (DR), and Switch-DR estimators on the Open Bandit Dataset (OBD) small release. Using the Men and Women campaigns (10,000 logged item impressions per campaign and behavior policy) collected under uniform-random and Bernoulli Thompson Sampling (BTS) policies, we construct a held-out oracle for stationary slot-wise policies from the random-traffic split and evaluate both value estimation and policy-ranking consistency on random-logged and BTS-logged test sets. Across 1,000 nonparametric bootstrap replications, IPS and SNIPS are accurate on randomly logged data, whereas BTS-logged data exhibit extreme importance weights and very small effective sample sizes (ESS), making IPS-based ranking unreliable under weak support. Switch-DR is most useful in moderate-overlap regimes, where it truncates high-variance corrections; however, it introduces bias that depends on the switching threshold and must therefore be stress-tested rather than treated as a universally superior estimator. Finally, we provide a structured reporting template, based on oracle decomposition, overlap diagnostics, and estimator components, for explaining why a policy appears better and how reliable that conclusion is.
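The estimators named in the abstract can be sketched compactly. This is a minimal illustration, not the paper's implementation: it assumes per-round arrays of logged rewards, behavior-policy propensities `pi_b`, and target-policy probabilities `pi_e` for the logged actions, plus (for Switch-DR) a reward-model prediction `q_logged` for each logged action and a model-based value `v_model` of the target policy; all names are illustrative. The Switch-DR form follows the common convention of keeping the importance-weighted correction only when the weight is below a threshold `tau`.

```python
import numpy as np

def ips(rewards, pi_e, pi_b):
    """Inverse propensity scoring: average of weight * reward."""
    w = pi_e / pi_b
    return float(np.mean(w * rewards))

def snips(rewards, pi_e, pi_b):
    """Self-normalized IPS: normalize by the sum of weights
    instead of the sample size, trading a small bias for variance."""
    w = pi_e / pi_b
    return float(np.sum(w * rewards) / np.sum(w))

def effective_sample_size(pi_e, pi_b):
    """Kish effective sample size of the importance weights;
    a small ESS signals weak overlap between policies."""
    w = pi_e / pi_b
    return float(np.sum(w) ** 2 / np.sum(w ** 2))

def switch_dr(rewards, pi_e, pi_b, q_logged, v_model, tau):
    """Switch-DR sketch: doubly robust estimate whose importance-weighted
    correction term is switched off whenever the weight exceeds tau."""
    w = pi_e / pi_b
    keep = (w <= tau).astype(float)  # 1 where the correction is trusted
    return float(np.mean(v_model + keep * w * (rewards - q_logged)))
```

As a sanity check, when the target and behavior policies coincide the weights are all 1, so IPS and SNIPS both reduce to the empirical mean reward and the ESS equals the sample size; with a zero reward model and an unbounded threshold, Switch-DR reduces to IPS.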