Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (WSDM 2018)
DOI: 10.1145/3159652.3159687

Offline A/B Testing for Recommender Systems

Abstract: Online A/B testing evaluates the impact of a new technology by running it in a real production environment and testing its performance on a subset of the users of the platform. It is a well-known practice to run a preliminary offline evaluation on historical data to iterate faster on new ideas, and to detect poor policies in order to avoid losing money or breaking the system. For such offline evaluations, we are interested in methods that can compute offline an estimate of the potential uplift of performance g…
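The abstract describes estimating, from logged production data alone, how a candidate policy would have performed relative to the deployed one. A minimal sketch of such a counterfactual estimate using plain inverse propensity scoring (IPS) is given below; the function and variable names are illustrative assumptions, not the paper's API, and the paper itself studies more refined estimators.

```python
# Illustrative sketch only (assumed names, not the paper's estimators):
# a plain inverse-propensity-scored (IPS) estimate of the reward a candidate
# policy would have collected on traffic logged under the production policy.
import numpy as np

def ips_estimate(rewards, logging_probs, target_probs):
    """IPS estimate of the candidate policy's average reward.

    rewards       -- observed reward for each logged recommendation
    logging_probs -- probability the logging policy assigned to the logged action
    target_probs  -- probability the candidate policy assigns to that same action
    """
    weights = target_probs / logging_probs    # importance weights
    return float(np.mean(weights * rewards))  # unbiased but potentially high-variance

def estimated_uplift(rewards, logging_probs, target_probs):
    """Offline proxy for the A/B-test uplift: candidate estimate minus logged average."""
    return ips_estimate(rewards, logging_probs, target_probs) - float(np.mean(rewards))
```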


Cited by 153 publications (121 citation statements)
References 20 publications
“…Here we use weight truncation [31], w_T(w) = min(w, w_0), where weights are truncated as they cross a threshold w_0. While many other works [15, 20] have focused on comparing a wide variety of counterfactual estimators (e.g., weight dropping, truncation, model-based and doubly-robust variants), we opted not to do this here, as our main focus is on the evaluation of the impact of identity data, and as we were not able to follow up with the kind of extensive A/B testing that would be required to definitively judge the relative performance of various estimators. We did do our analysis twice, once using the weight truncation mentioned above, and once using weight dropping [5], where weights are set to zero upon crossing a threshold w_0, finding estimates that were consistent with w_T and of similar variance (these are omitted for clarity in what follows).…”
Section: Inverse-Propensity Weighting
confidence: 99%
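The excerpt contrasts two ways of controlling large importance weights: truncation (cap each weight at w_0) and dropping (zero out weights above w_0). A short sketch of both variants, assuming precomputed importance weights and per-example rewards (all names illustrative), follows.

```python
# Sketch of the two weight-handling variants contrasted in the excerpt; the
# threshold w0 and the input arrays are assumptions for illustration.
import numpy as np

def truncated_ips(rewards, weights, w0):
    """Weight truncation: w_T(w) = min(w, w0), then average the weighted rewards."""
    return float(np.mean(np.minimum(weights, w0) * rewards))

def dropped_ips(rewards, weights, w0):
    """Weight dropping: weights above w0 are set to zero instead of capped."""
    kept = np.where(weights <= w0, weights, 0.0)
    return float(np.mean(kept * rewards))
```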
“…To evaluate RL methods against ground truth, a straightforward way is to evaluate the learned policy through an online A/B test, which, however, could be prohibitively expensive and may hurt user experience. As suggested by [6, 7], a specific off-policy estimator, NCIS [25], is employed to evaluate the performance of different recommender agents. The step-wise variant of NCIS is defined as ρ̂_{t_1:t_2} = min{c, …”
Section: Evaluation Setting
confidence: 99%
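For reference, here is a hedged sketch of a normalized capped importance sampling (NCIS) style estimator of the kind abbreviated above: importance weights are capped at a constant c and the estimate is self-normalized by the sum of capped weights. The exact step-wise formulation in the cited work may differ; names are illustrative.

```python
# Hedged sketch of a normalized capped importance sampling (NCIS) style
# estimator: cap the importance weights at c, then self-normalize. The exact
# step-wise weighting used in the cited work may differ.
import numpy as np

def ncis_estimate(rewards, weights, c):
    """Self-normalized reward estimate with importance weights capped at c."""
    capped = np.minimum(weights, c)
    return float(np.sum(capped * rewards) / np.sum(capped))
```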
“…Counterfactual evaluation [5, 11, 20] uses causal inference techniques, often inverse propensity scoring, to estimate how users would have responded to a different recommender algorithm. However, it is difficult to apply to commonly used data sets and does not yield insight into the reliability of existing evaluations.…”
Section: Related Work
confidence: 99%