Offline A/B Testing for Recommender Systems

Gilotte, Alexandre; Calauzènes, Clément; Nedelec, Thomas; Alexandre, A.; Dollé, Simon

doi:10.1145/3159652.3159687

Cited by 153 publications

(121 citation statements)

References 20 publications

Supporting

Mentioning

121

Contrasting

Order By: Relevance

“…We thus Here we use weight truncation [31], w T (w) = min(w, w 0 ), where weights are truncated as they cross a threshold w 0 . While many other works [15,20] have focused on comparing a wide variety of counterfactual estimators (e.g., weight dropping, truncation, model-based and doubly-robust variants) we opted not to do this here, as our main focus is on the evaluation of the impact of identity data, and as we were not able to follow up with the kind of extensive A/B testing that would be required to definitively judge the relative performance of various estimators. We did do our analysis twice, once using the weight truncation mentioned above, and once using weight dropping [5], where weights are set to zero upon crossing a threshold w 0 , finding estimates that were consistent with w T and of similar variance (these are omitted for clarity in what follows).…”

Section: Inverse-propensity Weightingmentioning

confidence: 99%

Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling

Cotta

Jiang

et al. 2019

Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining

View full text Add to dashboard Cite

We evaluate the impact of probabilistically-constructed digital identity data collected between Sep. 2017 and Dec. 2017, approximately, in the context of Lookalike-targeted campaigns. The backbone of this study is a large set of probabilistically-constructed "identities", represented as small bags of cookies and mobile ad identifiers with associated metadata, that are likely all owned by the same underlying user. The identity data allows to generate "identity-based", rather than "identifier-based", user models, giving a fuller picture of the interests of the users underlying the identifiers.We employ off-policy evaluation techniques to evaluate the potential of identity-powered lookalike models without incurring the risk of allowing untested models to direct large amounts of ad spend or the large cost of performing A/B tests. We add to historical work on off-policy evaluation by noting a significant type of "finite-sample bias" that occurs for studies combining modestlysized datasets and evaluation metrics based on ratios involving rare events (e.g., conversions). We illustrate this bias using a simulation study that later informs the handling of inverse propensity weights in our analyses on real data.We demonstrate significant lift in identity-powered lookalikes versus an identity-ignorant baseline: on average ∼70% lift in conversion rate, CVR, with a concordant drop in cost-per-acquisition, CPA.

show abstract

Section: Inverse-propensity Weightingmentioning

confidence: 99%

Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling

Cotta

Jiang

et al. 2019

Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining

View full text Add to dashboard Cite

show abstract

“…To perform evaluation of RL methods on ground-truth, a straightforward way is to evaluate the learned policy through online A/B test, which, however, could be prohibitively expensive and may hurt user experiences. Suggested by [6,7], a specific off-policy estimator NCIS [25] is employed to evaluate the performance of different recommender agents. The step-wise variant of NCIS is defined aŝ ρ t 1 :t 2 = min{c,…”

Section: Evaluation Settingmentioning

confidence: 99%

Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems

Zou

Xia

Ding

et al. 2019

Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery &Amp; Data Mining

202

120

View full text Add to dashboard Cite

Recommender systems play a crucial role in our daily lives. Feed streaming mechanism has been widely used in the recommender system, especially on the mobile Apps. The feed streaming setting provides users the interactive manner of recommendation in never-ending feeds. In such a manner, a good recommender system should pay more attention to user stickiness, which is far beyond classical instant metrics and typically measured by long-term user engagement. Directly optimizing long-term user engagement is a non-trivial problem, as the learning target is usually not available for conventional supervised learning methods. Though reinforcement learning (RL) naturally fits the problem of maximizing the long term rewards, applying RL to optimize long-term user engagement is still facing challenges: user behaviors are versatile to model, which typically consists of both instant feedback (e.g., clicks) and delayed feedback (e.g., dwell time, revisit); in addition, performing effective off-policy learning is still immature, especially when combining bootstrapping and function approximation.To address these issues, in this work, we introduce a RL framework -FeedRec to optimize the long-term user engagement. Fee-dRec includes two components: 1) a Q-Network which designed in hierarchical LSTM takes charge of modeling complex user behaviors, and 2) a S-Network, which simulates the environment, assists the Q-Network and voids the instability of convergence in policy learning. Extensive experiments on synthetic data and a real-world large scale data show that FeedRec effectively optimizes the longterm user engagement and outperforms state-of-the-arts. CCS CONCEPTS• Information systems → Recommender systems; Personalization; • Theory of computation → Sequential decision making. * Work performed during an internship at JD.com.

show abstract

“…Counterfactual evaluation [5,11,20] uses causal inference techniques -often inverse propensity scoring -to estimate how users would have responded to a different recommender algorithm. However, it is difficult to apply to commonly-used data sets and does not yield insight into the reliability of existing evaluations.…”

Section: Related Workmentioning

confidence: 99%

Estimating Error and Bias in Offline Evaluation Results

Tian

Ekstrand

2020

Proceedings of the 2020 Conference on Human Information Interaction and Retrieval

View full text Add to dashboard Cite

Offline evaluations of recommender systems attempt to estimate users' satisfaction with recommendations using static data from prior user interactions. These evaluations provide researchers and developers with first approximations of the likely performance of a new system and help weed out bad ideas before presenting them to users. However, offline evaluation cannot accurately assess novel, relevant recommendations, because the most novel items were previously unknown to the user, so they are missing from the historical data and cannot be judged as relevant.We present a simulation study to estimate the error that such missing data causes in commonly-used evaluation metrics in order to assess its prevalence and impact. We find that missing data in the rating or observation process causes the evaluation protocol to systematically mis-estimate metric values, and in some cases erroneously determine that a popularity-based recommender outperforms even a perfect personalized recommender. Substantial breakthroughs in recommendation quality, therefore, will be difficult to assess with existing offline techniques. CCS CONCEPTS• Information systems → Evaluation of retrieval results.

show abstract

Offline A/B Testing for Recommender Systems

Cited by 153 publications

References 20 publications

Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling

Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling

Reinforcement Learning to Optimize Long-term User Engagement in Recommender Systems

Estimating Error and Bias in Offline Evaluation Results

Contact Info

Product

Resources

About