Offline Evaluation of Ranking Policies with Click Models

Li, Shuai; Abbasi-Yadkori, Yasin; Kveton, Branislav; Muthukrishnan, S.; Vinay, V.; Wen, Zheng

doi:10.1145/3219819.3220028

Cited by 44 publications

(48 citation statements)

References 33 publications


“…In other cases, simplifying assumptions can be deployed to reduce the variance of IPS estimators in slate recommendation. Li et al assume that the reward for each sub-action is independent of other sub-actions in the slate [15]; this is a typical assumption to greatly reduce the size of the action space that is used in practical applications [5]. To simplify slate reward estimation, Swaminathan et al take an additive approach to the slate reward and assume that the sub-action rewards are unobserved and independent [26].…”

Section: Traditional Methods

mentioning

confidence: 99%


Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions

McInerney, Brost, Chandar, et al. 2020

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining


Users of music streaming, video streaming, news recommendation, and e-commerce services often engage with content in a sequential manner. Providing and evaluating good sequences of recommendations is therefore a central problem for these services. Prior reweighting-based counterfactual evaluation methods either suffer from high variance or make strong independence assumptions about rewards. We propose a new counterfactual estimator that allows for sequential interactions in the rewards with lower variance in an asymptotically unbiased manner. Our method uses graphical assumptions about the causal relationships of the slate to reweight the rewards in the logging policy in a way that approximates the expected sum of rewards under the target policy. Extensive experiments in simulation and on a live recommender system show that our approach outperforms existing methods in terms of bias and data efficiency for the sequential track recommendations problem.


“…Independent IPS (IIPS): Proposed by Li et al[15], the estimator treats the reward at each position independently. The average reward of the target policy is estimated using Eq.…”

mentioning

confidence: 99%

Counterfactual Evaluation of Slate Recommendations with Sequential Reward Interactions

McInerney, Brost, Chandar, et al. 2020

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
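The Independent IPS idea quoted above (treating the reward at each position independently) can be sketched as follows. This is a minimal illustration, not the estimator as defined in either paper: the equation referenced in the quote is elided, so the assumption that both policies factorize into per-position item probabilities, the hypothetical `PositionPolicy` interface, and all names and toy data below are assumptions made for the example.

```python
from typing import Callable, Sequence

# One logged interaction: the items shown and the reward (e.g. a click) at each position.
LoggedSlate = tuple[Sequence[str], Sequence[float]]
# Hypothetical interface: probability that a policy places `item` at `position`.
PositionPolicy = Callable[[str, int], float]


def independent_ips(
    logs: Sequence[LoggedSlate],
    target: PositionPolicy,
    logging: PositionPolicy,
) -> float:
    """Average per-slate reward of `target`, estimated from `logging` data.

    Each position's reward is reweighted by its own importance weight,
    i.e. rewards are assumed independent across positions.
    """
    if not logs:
        raise ValueError("need at least one logged slate")
    total = 0.0
    for items, rewards in logs:
        for position, (item, reward) in enumerate(zip(items, rewards)):
            mu = logging(item, position)
            if mu <= 0.0:
                raise ValueError("logging policy must support every logged item")
            total += target(item, position) / mu * reward
    return total / len(logs)


# Toy data (assumed): the logging policy places "a" or "b" uniformly at each position.
def uniform_logging(item: str, position: int) -> float:
    return 0.5


# Toy target policy (assumed): always shows "a" first and "b" second.
def fixed_target(item: str, position: int) -> float:
    return 1.0 if (item, position) in {("a", 0), ("b", 1)} else 0.0


toy_logs: list[LoggedSlate] = [
    (["a", "b"], [1.0, 0.0]),  # click on "a" at position 0
    (["b", "a"], [0.0, 1.0]),  # click on "a" at position 1
]
estimate = independent_ips(toy_logs, fixed_target, uniform_logging)
print(estimate)  # 1.0
```

In the toy data only the first slate matches the target policy at a clicked position, so its click is counted with weight 1.0 / 0.5 = 2, and averaging over the two logged slates gives 1.0.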


“…Offline unbiased estimator. Since the historical logged data is generated by a logging production policy, we propose an unbiased offline estimator, User Browsing Inverse Propensity Scoring (UBM-IPS) estimator, to evaluate the performance inspired by Li et al [15]. The idea is to estimate users' clicks on various positions based on UBM model and the detail of reduction is at Appendix.…”

Section: Metrics and Evaluation Methods

mentioning

confidence: 99%

Contextual User Browsing Bandits for Large-Scale Online Mobile Recommendation

et al. 2020

Fourteenth ACM Conference on Recommender Systems


Online recommendation services recommend multiple commodities to users. Nowadays, a considerable proportion of users visit e-commerce platforms by mobile devices. Due to the limited screen size of mobile devices, positions of items have a significant influence on clicks: 1) Higher positions lead to more clicks for one commodity. 2) The 'pseudo-exposure' issue: Only a few recommended items are shown at first glance and users need to slide the screen to browse other items. Therefore, some recommended items ranked behind are not viewed by users and it is not proper to treat this kind of items as negative samples. While many works model the online recommendation as contextual bandit problems, they rarely take the influence of positions into consideration and thus the estimation of the reward function may be biased. In this paper, we aim at addressing these two issues to improve the performance of online mobile recommendation. Our contributions are four-fold. First, since we concern the reward of a set of recommended items, we model the online recommendation as a contextual combinatorial bandit problem and define the reward of a recommended set. Second, we propose a novel contextual combinatorial bandit method called UBM-LinUCB to address two issues related to positions by adopting the User Browsing Model (UBM), a click model for web search. Third, we provide a formal regret analysis and prove that our algorithm achieves sublinear regret independent of the number of items. Finally, we evaluate our algorithm on two real-world datasets by a novel unbiased estimator. An online experiment is also implemented in Taobao, one of the most popular e-commerce platforms in the world. Results on two CTR metrics show that our algorithm outperforms the other contextual bandit algorithms.


“…Komiyama et al [19] and subsequently Lagrée et al [20] use such propensities to find the optimal ranking for a single query by casting the ranking problem as a multiple-play bandit. Li et al [21] use similar propensities to counterfactually evaluate ranking policies where they estimate the number of clicks a ranking policy will receive. Our policy-aware approach contrasts with these existing methods by providing an unbiased estimate of LTR-metric-based losses, and thus it can be used to optimize LTR models similar to supervised LTR.…”

Section: Related Work

mentioning

confidence: 99%

Policy-Aware Unbiased Learning to Rank for Top-k Rankings

Oosterhuis, Rijke 2020

Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval


Counterfactual Learning to Rank (LTR) methods optimize ranking systems using logged user interactions that contain interaction biases. Existing methods are only unbiased if users are presented with all relevant items in every ranking. There is currently no existing counterfactual unbiased LTR method for top-k rankings. We introduce a novel policy-aware counterfactual estimator for LTR metrics that can account for the effect of a stochastic logging policy. We prove that the policy-aware estimator is unbiased if every relevant item has a non-zero probability to appear in the top-k ranking. Our experimental results show that the performance of our estimator is not affected by the size of k: for any k, the policy-aware estimator reaches the same retrieval performance while learning from top-k feedback as when learning from feedback on the full ranking. Lastly, we introduce novel extensions of traditional LTR methods to perform counterfactual LTR and to optimize top-k metrics. Together, our contributions introduce the first policy-aware unbiased LTR approach that learns from top-k feedback and optimizes top-k metrics. As a result, counterfactual LTR is now applicable to the very prevalent top-k ranking setting in search and recommendation.
