2022
DOI: 10.48550/arxiv.2201.01534
Preprint
Reinforcement Online Learning to Rank with Unbiased Reward Shaping

Abstract: Online learning to rank (OLTR) aims to learn a ranker directly from implicit feedback derived from users' interactions, such as clicks. Clicks however are a biased signal: specifically, top-ranked documents are likely to attract more clicks than documents down the ranking (position bias). In this paper, we propose a novel learning algorithm for OLTR that uses reinforcement learning to optimize rankers: Reinforcement Online Learning to Rank (ROLTR). In ROLTR, the gradients of the ranker are estimated based on t…

Cited by 1 publication (2 citation statements)
References 50 publications
“…This is because it is not straightforward to ascertain why a user has not clicked on a passage: is it because the passage is not relevant or because the user did not examine the passage? Nevertheless, we realize that there have been recent attempts to de-bias unclick signals that have used IPS-like algorithms [43,49]: these can potentially be integrated into our CoRocchio algorithm to further leverage the user's negative implicit feedback for dense retrievers. We leave this to future exploration.…”
Section: Dealing With Unseen Queries In CoRocchio
confidence: 99%
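The de-biasing the statement refers to is typically done with inverse propensity scoring (IPS): each observed click is reweighted by the estimated probability that the user examined that rank, so that position bias cancels in expectation. A minimal generic sketch (not the exact algorithm of [43,49]; the propensity values and function names here are illustrative assumptions):

```python
import numpy as np

def ips_weighted_clicks(clicks, positions, propensities):
    """Reweight clicks by inverse examination propensity.

    clicks: binary array, 1 if the document shown at that rank was clicked.
    positions: 0-based ranks at which the documents were displayed.
    propensities: estimated probability a user examines each rank.

    Returns a relevance signal that is unbiased in expectation,
    assuming the propensity estimates are correct.
    """
    clicks = np.asarray(clicks, dtype=float)
    p = np.asarray(propensities, dtype=float)[np.asarray(positions)]
    return clicks / p

# Illustrative position bias: examination probability decays with rank.
propensities = 1.0 / np.arange(1, 6)   # [1, 1/2, 1/3, 1/4, 1/5]
clicks = [1, 0, 1, 0, 0]
weights = ips_weighted_clicks(clicks, [0, 1, 2, 3, 4], propensities)
# a click at rank 3 is upweighted relative to a click at rank 1,
# compensating for the lower chance it was examined at all
```

Extending this idea to non-clicks (as in the works cited) additionally requires reasoning about why a document was *not* clicked, which is the difficulty the statement above points out.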
“…To fully control users' biases and noise so that algorithms can be tested under different, controllable conditions, it is common for research in online LTR and counterfactual LTR to simulate users' clicks by relying on the relevance labels recorded in offline datasets [16,19,37,39,49,50]. Following this practice, we use a click model to simulate click behaviour and create a synthetic click log.…”
Section: Synthetic User Behavior
confidence: 99%
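The simulation setup described above can be sketched as a cascade-style click model driven by offline relevance labels: the simulated user scans the ranking top-down, clicks with a probability tied to the document's graded label, and may stop after a satisfying click. This is a generic illustration of the common practice the statement describes, not the exact configuration of any one of the cited papers; the probability tables below are assumed values:

```python
import numpy as np

def simulate_clicks(relevance_labels, click_probs, stop_probs, rng):
    """Simulate one user session over a ranked list.

    relevance_labels: graded labels (e.g. 0, 1, 2) in ranked order.
    click_probs: per-grade probability of clicking an examined document.
    stop_probs: per-grade probability of stopping after a click.
    """
    clicks = []
    for label in relevance_labels:
        clicked = rng.random() < click_probs[label]
        clicks.append(int(clicked))
        if clicked and rng.random() < stop_probs[label]:
            break  # user is satisfied and stops examining the list
    # documents below the stopping point are never examined, hence unclicked
    clicks += [0] * (len(relevance_labels) - len(clicks))
    return clicks

rng = np.random.default_rng(0)
# assumed "noisy" user: sometimes clicks partially relevant documents
click_probs = {0: 0.0, 1: 0.5, 2: 1.0}
stop_probs = {0: 0.0, 1: 0.0, 2: 0.0}
log = simulate_clicks([2, 0, 1, 2, 0], click_probs, stop_probs, rng)
```

Repeating this over many simulated queries yields a synthetic click log whose bias and noise characteristics are fully controlled by the chosen probability tables, which is exactly why this setup is popular for evaluating online and counterfactual LTR algorithms.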