A Survey on Reinforcement Learning for Recommender Systems

Lin, Yuanguo; Liu, Yong; Lin, Fan; Wu, Pengcheng; Zeng, Wenhua; Chen, Miao

doi:10.48550/arxiv.2109.10665

Cited by 4 publications

(4 citation statements)

References 124 publications


“…Finally, we define the negative sampling and rewards that are suitable for this MMIR scenario (Section 3.3). [1,9,22,32]. In this scenario, the users' interactions with the recommended items (actions) are returned as feedback (the so-called observations from the environments, such as views, clicks, skips, purchases, and ratings) to the recommendation agents, which usually convert the users' feedback into a reward signal [22].…”

Section: The GOMMIR Model (mentioning)

confidence: 99%

Goal-Oriented Multi-Modal Interactive Recommendation with Verbal and Non-Verbal Relevance Feedback

Wu, Macdonald, Ounis 2023

Proceedings of the 17th ACM Conference on Recommender Systems


Interactive recommendation enables users to provide verbal and non-verbal relevance feedback (such as natural-language critiques and likes/dislikes) when viewing a ranked list of recommendations (such as images of fashion products), in order to guide the recommender system towards their desired items (i.e. goals) across multiple interaction turns. Such a multi-modal interactive recommendation (MMIR) task has been successfully formulated with deep reinforcement learning (DRL) algorithms by simulating the interactions between an environment (i.e. a user) and an agent (i.e. a recommender system). However, it is typically challenging and unstable to optimise the agent to improve the recommendation quality associated with implicit learning of multi-modal representations in an end-to-end fashion in DRL. This is known as the coupling of policy optimisation and representation learning. To address this coupling issue, we propose a novel goal-oriented multi-modal interactive recommendation model (GOMMIR) that uses both verbal and non-verbal relevance feedback to effectively incorporate the users' preferences over time. Specifically, our GOMMIR model employs a multi-task learning approach to explicitly learn the multi-modal representations using a multi-modal composition network when optimising the recommendation agent. Moreover, we formulate the MMIR task using goal-oriented reinforcement learning and enhance the optimisation objective by leveraging non-verbal relevance feedback for hard negative sampling and providing extra goal-oriented rewards to effectively optimise the recommendation agent. Following previous work, we train and evaluate our GOMMIR model by using user simulators that can generate natural-language feedback about the recommendations as a surrogate for real human users. 
Experiments conducted on four well-known fashion datasets demonstrate that our proposed GOMMIR model yields significant improvements in comparison to the existing state-of-the-art baseline models.
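The citation statement above notes that recommendation agents usually convert user feedback (views, clicks, skips, purchases, ratings) into a reward signal. A minimal sketch of such a conversion follows. The numeric reward values, the 1-5 rating scale, and the function names are illustrative assumptions, not taken from the GOMMIR paper or the survey.

```python
from typing import Iterable, Optional

# Hypothetical reward values, chosen only for illustration.
FEEDBACK_REWARD = {
    "purchase": 1.0,
    "click": 0.5,
    "view": 0.1,
    "skip": -0.2,
}


def feedback_to_reward(feedback: str, rating: Optional[float] = None) -> float:
    """Map one observed feedback event to a scalar reward."""
    if feedback == "rating":
        if rating is None:
            raise ValueError("a rating event needs a rating value")
        # Assumed 1-5 star scale, rescaled linearly to [-1, 1].
        return (rating - 3.0) / 2.0
    return FEEDBACK_REWARD[feedback]


def episode_return(events: Iterable[str], gamma: float = 0.9) -> float:
    """Discounted sum of rewards over one interaction session."""
    return sum(gamma ** t * feedback_to_reward(e) for t, e in enumerate(events))
```

For example, `episode_return(["view", "click", "purchase"])` scores a session that ends in a purchase, with later events discounted by `gamma`.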



“…As the tool for optimizing the long-term/delayed metrics [24], reinforcement learning (RL) has been widely studied for optimizing user retention in recent years [6]. Though they are capable of exploring and modeling users' dynamic interests [39], existing RL-based SRSs leave much to be desired due to the offline learning challenge.…”

Section: Introduction (mentioning)

confidence: 99%

User Retention-oriented Recommendation with Decision Transformer

Zhao, Zou, Zhao, et al. 2023

Proceedings of the ACM Web Conference 2023


Improving user retention with reinforcement learning (RL) has attracted increasing attention due to its significant importance in boosting user engagement. However, training the RL policy from scratch without hurting users' experience is unavoidable due to the requirement of trial-and-error searches. Furthermore, the offline methods, which aim to optimize the policy without online interactions, suffer from the notorious stability problem in value estimation or unbounded variance in counterfactual policy evaluation. To this end, we propose optimizing user retention with Decision Transformer (DT), which avoids the offline difficulty by translating the RL as an autoregressive problem. However, deploying the DT in recommendation is a non-trivial problem because of the following challenges: (1) deficiency in modeling the numerical reward value; (2) data discrepancy between the policy learning and recommendation generation; (3) unreliable offline performance evaluation. In this work, we, therefore, contribute a series of strategies for tackling the exposed issues. We first articulate an efficient reward prompt by weighted aggregation of meta embeddings for informative reward embedding. Then, we endow a weighted contrastive learning method to solve the discrepancy between training and inference. Furthermore, we design two robust offline metrics to measure user retention. Finally, the significant improvement in the benchmark datasets demonstrates the superiority of the proposed method. The implementation code is available at https://github.com/kesenzhao/DT4Rec.git.


“…With the development of interactive recommender systems (RSs), reinforcement learning for recommendation (RL4Rec) is receiving increased attention as reinforcement learning (RL) methods can quickly adapt to user feedback [2,32]. RL4Rec has been applied in a variety of domains, such as movie [60,62], news [68], and music recommendations [41].…”

Section: Introduction (mentioning)

confidence: 99%

State Encoders in Reinforcement Learning for Recommendation: A Reproducibility Study

Huang, Oosterhuis, Cetinkaya, et al. 2022

Preprint


Methods for reinforcement learning for recommendation (RL4Rec) are increasingly receiving attention as they can quickly adapt to user feedback. A typical RL4Rec framework consists of (1) a state encoder to encode the state that stores the users' historical interactions, and (2) an RL method to take actions and observe rewards. Prior work compared four state encoders in an environment where user feedback is simulated based on real-world logged user data. An attention-based state encoder was found to be the optimal choice as it reached the highest performance. However, this finding is limited to the actor-critic method, four state encoders, and evaluation simulators that do not debias logged user data. In response to these shortcomings, we reproduce and expand on the existing comparison of attention-based state encoders (1) in the publicly available debiased RL4Rec SOFA simulator with (2) a different RL method, (3) more state encoders, and (4) a different dataset. Importantly, our experimental results indicate that existing findings do not generalize to the debiased SOFA simulator generated from a different dataset and a Deep Q-Network (DQN)-based method when compared with more state encoders.
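To make the state-encoder component concrete, the sketch below contrasts a mean-pooling encoder with a simple attention-based one over a user's interaction history. It is only an illustration: the function names are hypothetical, and the fixed `query` vector stands in for parameters that a real model would learn. It is not the architecture evaluated in the study.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pool_encoder(history: Sequence[Vector]) -> Vector:
    """Encode the state as the average of the interacted items' embeddings."""
    dim = len(history[0])
    return [sum(v[i] for v in history) / len(history) for i in range(dim)]


def attention_encoder(history: Sequence[Vector], query: Vector) -> Vector:
    """Encode the state as a softmax-weighted sum of item embeddings.

    Each weight is softmax(dot(item, query)); `query` is an assumed
    stand-in for learned attention parameters.
    """
    scores = [sum(a * b for a, b in zip(v, query)) for v in history]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(history[0])
    return [sum(w * v[i] for w, v in zip(weights, history)) for i in range(dim)]
```

The resulting state vector would then be passed to the RL method (for instance an actor-critic or DQN-based agent) to select the next recommendation. With a zero `query`, the attention weights are uniform and the two encoders coincide.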

