Advantage Amplification in Slowly Evolving Latent-State Environments

Mladenov, Martin; Meshi, Ofer; Ooi, Jayden; Schuurmans, Dale; Boutilier, Craig

doi:10.48550/arxiv.1905.13559

Cited by 2 publications

(2 citation statements)

References 0 publications


“…Although we could not enjoy advantage in the simulator's efficiency, maintaining move still facilitates training and this is solely because maintaining move decision increases the influence of a single move decision, as we will confirm with experiments. In this sense, maintaining move rather can be viewed as 'amplifying advantage' from (Mladenov et al 2019).…”

Section: Maintaining Move Action (mentioning)

confidence: 99%

Creating Pro-Level AI for a Real-Time Fighting Game Using Deep Reinforcement Learning

Rho, Moon, et al. 2022

IEEE Trans. Games


Reinforcement learning combined with deep neural networks has performed remarkably well in many genres of games recently. It has surpassed human-level performance in fixed game environments and turn-based two player board games. However, to the best of our knowledge, current research has yet to produce a result that has surpassed human-level performance in modern complex fighting games. This is due to the inherent difficulties with real-time fighting games, including: vast action spaces, action dependencies, and imperfect information. We overcame these challenges and made 1v1 battle AI agents for the commercial game "Blade & Soul". The trained agents competed against five professional gamers and achieved a win rate of 62%. This paper presents a practical reinforcement learning method that includes a novel self-play curriculum and data skipping techniques. Through the curriculum, three different styles of agents were created by reward shaping and were trained against each other. Additionally, this paper suggests data skipping techniques that could increase data efficiency and facilitate explorations in vast spaces. Since our method can be generally applied to all two-player competitive games with vast action spaces, we anticipate its application to game development including level design and automated balancing.


“…The underlying MDP One could cast the recommendation problem as a POMDP (Lu & Yang, 2016;Mladenov et al, 2019) in which the state of the environment is hidden and contains the user's internal state, which evolves over time. Equivalently, one can consider the belief-MDP induced by the recommender POMDP (Kaelbling et al, 1998), and approximate a solution to such belief-MDP via Deep-RL with a policy trained with observation histories as input (this is theoretically sufficient for the policy to recover a belief over the current hidden state and take the optimal action).…”

Section: F Computing Metrics (mentioning)

confidence: 99%

Estimating and Penalizing Induced Preference Shifts in Recommender Systems

Carroll, Hadfield-Menell, Russell, et al. 2022

Preprint


The content that a recommender system (RS) shows to users influences them. Therefore, when choosing which recommender to deploy, one is implicitly also choosing to induce specific internal states in users. Moreover, systems trained via long-horizon optimization will have direct incentives to manipulate users, e.g. to shift their preferences so they are easier to satisfy. In this work we focus on induced preference shifts in users. We argue that, before deployment, system designers should: estimate the shifts a recommender would induce; evaluate whether such shifts would be undesirable; and even actively optimize to avoid problematic shifts. These steps involve two challenging ingredients. Estimation requires anticipating how hypothetical policies would influence user preferences if deployed; we do this by using historical user interaction data to train a predictive user model that implicitly contains their preference dynamics. Evaluation and optimization additionally require metrics to assess whether such influences are manipulative or otherwise unwanted; we use the notion of "safe shifts", which define a trust region within which behavior is safe. In simulated experiments, we show that our learned preference dynamics model is effective in estimating user preferences and how they would respond to new recommenders. Additionally, we show that recommenders that optimize for staying in the trust region can avoid manipulative behaviors while still generating engagement.
