2020
DOI: 10.48550/arxiv.2007.14430
Preprint

Munchausen Reinforcement Learning

Nino Vieillard, Olivier Pietquin, Matthieu Geist

Abstract: Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transiting state by their current estimate of this value. Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution lies in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods …
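To make the abstract's core idea concrete, the sketch below shows a DQN-style bootstrap target whose immediate reward is augmented with the scaled log-policy of the taken action. This is a minimal NumPy illustration under stated assumptions, not the paper's reference implementation: the softmax policy derived from the target network's Q-values, the temperature tau, the scale alpha, the clipping bound l0, and the soft bootstrap on the next state are assumed details that go beyond what the truncated abstract states.

import numpy as np

def softmax_policy(q, tau):
    # Stable softmax over actions: pi(a|s) proportional to exp(q(s,a)/tau).
    z = (q - q.max(axis=-1, keepdims=True)) / tau
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def munchausen_dqn_target(r, a, q_t, q_tp1, done,
                          gamma=0.99, alpha=0.9, tau=0.03, l0=-1.0):
    # r: rewards (B,); a: taken actions (B,); done: terminal flags (B,)
    # q_t, q_tp1: target-network Q-values at s_t and s_{t+1}, shape (B, A)
    pi_t = softmax_policy(q_t, tau)
    pi_tp1 = softmax_policy(q_tp1, tau)
    log_pi_t = np.log(pi_t[np.arange(len(a)), a] + 1e-8)
    # Munchausen bonus: scaled, clipped log-policy added to the reward.
    bonus = alpha * np.clip(tau * log_pi_t, l0, 0.0)
    # Soft bootstrap: expected next-state Q minus the log-policy term.
    soft_v_tp1 = (pi_tp1 * (q_tp1 - tau * np.log(pi_tp1 + 1e-8))).sum(axis=-1)
    return r + bonus + (1.0 - done) * gamma * soft_v_tp1

# Usage: a batch of 2 transitions with 3 actions; these targets replace the
# usual max-based DQN target in the standard squared regression loss.
rng = np.random.default_rng(0)
targets = munchausen_dqn_target(
    r=np.array([1.0, 0.0]), a=np.array([0, 2]),
    q_t=rng.normal(size=(2, 3)), q_tp1=rng.normal(size=(2, 3)),
    done=np.array([0.0, 1.0]))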

Cited by 5 publications (8 citation statements)
References 6 publications
“…Our error propagation analysis is close in spirit to that of Scherrer et al. (2015), recently extended to entropy-regularized approximate dynamic programming algorithms by Geist et al. (2019), Vieillard et al. (2020a), and Vieillard et al. (2020b). One major difference between our approaches is that their guarantees depend on the ℓp norms of the policy evaluation errors, but still optimize squared-Bellman-error-like quantities that only serve as proxy for these errors.…”
Section: Introduction (supporting)
confidence: 61%
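To make that distinction concrete with illustrative notation (a schematic reading, not either paper's exact statement): such guarantees are phrased in terms of a norm of the per-iteration evaluation error, e.g. $\|\epsilon_k\|_p$ with $\epsilon_k = Q_{k+1} - T^{\pi_{k+1}} Q_k$, whereas the quantity minimized in practice is an empirical squared Bellman residual such as $\frac{1}{N}\sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma \, \langle \pi(\cdot \mid s_i'), \bar{Q}(s_i', \cdot) \rangle \big)^2$, which only serves as a proxy for $\|\epsilon_k\|_p$ under the sampling distribution.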

Logistic Q-Learning. Bas-Serrano, Curi, Krause et al., 2020. Preprint.
“…Recently, techniques to reduce the computational complexity of searching the game tree within the deep RL algorithm, and methods to make players' moves explainable to a human, have also been researched [19][20][21][22]. These approaches could be used to complement the techniques in this paper to build faster and more human-interpretable techniques that can play HOTP, while addressing the players' limited observability of each other's moves and the asymmetric nature of the game.…”
Section: Related Work (mentioning)
confidence: 99%
“…SAC is an off-policy actor-critic algorithm, with the specificity of using regularization (hence the adjective soft). Regularization has been a field of interest in RL research since efficient agents that use it were introduced recently [Vieillard et al., 2020a], [Vieillard et al., 2020b]. SAC is also an algorithm of interest when it comes to prioritizing the replay buffer [Wang and Ross, 2019], [Lahire et al., 2021], following what Schaul et al. [2015] initiated with Prioritized Experience Replay.…”
Section: Set-up and Notations (mentioning)
confidence: 99%
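On the replay prioritization mentioned in that statement, here is a minimal NumPy sketch of proportional prioritized sampling in the spirit of Schaul et al. [2015]. The list-backed buffer, the exponents alpha and beta, and the priority update from absolute TD errors are illustrative choices, not the implementation used in any of the cited works.

import numpy as np

class PrioritizedReplay:
    # Proportional prioritized replay: sample transitions with probability
    # proportional to priority**alpha and correct the induced bias with
    # importance-sampling weights controlled by beta.
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.data, self.priorities = [], []

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:  # drop the oldest entry when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        weights = (len(self.data) * p[idx]) ** (-self.beta)
        weights /= weights.max()  # normalize weights for stability
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Refresh priorities from the absolute TD errors of the sampled batch.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + eps

Sampling returns the indices so that priorities can be refreshed with the new TD errors after each gradient step.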