2020
DOI: 10.48550/arxiv.2007.14430
Preprint

Munchausen Reinforcement Learning

Nino Vieillard, Olivier Pietquin, Matthieu Geist

Abstract: Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most algorithms, based on temporal differences, replace the true value of a transiting state by their current estimate of this value. Yet, another estimate could be leveraged to bootstrap RL: the current policy. Our core contribution lies in a very simple idea: adding the scaled log-policy to the immediate reward. We show that slightly modifying Deep Q-Network (DQN) in that way provides an agent that is competitive with distributional methods …
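To make the abstract's core idea concrete, the sketch below shows a DQN-style bootstrap target whose immediate reward is augmented with the scaled log-policy of the taken action. This is a minimal NumPy illustration under stated assumptions, not the paper's reference implementation: the softmax policy derived from the target network's Q-values, the temperature tau, the scale alpha, the clipping bound l0, and the soft bootstrap on the next state are assumed details that go beyond what the truncated abstract states.

import numpy as np

def softmax_policy(q, tau):
    # Stable softmax over actions: pi(a|s) proportional to exp(q(s,a)/tau).
    z = (q - q.max(axis=-1, keepdims=True)) / tau
    p = np.exp(z)
    return p / p.sum(axis=-1, keepdims=True)

def munchausen_dqn_target(r, a, q_t, q_tp1, done,
                          gamma=0.99, alpha=0.9, tau=0.03, l0=-1.0):
    # r: rewards (B,); a: taken actions (B,); done: terminal flags (B,)
    # q_t, q_tp1: target-network Q-values at s_t and s_{t+1}, shape (B, A)
    pi_t = softmax_policy(q_t, tau)
    pi_tp1 = softmax_policy(q_tp1, tau)
    log_pi_t = np.log(pi_t[np.arange(len(a)), a] + 1e-8)
    # Munchausen bonus: scaled, clipped log-policy added to the reward.
    bonus = alpha * np.clip(tau * log_pi_t, l0, 0.0)
    # Soft bootstrap: expected next-state Q minus the log-policy term.
    soft_v_tp1 = (pi_tp1 * (q_tp1 - tau * np.log(pi_tp1 + 1e-8))).sum(axis=-1)
    return r + bonus + (1.0 - done) * gamma * soft_v_tp1

# Usage: a batch of 2 transitions with 3 actions; these targets replace the
# usual max-based DQN target in the standard squared regression loss.
rng = np.random.default_rng(0)
targets = munchausen_dqn_target(
    r=np.array([1.0, 0.0]), a=np.array([0, 2]),
    q_t=rng.normal(size=(2, 3)), q_tp1=rng.normal(size=(2, 3)),
    done=np.array([0.0, 1.0]))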

Cited by 5 publications (8 citation statements)
References 6 publications
“…Our error propagation analysis is close in spirit to that of Scherrer et al. (2015), recently extended to entropy-regularized approximate dynamic programming algorithms by Geist et al. (2019), Vieillard et al. (2020a), and Vieillard et al. (2020b). One major difference between our approaches is that their guarantees depend on the ℓp norms of the policy evaluation errors, but still optimize squared-Bellman-error-like quantities that only serve as proxy for these errors.…”
Section: Introduction (supporting)
confidence: 61%
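To make that distinction concrete with illustrative notation (a schematic reading, not either paper's exact statement): such guarantees are phrased in terms of a norm of the per-iteration evaluation error, e.g. $\|\epsilon_k\|_p$ with $\epsilon_k = Q_{k+1} - T^{\pi_{k+1}} Q_k$, whereas the quantity minimized in practice is an empirical squared Bellman residual such as $\frac{1}{N}\sum_i \big( Q_\theta(s_i, a_i) - r_i - \gamma \, \langle \pi(\cdot \mid s_i'), \bar{Q}(s_i', \cdot) \rangle \big)^2$, which only serves as a proxy for $\|\epsilon_k\|_p$ under the sampling distribution.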

Logistic Q-Learning. Bas-Serrano, Curi, Krause et al., 2020. Preprint.
“…Recently, techniques to reduce the computational complexity of searching the game tree within the deep RL algorithm, and methods to make players' moves explainable to a human, have also been researched [19][20][21][22]. These approaches could be used to complement the techniques in this paper to build faster and more human-interpretable techniques that can play HOTP, while addressing the players' limited observability of each other's moves and the asymmetric nature of the game.…”
Section: Related Work (mentioning)
confidence: 99%
“…SAC is an off-policy actor-critic algorithm, with the specificity of using regularization (hence the adjective soft). Regularization has been a field of interest in RL research since efficient agents that use it were introduced recently [Vieillard et al., 2020a], [Vieillard et al., 2020b]. SAC is also an algorithm of interest when it comes to prioritizing the replay buffer [Wang and Ross, 2019], [Lahire et al., 2021], following what Schaul et al. [2015] initiated with Prioritized Experience Replay.…”
Section: Set-up and Notations (mentioning)
confidence: 99%
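On the replay prioritization mentioned in that statement, here is a minimal NumPy sketch of proportional prioritized sampling in the spirit of Schaul et al. [2015]. The list-backed buffer, the exponents alpha and beta, and the priority update from absolute TD errors are illustrative choices, not the implementation used in any of the cited works.

import numpy as np

class PrioritizedReplay:
    # Proportional prioritized replay: sample transitions with probability
    # proportional to priority**alpha and correct the induced bias with
    # importance-sampling weights controlled by beta.
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.data, self.priorities = [], []

    def add(self, transition, priority=1.0):
        if len(self.data) >= self.capacity:  # drop the oldest entry when full
            self.data.pop(0)
            self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append(priority)

    def sample(self, batch_size):
        p = np.asarray(self.priorities, dtype=np.float64) ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), size=batch_size, p=p)
        weights = (len(self.data) * p[idx]) ** (-self.beta)
        weights /= weights.max()  # normalize weights for stability
        return idx, [self.data[i] for i in idx], weights

    def update_priorities(self, idx, td_errors, eps=1e-6):
        # Refresh priorities from the absolute TD errors of the sampled batch.
        for i, e in zip(idx, td_errors):
            self.priorities[i] = abs(e) + eps

Sampling returns the indices so that priorities can be refreshed with the new TD errors after each gradient step.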