Implicitly Regularized RL with Implicit Q-Values
Preprint, 2021
DOI: 10.48550/arxiv.2108.07041

Abstract: The Q-function is a central quantity in many Reinforcement Learning (RL) algorithms, for which RL agents behave following a (soft)-greedy policy w.r.t. Q. It is a powerful tool that allows action selection without a model of the environment and even without explicitly modeling the policy. Yet, this scheme can only be used in discrete action tasks with a small number of actions, as the softmax cannot be computed exactly otherwise. In particular, the use of function approximation to deal with continuous action …
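The soft-greedy scheme described in the abstract can be made concrete with a short sketch. This is a minimal illustration, not code from the paper; the function name and the temperature parameter tau are assumptions. It shows why the approach presupposes enumerating a finite action set: the softmax normalizer is a sum over all actions.

```python
import numpy as np

def soft_greedy_policy(q_values: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Softmax (Boltzmann) distribution over a *finite* set of actions.

    q_values: shape (num_actions,), the Q(s, a) estimates for one state.
    tau: temperature; tau -> 0 approaches the greedy argmax policy.
    """
    # Subtract the max before exponentiating for numerical stability.
    z = (q_values - q_values.max()) / tau
    probs = np.exp(z)
    return probs / probs.sum()

# Example: 4 discrete actions. Sampling requires enumerating all of them,
# which is why this scheme does not extend directly to continuous action spaces.
q = np.array([1.0, 2.5, 0.3, 2.4])
pi = soft_greedy_policy(q, tau=0.5)
action = np.random.choice(len(q), p=pi)
```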

Cited by 2 publications (2 citation statements)
References 7 publications
“…Our work bridges the theoretical gap between RL and Max-Ent RL by introducing our Gumbel loss function. Unlike past work in MaxEnt RL (Haarnoja et al., 2018; Eysenbach & Levine, 2020), our method does not require explicit entropy estimation and instead addresses the problem of obtaining soft-value estimates (LogSumExp) in high-dimensional or continuous spaces (Vieillard et al., 2021) by directly modeling them via our proposed Gumbel loss, which to our knowledge has not previously been used in RL. Our loss objective is intrinsically linked to the KL divergence, and similar objectives have been used for mutual information estimation (Poole et al., 2019) and statistical learning (Atiyah et al., 2020).…”
Section: Related Work (mentioning)
confidence: 99%
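The mechanism described in this citing statement, regressing onto soft values (LogSumExp) without computing them explicitly, can be sketched numerically. The snippet below is an illustration only: it assumes a linex-style objective exp(z) - z - 1 with z = (q - v)/beta, which may differ from the exact Gumbel loss used by the citing authors, and the names gumbel_loss, q_samples and beta are hypothetical. The point it demonstrates is that the minimizer of such a loss is a scaled log-mean-exp of the targets, i.e. a soft maximum.

```python
import numpy as np
from scipy.optimize import minimize_scalar

beta = 0.5
q_samples = np.array([1.0, 2.5, 0.3, 2.4])  # sampled Q-targets for one state

def gumbel_loss(v: float) -> float:
    """Linex / Gumbel-regression objective: exp(z) - z - 1, with z = (q - v) / beta."""
    z = (q_samples - v) / beta
    return np.mean(np.exp(z) - z - 1.0)

# The minimizer of this convex loss equals the scaled log-mean-exp of the targets,
# so a function approximator trained with such a loss regresses onto a soft value
# without ever enumerating actions to compute a LogSumExp explicitly.
v_star = minimize_scalar(gumbel_loss).x
soft_value = beta * np.log(np.mean(np.exp(q_samples / beta)))
print(v_star, soft_value)  # the two agree up to optimizer tolerance
```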
“…This naive discretization is problematic in high-dimensional action spaces, as the number of actions grows exponentially with the action dimensionality. To mitigate this phenomenon, a possible strategy is to assume that action dimensions are independent [Tavakoli et al., 2018, Andrychowicz et al., 2020, Vieillard et al., 2021], or to assume some causal dependence between them and use an autoregressive discretization [Metz et al., 2017, Tessler et al., 2019, Tang and Agrawal, 2020]. The AQuaDem framework circumvents the curse of dimensionality as the discretization is based on the demonstrations and hence is dependent on the multimodality of the actions picked by the demonstrator rather than the dimensionality of the action space.…”
Section: Related Work (mentioning)
confidence: 99%
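The exponential blow-up mentioned in this statement is easy to quantify. The following back-of-the-envelope sketch (illustrative only, not code from the cited papers) compares the size of a joint discretization with the per-dimension count obtained under the independence assumption.

```python
# With B bins per dimension and D action dimensions, a joint discretization
# enumerates B**D actions, whereas assuming independent dimensions only
# requires D separate B-way categorical distributions.
bins_per_dim = 11  # e.g. discretizing each action dimension into 11 levels
for action_dim in (1, 3, 6):
    joint = bins_per_dim ** action_dim       # one flat categorical over all combinations
    independent = action_dim * bins_per_dim  # D independent categoricals
    print(f"D={action_dim}: joint={joint:,} vs independent={independent}")
```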