Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/379
DeepMellow: Removing the Need for a Target Network in Deep Q-Learning

Abstract: Deep Q-Network (DQN) is an algorithm that achieves human-level performance in complex domains like Atari games. One of the important elements of DQN is its use of a target network, which is necessary to stabilize learning. We argue that using a target network is incompatible with online reinforcement learning, and it is possible to achieve faster and more stable learning without a target network when we use Mellowmax, an alternative softmax operator. We derive novel properties of Mellowmax, and empirically show…
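For readers skimming this record, a minimal sketch of the Mellowmax operator referred to in the abstract; the function name and the use of scipy's logsumexp are illustrative choices of ours, not code from the paper:

    import numpy as np
    from scipy.special import logsumexp

    def mellowmax(q_values, omega=1.0):
        # Mellowmax of a vector of action values:
        #   mm_omega(q) = log( (1/n) * sum_i exp(omega * q_i) ) / omega
        # computed via logsumexp for numerical stability.
        q = np.asarray(q_values, dtype=np.float64)
        return (logsumexp(omega * q) - np.log(q.size)) / omega

As omega grows, mellowmax approaches the hard max over action values; as omega approaches zero, it approaches their mean.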

Cited by 45 publications (34 citation statements)
References 8 publications
“…However, it is not trivial to integrate Constrained DQN with DDQN and its extension called Weighted Double Q-learning (Zhang et al., 2017), because in these methods the target network is used to decompose the max operation into action selection and action evaluation. To reduce the problem of overestimation, the mellowmax operator (Kim et al., 2019), a variant of soft Q-learning, is promising.…”
Section: Discussion (mentioning, confidence: 99%)
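As a reference point for the decomposition mentioned in this statement, a minimal numpy sketch (our own, with illustrative names) of how Double DQN splits the max into action selection by the online network and action evaluation by the target network, next to the standard DQN target:

    import numpy as np

    def dqn_target(reward, q_target_next, gamma=0.99):
        # Standard DQN: the target network both selects and evaluates the action.
        return reward + gamma * np.max(q_target_next)

    def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
        # Double DQN: the online network selects the action,
        # the target network evaluates it.
        best_action = np.argmax(q_online_next)
        return reward + gamma * q_target_next[best_action]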
“…van Hasselt et al. increased the update interval of the target network from 10,000 to 30,000 steps to reduce overestimation of the action values. It is known that using the target network technique disrupts online reinforcement learning and slows down learning, because values are not propagated unless the target network is updated (Lillicrap et al., 2016; Kim et al., 2019). Consequently, the number of samples required for learning becomes extremely large.…”
Section: Introduction (mentioning, confidence: 99%)
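To make the update schedule in this statement concrete, a hedged sketch (assumed variable names, not code from any of the cited papers) of the periodic hard copy into the target network, alongside a target-network-free bootstrap of the kind DeepMellow advocates:

    import numpy as np

    def maybe_sync_target(step, update_period, online_weights, target_weights):
        # Hard update: copy the online weights into the target network every
        # `update_period` steps; between copies, bootstrap targets computed from
        # the target network do not reflect what the online network has learned.
        if step % update_period == 0:
            return [w.copy() for w in online_weights]
        return target_weights

    def deepmellow_target(reward, q_online_next, omega=5.0, gamma=0.99):
        # Target-network-free bootstrap in the spirit of DeepMellow: next-state
        # values from the online network are softened with mellowmax instead of
        # being evaluated by a lagged target network.
        mm = np.log(np.mean(np.exp(omega * q_online_next))) / omega
        return reward + gamma * mm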
“…The Mellowmax operator is theoretically superior to the maximization operator in that it may reduce the overestimation problem and potentially remove the need for a separate target network in deep Q-learning. In [17], Kim, Asadi, Littman, and Konidaris mainly consider the case ω > 0, but in this thesis we will also include the case ω < 0 when proving its properties. A comparison between algorithms using the Mellowmax operator and those without it will be presented in Sections 4 and 5 to see if it indeed improves Q-learning accuracy.…”
Section: Mellowmax and DeepMellow (mentioning, confidence: 99%)
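For context on the ω > 0 versus ω < 0 distinction raised here, the limiting behaviour that follows directly from the definition of mellowmax (our summary, not a claim quoted from the thesis):

    \mathrm{mm}_{\omega}(\mathbf{x}) = \frac{1}{\omega}\log\!\Big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\Big),
    \qquad
    \lim_{\omega\to\infty}\mathrm{mm}_{\omega}(\mathbf{x}) = \max_i x_i,\quad
    \lim_{\omega\to 0}\mathrm{mm}_{\omega}(\mathbf{x}) = \frac{1}{n}\sum_{i} x_i,\quad
    \lim_{\omega\to -\infty}\mathrm{mm}_{\omega}(\mathbf{x}) = \min_i x_i.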
“…
• For an action a ∈ A_x, the partial derivative with respect to Q(x, a) is given as: …
• Taking the partial derivative with respect to ω, we have: …
In the discipline of deep reinforcement learning, Kim et al. [17] have challenged the importance of the delayed update of a separate target network. Because of the delayed update, the action-value functions are not continually updated, which hinders faster learning.…”
Section: Mellowmax and DeepMellow (mentioning, confidence: 99%)
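The equations referenced by the two bullets in this statement were lost in extraction; the standard derivatives obtained from the definition of mellowmax are reproduced below as a best guess at what those bullets contained:

    \frac{\partial\, \mathrm{mm}_{\omega}(Q(x,\cdot))}{\partial Q(x,a)}
      = \frac{e^{\omega Q(x,a)}}{\sum_{a' \in \mathcal{A}_x} e^{\omega Q(x,a')}},
    \qquad
    \frac{\partial\, \mathrm{mm}_{\omega}(Q(x,\cdot))}{\partial \omega}
      = \frac{1}{\omega}\left(\frac{\sum_{a} Q(x,a)\, e^{\omega Q(x,a)}}{\sum_{a} e^{\omega Q(x,a)}} - \mathrm{mm}_{\omega}(Q(x,\cdot))\right).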