Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence 2019
DOI: 10.24963/ijcai.2019/379
DeepMellow: Removing the Need for a Target Network in Deep Q-Learning

Abstract: Deep Q-Network (DQN) is an algorithm that achieves human-level performance in complex domains like Atari games. One of the important elements of DQN is its use of a target network, which is necessary to stabilize learning. We argue that using a target network is incompatible with online reinforcement learning, and it is possible to achieve faster and more stable learning without a target network when we use Mellowmax, an alternative softmax operator. We derive novel properties of Mellowmax, and empirically show…
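For readers skimming this record, a minimal sketch of the Mellowmax operator referred to in the abstract; the function name and the use of scipy's logsumexp are illustrative choices of ours, not code from the paper:

    import numpy as np
    from scipy.special import logsumexp

    def mellowmax(q_values, omega=1.0):
        # Mellowmax of a vector of action values:
        #   mm_omega(q) = log( (1/n) * sum_i exp(omega * q_i) ) / omega
        # computed via logsumexp for numerical stability.
        q = np.asarray(q_values, dtype=np.float64)
        return (logsumexp(omega * q) - np.log(q.size)) / omega

As omega grows, mellowmax approaches the hard max over action values; as omega approaches zero, it approaches their mean.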

Cited by 45 publications (34 citation statements)
References 8 publications
“…However, it is not trivial to integrate Constrained DQN with DDQN and its extension called Weighted Double Q-learning (Zhang et al., 2017), because in these methods the target network is used to decompose the max operation into action selection and action evaluation. To reduce the problem of overestimation, the mellowmax operator (Kim et al., 2019), a variant of soft Q-learning, is promising.…”
Section: Discussion (mentioning, confidence: 99%)
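As a reference point for the decomposition mentioned in this statement, a minimal numpy sketch (our own, with illustrative names) of how Double DQN splits the max into action selection by the online network and action evaluation by the target network, next to the standard DQN target:

    import numpy as np

    def dqn_target(reward, q_target_next, gamma=0.99):
        # Standard DQN: the target network both selects and evaluates the action.
        return reward + gamma * np.max(q_target_next)

    def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
        # Double DQN: the online network selects the action,
        # the target network evaluates it.
        best_action = np.argmax(q_online_next)
        return reward + gamma * q_target_next[best_action]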
“…van Hasselt et al. increased the update interval of the target network from 10,000 to 30,000 steps to reduce overestimation of the action values. It is known that using the target network technique disrupts online reinforcement learning and slows down learning, because values are not propagated unless the target network is updated (Lillicrap et al., 2016; Kim et al., 2019). Consequently, the number of samples required for learning becomes extremely large.…”
Section: Introduction (mentioning, confidence: 99%)
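To make the update schedule in this statement concrete, a hedged sketch (assumed variable names, not code from any of the cited papers) of the periodic hard copy into the target network, alongside a target-network-free bootstrap of the kind DeepMellow advocates:

    import numpy as np

    def maybe_sync_target(step, update_period, online_weights, target_weights):
        # Hard update: copy the online weights into the target network every
        # `update_period` steps; between copies, bootstrap targets computed from
        # the target network do not reflect what the online network has learned.
        if step % update_period == 0:
            return [w.copy() for w in online_weights]
        return target_weights

    def deepmellow_target(reward, q_online_next, omega=5.0, gamma=0.99):
        # Target-network-free bootstrap in the spirit of DeepMellow: next-state
        # values from the online network are softened with mellowmax instead of
        # being evaluated by a lagged target network.
        mm = np.log(np.mean(np.exp(omega * q_online_next))) / omega
        return reward + gamma * mm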
“…The Mellowmax operator is theoretically superior to the maximization operator in that it may reduce the overestimation problem and potentially remove the need for a separate target network in deep Q-learning. In [17], Kim, Asadi, Littman, and Konidaris mainly consider the case ω > 0, but in this thesis we will also include the case ω < 0 when proving its properties. A comparison between algorithms using the Mellowmax operator and those without it will be presented in Sections 4 and 5 to see if it indeed improves Q-learning accuracy.…”
Section: Mellowmax and DeepMellow (mentioning, confidence: 99%)
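For context on the ω > 0 versus ω < 0 distinction raised here, the limiting behaviour that follows directly from the definition of mellowmax (our summary, not a claim quoted from the thesis):

    \mathrm{mm}_{\omega}(\mathbf{x}) = \frac{1}{\omega}\log\!\Big(\frac{1}{n}\sum_{i=1}^{n} e^{\omega x_i}\Big),
    \qquad
    \lim_{\omega\to\infty}\mathrm{mm}_{\omega}(\mathbf{x}) = \max_i x_i,\quad
    \lim_{\omega\to 0}\mathrm{mm}_{\omega}(\mathbf{x}) = \frac{1}{n}\sum_{i} x_i,\quad
    \lim_{\omega\to -\infty}\mathrm{mm}_{\omega}(\mathbf{x}) = \min_i x_i.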
“…
• For an action a ∈ A_x, the partial derivative with respect to Q(x, a) is given as: …
• Taking the partial derivative with respect to ω, we have: …
In the discipline of deep reinforcement learning, Kim et al. [17] have challenged the importance of the delayed update of a separate target network. Because of the delayed update, the action-value functions are not continually updated, which hinders faster learning.…”
Section: Mellowmax and DeepMellow (mentioning, confidence: 99%)
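The equations referenced by the two bullets in this statement were lost in extraction; the standard derivatives obtained from the definition of mellowmax are reproduced below as a best guess at what those bullets contained:

    \frac{\partial\, \mathrm{mm}_{\omega}(Q(x,\cdot))}{\partial Q(x,a)}
      = \frac{e^{\omega Q(x,a)}}{\sum_{a' \in \mathcal{A}_x} e^{\omega Q(x,a')}},
    \qquad
    \frac{\partial\, \mathrm{mm}_{\omega}(Q(x,\cdot))}{\partial \omega}
      = \frac{1}{\omega}\left(\frac{\sum_{a} Q(x,a)\, e^{\omega Q(x,a)}}{\sum_{a} e^{\omega Q(x,a)}} - \mathrm{mm}_{\omega}(Q(x,\cdot))\right).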