Weighted Double Q-learning

Zhang, Zongzhang; Pan, Zhiyuan; Kochenderfer, Mykel J.

doi:10.24963/ijcai.2017/483

Cited by 78 publications

(36 citation statements)

References 7 publications


“…We plan to introduce the ensemble Q-network [23, 37, 38] or weighted Q estimates [39, 40] to reduce the biases of the estimated Q-values and improve the stability of our algorithm. Another important aspect would be improving the exploration ability, which will further enhance the performance of our algorithm in the face of a stronger opponent bot.…”

Section: Discussion (mentioning)

confidence: 99%

Learning Macromanagement in Starcraft by Deep Reinforcement Learning

Huang, Yin, Zhang, et al. 2021

Sensors


StarCraft is a real-time strategy game that provides a complex environment for AI research. Macromanagement, i.e., selecting appropriate units to build depending on the current state, is one of the most important problems in this game. To reduce the requirements for expert knowledge and enhance the coordination of the systematic bot, we select reinforcement learning (RL) to tackle the problem of macromanagement. We propose a novel deep RL method, Mean Asynchronous Advantage Actor-Critic (MA3C), which computes the approximate expected policy gradient instead of the gradient of the sampled action to reduce the variance of the gradient, and encodes the history queue with a recurrent neural network to tackle the problem of imperfect information. The experimental results show that MA3C achieves a very high win rate, approximately 90%, against the weaker opponents, and that it improves the win rate by about 30% against the stronger opponents. We also propose a novel method to visualize and interpret the policy learned by MA3C. Combining the visualized results with snapshots of games, we find that the learned macromanagement not only adapts to the game rules and the policy of the opponent bot, but also cooperates well with the other modules of MA3C-Bot.


“…However, it is not trivial to integrate Constrained DQN with DDQN and its extension called Weighted Double Q learning (Zhang et al, 2017), because in these methods the target network was used to decompose the max operation into action selection and action evaluation. To reduce the problem of overestimation, the mellowmax operator (Kim et al, 2019) is promising, which is a variant of Soft Q learning.…”

Section: Discussion (mentioning)

confidence: 99%

Constrained Deep Q-Learning Gradually Approaching Ordinary Q-Learning

Ohnishi, Uchibe, Yamaguchi, et al. 2019

Front. Neurorobot.


A deep Q network (DQN) (Mnih et al., 2013) is an extension of Q learning and a typical deep reinforcement learning method. In DQN, a Q function expresses all action values under all states, and it is approximated using a convolutional neural network. Using the approximated Q function, an optimal policy can be derived. In DQN, a target network, which calculates a target value and is updated by the Q function at regular intervals, is introduced to stabilize the learning process. Less frequent updates of the target network would result in a more stable learning process. However, because the target value is not propagated unless the target network is updated, DQN usually requires a large number of samples. In this study, we proposed Constrained DQN, which uses the difference between the outputs of the Q function and the target network as a constraint on the target value. Constrained DQN updates parameters conservatively when the difference between the outputs of the Q function and the target network is large, and it updates them aggressively when this difference is small. In the proposed method, as learning progresses, the number of times that the constraints are activated decreases. Consequently, the update method gradually approaches conventional Q learning. We found that Constrained DQN converges with a smaller training dataset than DQN and that it is robust against changes in the update frequency of the target network and in the settings of a certain parameter of the optimizer. Although Constrained DQN alone does not perform better than integrated approaches or distributed methods, experimental results show that Constrained DQN can be used as an additional component of those methods.
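The target-network mechanism described in this abstract, and the decomposition of the max operation into action selection and action evaluation mentioned in the citation statement, can be illustrated with a minimal NumPy sketch. It is not code from either paper: the function names (`dqn_target`, `double_dqn_target`, `maybe_sync`), the discount factor, and the toy Q-value vectors are all assumptions made for illustration, and plain arrays stand in for network outputs.

```python
import numpy as np

def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """DQN-style target: the target network both selects and evaluates
    the next action through a single max."""
    if done:
        return reward
    return reward + gamma * float(np.max(next_q_target))

def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double-DQN-style target: the online network selects the action,
    the target network evaluates it."""
    if done:
        return reward
    best_action = int(np.argmax(next_q_online))       # action selection
    return reward + gamma * float(next_q_target[best_action])  # action evaluation

def maybe_sync(online_weights, target_weights, step, interval):
    """Copy the online weights into the target network every `interval` steps
    (hypothetical helper; weights are plain arrays here)."""
    if step % interval == 0:
        return online_weights.copy()
    return target_weights

# Toy next-state Q-values for three actions (made-up numbers).
next_q_online = np.array([1.0, 2.5, 2.0])
next_q_target = np.array([1.2, 1.8, 3.0])

y_dqn = dqn_target(1.0, next_q_target, gamma=0.9)
y_ddqn = double_dqn_target(1.0, next_q_online, next_q_target, gamma=0.9)

# The target network is only refreshed on multiples of the interval.
online_w = np.array([0.5, -0.2])
target_w = np.zeros(2)
target_w_step3 = maybe_sync(online_w, target_w, step=3, interval=5)
target_w_step5 = maybe_sync(online_w, target_w, step=5, interval=5)
```

In this toy case the Double-DQN-style target is lower than the DQN-style one, because the action preferred by the online estimate is not the one the target estimate rates highest. That gap is the overestimation effect these methods aim to reduce.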


“…It is apparent that the probability of selecting a state-action pair is increased when an ErrP is absent following 800 ms of occurrence of an ERD/ERS. The proposed work on probabilistic reinforcement learning (PRL), is compared with two variants of Double Q-Learning namely, DQL1 [40] and DQL2 [41], Rainbow Algorithm [42], and Deep Reinforcement Learning (DRL) [43]. Fig.…”

Section: F Classification (mentioning)

confidence: 99%

EEG-Induced Autonomous Game-Teaching to a Robot Arm by Human Trainers Using Reinforcement Learning

et al. 2022


This paper deals with a simple indoor game, where the player has to pass a ball through a ring fixed on a variable pan-tilt platform. The motivation of the research is for a robot arm to learn the gaming actions of an experienced player, so that the robot can subsequently train younger children (trainees). The robot learns the gaming actions of the player at different game states, determined by the pan-tilt orientations of the ring and its radial distance with respect to the player. The actions of the experienced player/expert are defined by six parameters: three junction-coordinates in the right arm of the player and the three-dimensional speed of the ball in a given throw. Reinforcement learning is employed here to adapt a state-action probability matrix of a probabilistic learning automaton based on the reward (or penalty) scores of the player due to the success (or failure) in passing the ball through a given ring. A hybrid brain-computer interface (BCI) is used to detect the failures in the gaming action of the player by natural arousal of the Error-related Potential (ErrP) signal following motor execution, indicated by motor imageries. In the absence (presence) of ErrP after a motor imagination, the system considers the player's trial a success (failure), and thus adapts the probabilities in the learning automaton according to the success/failure of individual game instances. After the convergence of the state-action probability matrix, the same is used for planning, where the action corresponding to the highest probability at a given state in the automaton is selected for execution. The robot can autonomously teach the game to the children using the learning automaton with converged probability scores. Experiments undertaken confirm that the success rate of the robot arm in the motor execution phase is very high (above 90%) when the ring is placed at a moderate distance of 4 feet from the robot.
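The adaptation of a state-action probability matrix by reward and penalty can be sketched as follows. The abstract does not state the update rule, so this example assumes the textbook linear reward-penalty scheme for learning automata. The function name `update_row`, the learning rates `a` and `b`, and the 2-state, 3-action matrix are hypothetical and not taken from the paper.

```python
import numpy as np

def update_row(probs, action, success, a=0.1, b=0.05):
    """Linear reward-penalty update of one state's action probabilities.

    Assumed scheme (not from the paper): on success, probability mass moves
    toward the chosen action; on failure, it moves away from that action and
    is spread over the others. The row keeps summing to 1 in both cases.
    """
    p = np.asarray(probs, dtype=float).copy()
    n = p.size
    others = np.arange(n) != action
    if success:
        p[others] *= (1.0 - a)
        p[action] += a * (1.0 - p[action])
    else:
        p[others] = b / (n - 1) + (1.0 - b) * p[others]
        p[action] *= (1.0 - b)
    return p

# Two game states, three candidate actions, uniform initial probabilities.
P = np.full((2, 3), 1.0 / 3.0)

# State 0, action 1: no ErrP detected, so the trial counts as a success.
P[0] = update_row(P[0], action=1, success=True)
row_after_success = P[0].copy()

# State 0, action 1 again: ErrP detected, so the trial counts as a failure.
P[0] = update_row(P[0], action=1, success=False)

# Planning step: choose the highest-probability action in each state.
planned_actions = P.argmax(axis=1)
```

After the matrix has converged, the planning step reduces to the row-wise argmax shown on the last line, which matches the abstract's description of selecting the highest-probability action at a given state.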
