2019
DOI: 10.1016/j.knosys.2019.03.018
A novel multi-step Q-learning method to improve data efficiency for deep reinforcement learning

Cited by 45 publications (7 citation statements) · References 19 publications
“…As a solution to this, deep reinforcement learning (DRL), which combines deep learning and reinforcement learning, is considered an effective alternative. For example, multi-step learning DQN [24] proposed using the accumulated multi-step reward, rather than a one-step bootstrap, when calculating the target Q value. If Q-learning uses the reward information gathered over n steps before bootstrapping, the amount of computation required for learning is expected to be greatly reduced.…”
Section: Discussion
confidence: 99%
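The n-step bootstrapped target described in this citation statement can be sketched as follows. This is an illustrative reconstruction of the standard multi-step Q-learning target, not code from the cited paper; the function name and signature are assumptions.

```python
def n_step_target(rewards, q_next_max, gamma=0.99):
    """Illustrative sketch: n-step bootstrapped target for Q-learning.

    rewards    : the n observed rewards r_t, ..., r_{t+n-1}
    q_next_max : max_a Q(s_{t+n}, a), typically from a target network
    gamma      : discount factor
    """
    n = len(rewards)
    # Discounted sum of the n intermediate rewards ...
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    # ... plus a single bootstrap from the state reached after n steps,
    # instead of bootstrapping after every single step.
    return g + gamma ** n * q_next_max
```

With n = 1 this reduces to the ordinary one-step DQN target r + γ·max_a Q(s', a); larger n propagates reward information further per update, which is the data-efficiency argument the quoted passage makes.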
“…However, this strategy introduces estimation bias. To minimize variance while keeping the bias small, Schulman et al. [28] proposed the generalized advantage estimator to address this weakness. Going a step further, Schulman et al. [29] introduced trust region policy optimization (TRPO). To extend the approach to large-scale state-space DRL tasks, TRPO parameterizes the policy with deep neural networks and achieves end-to-end control using only the raw input image.…”
Section: Deep Reinforcement Learning Based On the Policy Gradient For...
confidence: 99%
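The generalized advantage estimator mentioned above trades bias for variance via a λ-weighted sum of one-step TD errors. A minimal sketch, assuming episode-length lists of rewards and value estimates (the function name is hypothetical, but the recurrence follows Schulman et al.'s formulation):

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Sketch of generalized advantage estimation (GAE).

    rewards : r_0 ... r_{T-1}
    values  : V(s_0) ... V(s_T)  (length T + 1; last entry bootstraps)
    Computes A_t = sum_l (gamma * lam)^l * delta_{t+l}, where
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t), via a backward pass.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages
```

Setting λ = 0 recovers the low-variance, high-bias one-step TD advantage; λ = 1 recovers the high-variance Monte Carlo advantage, which is the bias/variance dial the quoted passage alludes to.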
“…Additionally, a separate target network is introduced to address oscillation and divergence during learning. Despite DQN's success in diverse applications and its widespread use in autonomous navigation [14,15], it has some drawbacks, notably the overestimation of action values inherent in Q-learning updates. This overestimation arises because the action with the highest value in the Q-network is selected in the next state, and the same Q-network is used both to select actions and to evaluate their values.…”
Section: Introduction
confidence: 99%
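The coupling described above, where one network both selects and evaluates the next action, is what Double DQN decouples. As an illustration of that remedy (van Hasselt et al.'s Double DQN, named here for context and not a method of the cited paper; the function is a hypothetical sketch):

```python
import numpy as np

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.99):
    """Sketch of the Double DQN target.

    The online network *selects* the greedy next action, while the
    separate target network *evaluates* it, which reduces the upward
    bias that arises when one network does both.
    """
    a_star = int(np.argmax(q_online_next))        # selection: online net
    return reward + gamma * q_target_next[a_star]  # evaluation: target net
```

Plain DQN would instead use `gamma * max(q_target_next)`, so any noise-inflated action value is both chosen and used as the target, producing the overestimation the quoted passage describes.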