Metatrace Actor-Critic: Online Step-size Tuning by Meta-gradient Descent for Reinforcement Learning Control

Young, Kenny; Wang, Baoxiang; Taylor, Matthew E.

doi:10.48550/arxiv.1805.04514

Cited by 5 publications

(5 citation statements)

References 0 publications

“…However, this method and its variations were limited to linear supervised learning. The recent movement is to combine incremental Delta-Bar-Delta method and Temporal-difference learning [28]. Those methods can not make tuning hyperparameters online and allow the algorithm to more robustly adjust to non-stationarity in a problem at the same time.…”

Section: Related Work (mentioning)

confidence: 99%

A Framework for History-Aware Hyperparameter Optimisation in Reinforcement Learning

Parra-Ullauri,

Zhen,

García-Domínguez

et al. 2023

Preprint

A Reinforcement Learning (RL) system depends on a set of initial conditions (hyperparameters) that affect the system's performance. However, defining a good choice of hyperparameters is a challenging problem. Hyperparameter tuning often requires manual or automated searches to find optimal values. Nonetheless, a noticeable limitation is the high cost of algorithm evaluation for complex models, making the tuning process computationally expensive and time-consuming. In this paper, we propose a framework based on integrating complex event processing and temporal models, to alleviate these trade-offs. Through this combination, it is possible to gain insights about a running RL system efficiently and unobtrusively based on data stream monitoring and to create abstract representations that allow reasoning about the historical behaviour of the RL system. The obtained knowledge is exploited to provide feedback to the RL system for optimising its hyperparameters while making effective use of parallel resources. We introduce a novel history-aware epsilon-greedy logic for hyperparameter optimisation that instead of using static hyperparameters that are kept fixed for the whole training, adjusts the hyperparameters at runtime based on the analysis of the agent's performance over time windows in a single agent's lifetime. We tested the proposed approach in a 5G mobile communications case study that uses DQN, a variant of RL, for its decision-making. Our experiments demonstrated the effects of hyperparameter tuning using history on training stability and reward values. The encouraging results show that the proposed history-aware framework significantly improved performance compared to traditional hyperparameter tuning approaches.
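As a rough illustration of adjusting a hyperparameter at runtime from performance measured over time windows, the sketch below applies a plain epsilon-greedy rule to a fixed set of candidate values. It is not the authors' framework: the names `choose_value`, `tune_online`, and `run_window`, the candidate set, and the use of the mean window reward as the score are all assumptions made for this example.

```python
import random


def choose_value(candidates, history, epsilon, rng):
    """Pick a hyperparameter value for the next time window.

    history maps each candidate to the list of rewards observed in the
    windows where that candidate was used (hypothetical bookkeeping).
    """
    # Try every candidate once before comparing them.
    untried = [c for c in candidates if not history.get(c)]
    if untried:
        return untried[0]
    # Explore with probability epsilon, otherwise exploit the best mean.
    if rng.random() < epsilon:
        return rng.choice(candidates)
    return max(candidates, key=lambda c: sum(history[c]) / len(history[c]))


def tune_online(candidates, run_window, num_windows, epsilon=0.1, seed=0):
    """Run num_windows windows within a single training run.

    run_window(value) is an assumed callback that trains the agent for one
    window with the given hyperparameter value and returns its mean reward.
    """
    rng = random.Random(seed)
    history = {c: [] for c in candidates}
    for _ in range(num_windows):
        value = choose_value(candidates, history, epsilon, rng)
        history[value].append(run_window(value))
    return history
```

A caller would pass its own `run_window`, for example a function that trains a DQN agent for a fixed number of steps and returns the average episode reward; the per-candidate reward lists in the returned `history` then show which values were favoured over the run.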

“…Meta Reinforcement Learning. Some methods are using the meta-objective (usually the difference of episode return) for reinforcement learning. Meta-knowledge is used to construct loss Zhou et al (2020b); Veeriah et al (2019), reward Jaderberg et al (2019); Zheng et al (2018), or hyperparameters Xu et al (2018b); Young et al (2018). In Xu et al (2018a) a teacher-student strategy is proposed for global exploration.…”

Section: Ablation Study (mentioning)

confidence: 99%

Credit Assignment with Meta-Policy Gradient for Multi-Agent Reinforcement Learning

Shao,

Zhang,

Jiang

et al. 2021

Preprint

Reward decomposition is a critical problem in centralized training with decentralized execution (CTDE) paradigm for multi-agent reinforcement learning. To take full advantage of global information, which exploits the states from all agents and the related environment for decomposing Q values into individual credits, we propose a general meta-learning-based Mixing Network with Meta Policy Gradient (MNMPG) framework to distill the global hierarchy for delicate reward decomposition. The excitation signal for learning global hierarchy is deduced from the episode reward difference between before and after "exercise updates" through the utility network. Our method is generally applicable to the CTDE method using a monotonic mixing network. Experiments on the StarCraft II micromanagement benchmark demonstrate that our method just with a simple utility network is able to outperform the current state-of-the-art MARL algorithms on 4 of 5 super hard scenarios. Better performance can be further achieved when combined with a role-based utility network.

Introduction

Multi-agent deep reinforcement learning algorithms (MARL) have recently shown extraordinary performance in various games like DOTA2 Berner et al. (2019), StarCraft Samvelyan et al. (2019), and Honor of Kings Ye et al. (2020). The framework of centralized training with decentralized execution (CTDE) Gupta et al. (2017); Rashid et al. (2018), which enjoys the advantages of joint action learning Claus & Boutilier (1998) and independent learning Tan (1993), is one of the popular frameworks for solving collaborative multi-agent tasks. Recent research on the CTDE framework can be divided into two categories: One is to enhance agents' ability with emphasis on individually processing local observations, for instance, allocating a role Wang et al. (2020b,c) or a mode of exploration Mahajan et al. (2019) to each agent, which may need extra prior information and consume more computational resources on decentralized execution due to extensive inference network. Another aims to decompose the single reward to each agent accurately, in other words, training a delicate mixing network for local utility values Rashid et al. (2018); Yang et al. (2020); Wang et al. (2020a), or training a new joint action-value to factorizing tasks Son et al. (2019). The latter's empirical performance on challenging tasks is relatively limited by training instability and demand for delicate parameter adjustment.

The current methods on mixing network usually utilize the global information from the full state roughly, such as Rashid et al. (2018) which only takes the full state vector as the input of the mixing network. These methods lack an explicit hierarchy, the distilled information from the full state, to decompose Q values into individual credit. Inspired by Meta-DDPG Xu et al. (2018a), where meta-learning is adopted for exploration, in this paper, we present a general meta-learning-based framework called Mixing Network with Meta Policy Gradient (MNMPG) for exploration on better ...

“…This algorithm used no experience replay or multiple parallel actors, which we refer to as the incremental-online setting. AC(λ) has been previously applied to the ALE in the work of Young, Wang, and Taylor (2018). For AC(λ), we used a similar architecture to the one used in our DQN experiments, except that we replaced the ReLU activation functions with the SiLU and dSiLU activation functions introduced by Elfwing, Uchibe, and Doya (2018).…”

Section: Actor-critic With Eligibility Traces (mentioning)

confidence: 99%

MinAtar: An Atari-Inspired Testbed for Thorough and Reproducible Reinforcement Learning Experiments

Young,

Tian

2019

Preprint

Self Cite

The Arcade Learning Environment (ALE) is a popular platform for evaluating reinforcement learning agents. Much of the appeal comes from the fact that Atari games demonstrate aspects of competency we expect from an intelligent agent and are not biased toward any particular solution approach. The challenge of the ALE includes (1) the representation learning problem of extracting pertinent information from raw pixels, and (2) the behavioural learning problem of leveraging complex, delayed associations between actions and rewards. Often, the research questions we are interested in pertain more to the latter, but the representation learning problem adds significant computational expense. We introduce MinAtar, short for miniature Atari, a new set of environments that capture the general mechanics of specific Atari games while simplifying the representational complexity to focus more on the behavioural challenges. MinAtar consists of analogues of five Atari games: Seaquest, Breakout, Asterix, Freeway and Space Invaders. Each MinAtar environment provides the agent with a 10 × 10 × n binary state representation. Each game plays out on a 10 × 10 grid with n channels corresponding to game-specific objects, such as ball, paddle and brick in the game Breakout. To investigate the behavioural challenges posed by MinAtar, we evaluated a smaller version of the DQN architecture as well as online actor-critic with eligibility traces. With the representation learning problem simplified, we can perform experiments with significantly less computational expense. In our experiments, we use the saved compute time to perform step-size parameter sweeps and more runs than is typical for the ALE. Experiments like this improve reproducibility, and allow us to draw more confident conclusions. We hope that MinAtar can allow researchers to thoroughly investigate behavioural challenges similar to those inherent in the ALE.
Motivation

The Arcade Learning Environment (Bellemare, Naddaf, Veness, & Bowling, 2013) (ALE) has become widely popular as a testbed for reinforcement learning (RL), and other artificial intelligence algorithms, due in part to the following features:

• Nuanced exploration challenges: Atari games provide varying degrees of exploration challenge. In Seaquest, for example, an agent must learn to collect divers and go up for air to avoid running out of oxygen. Neither of these provide immediate reward, but failing to learn this will limit the return to what can be obtained in the time it takes for oxygen to run out. In Breakout, a strong strategy involves trapping the ball between the ceiling and the ...
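The 10 × 10 × n binary state representation mentioned in the abstract above can be pictured with a small sketch. This is not MinAtar's own code or API: the function `encode_state`, the channel names, and the object positions are hypothetical and serve only to show how objects on a grid map to one binary channel each.

```python
GRID_SIZE = 10  # each game plays out on a 10 x 10 grid


def encode_state(objects, channels):
    """Build a GRID_SIZE x GRID_SIZE x n binary state as nested lists.

    objects maps a channel name to the (row, col) cells it occupies;
    channels fixes the order of the n channels. Both are assumptions
    made for this illustration.
    """
    index = {name: i for i, name in enumerate(channels)}
    # Start from an all-zero grid with one slot per channel in every cell.
    state = [
        [[0] * len(channels) for _ in range(GRID_SIZE)]
        for _ in range(GRID_SIZE)
    ]
    for name, cells in objects.items():
        for row, col in cells:
            state[row][col][index[name]] = 1
    return state


# Hypothetical Breakout-like frame with three object types.
channels = ["ball", "paddle", "brick"]
frame = encode_state(
    {"ball": [(4, 5)], "paddle": [(9, 3)], "brick": [(1, 0), (1, 1)]},
    channels,
)
```

Here `frame[row][col]` is a length-n list of zeros and ones, so a cell holding only the ball has a 1 in the `ball` channel and 0 elsewhere.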
