2018
DOI: 10.48550/arxiv.1805.04874
Preprint

GAN Q-learning

Abstract: Distributional reinforcement learning (distributional RL) has seen empirical success in complex Markov Decision Processes (MDPs) in the setting of nonlinear function approximation. However, there are many different ways in which one can leverage the distributional approach to reinforcement learning. In this paper, we propose GAN Q-learning, a novel distributional RL method based on generative adversarial networks (GANs) and analyze its performance in simple tabular environments, as well as OpenAI Gym. We empiri…
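For readers unfamiliar with the setup, the sketch below illustrates the general idea behind a GAN-style distributional Q-learning update: a generator produces samples of the return distribution Z(s, a), and a discriminator is trained to distinguish them from one-step Bellman targets r + γZ(s′, a*). This is a minimal sketch under my own assumptions; the network sizes, losses, and the `update` helper are illustrative, not the paper's implementation.

```python
# Illustrative sketch of a GAN-style distributional Q-learning update.
# All sizes, losses, and the `update` helper are assumptions for
# exposition, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, N_ACTIONS, NOISE_DIM, GAMMA, K = 4, 2, 8, 0.99, 16

class Generator(nn.Module):
    """Maps (state, noise) to one return sample per action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + NOISE_DIM, 64), nn.ReLU(),
            nn.Linear(64, N_ACTIONS))

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))

class Discriminator(nn.Module):
    """Scores a (state, one-hot action, return sample) triple."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + N_ACTIONS + 1, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, s, a_onehot, y):
        return self.net(torch.cat([s, a_onehot, y], dim=-1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def update(s, a, r, s_next, done):
    """One adversarial update on a batch of transitions."""
    b = s.shape[0]
    a_onehot = F.one_hot(a, N_ACTIONS).float()

    # "Real" data: one-step distributional Bellman targets
    # r + gamma * Z(s', a*), with a* greedy w.r.t. the mean return
    # estimated from K generator samples.
    with torch.no_grad():
        z_next = torch.randn(K, b, NOISE_DIM)
        samples = G(s_next.unsqueeze(0).expand(K, -1, -1), z_next)  # (K, b, A)
        a_star = samples.mean(0).argmax(-1, keepdim=True)           # (b, 1)
        y_real = r.unsqueeze(-1) + GAMMA * (1 - done.unsqueeze(-1)) \
            * samples[0].gather(-1, a_star)

    # "Fake" data: the generator's current sample of Z(s, a).
    y_fake = G(s, torch.randn(b, NOISE_DIM)).gather(-1, a.unsqueeze(-1))

    # Discriminator: Bellman targets -> 1, generated returns -> 0.
    loss_d = bce(D(s, a_onehot, y_real), torch.ones(b, 1)) + \
             bce(D(s, a_onehot, y_fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator: fool the discriminator.
    loss_g = bce(D(s, a_onehot, y_fake), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()

# Usage on a random batch of transitions:
s = torch.randn(32, STATE_DIM)
a = torch.randint(0, N_ACTIONS, (32,))
r, done = torch.randn(32), torch.zeros(32)
update(s, a, r, torch.randn(32, STATE_DIM), done)
```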

Cited by 6 publications (7 citation statements)
References 8 publications (10 reference statements)
“…Concurrently, and independently from us, Doan et al (2018) showed a similar equivalence between the distributional Bellman equation and GANs, and used it to develop a GAN Q-learning algorithm. Compared to that work, which did not show any significant improvement of GAN Q-learning over conventional DiRL methods, we show that the GAN approach can be used to tackle multivariate rewards, and use it to develop a novel exploration strategy.…”
Section: Related Work
confidence: 91%
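For context, the distributional Bellman (optimality) equation referenced in this statement, written in the standard notation of the distributional RL literature (my own summary, not a quote from either paper), reads:

```latex
% Distributional Bellman optimality equation: the return Z(s,a) is
% equal in distribution to the reward plus the discounted return
% from the greedy next state-action pair.
Z(s, a) \stackrel{D}{=} R(s, a) + \gamma Z\!\left(S', a^{*}\right),
\qquad S' \sim P(\cdot \mid s, a),
\quad a^{*} = \operatorname*{arg\,max}_{a'} \mathbb{E}\!\left[Z(S', a')\right].
```

The GAN connection exploited by both works follows from this equality in distribution: a generator can be trained so that its samples match the law of the right-hand side.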
“…Traditional fusion of RL and GAN method [9] pays more attention to improving the efficiency of imitation rather than preserving the useful information as described in this paper. In brief, we propose a novel perspective for the application of GAN, which also yields a reliable and effective result.…”
Section: Introduction
confidence: 99%
“…Bellemare, Dabney, and Munos (2017) use a categorical distribution to keep track of the random returns to bolster exploratory actions. In a similar vein, Doan, Mazoure, and Lyle (2018) rely on a generative model to learn the distribution of state-action values. In that case, approximating the return density with a generator allows to…”
Section: Introduction
confidence: 99%
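As a point of reference for the categorical approach mentioned in this statement, Bellemare, Dabney, and Munos (2017) parameterize the return distribution over a fixed support of atoms; in standard notation (my own summary):

```latex
% Categorical (C51) parameterization: a discrete distribution over
% N fixed atoms z_1 < ... < z_N with learned probabilities p_i(s,a).
Z_{\theta}(s, a) = \sum_{i=1}^{N} p_i(s, a)\, \delta_{z_i},
\qquad z_i = V_{\min} + (i - 1)\,\frac{V_{\max} - V_{\min}}{N - 1}.
```

The generator-based alternative pursued by Doan, Mazoure, and Lyle (2018) avoids fixing this support in advance: return samples are produced directly by a network, as in the sketch after the abstract above.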
“…Osband et al (2016) rather rely on an ensemble of neural networks to estimate the uncertainty in the prediction of the value function, allowing to reduce learning times while improving performance. Finally, Doan, Mazoure, and Lyle (2018) consider generative adversarial networks (Goodfellow et al, 2014) to model the distribution of random state-value functions. The current work considers a different approach based on normalizing flows for density estimation.…”
Section: Introduction
confidence: 99%
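For completeness, the density-estimation principle behind the normalizing-flow approach mentioned in this last statement is the change-of-variables formula (standard material, stated in my own notation):

```latex
% Change of variables for an invertible flow f mapping data x to a
% base variable z with known density p_Z (e.g., a standard Gaussian).
\log p_X(x) = \log p_Z\!\big(f(x)\big)
  + \log \left| \det \frac{\partial f(x)}{\partial x} \right|.
```

Unlike a GAN generator, a flow gives an explicit, tractable density for the return, which is what the cited work exploits in place of the adversarial objective.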