Estimating the Maximum Expected Value in Continuous Reinforcement Learning Problems

D’Eramo, Carlo; Nuara, Alessandro; Pirotta, Matteo; Restelli, Marcello

doi:10.1609/aaai.v31i1.10887

Cited by 8 publications

(3 citation statements)

References 8 publications


“…Notice that both Seq(GP-UCB-SW) and Seq(GP-UCB-CD) provide an unbiased estimate of the maximum (or maximum and minimum) contaminant concentrations and temporal location. Such estimates are obtained for each monitoring day through a Monte Carlo approach drawing 100 GP realizations to estimate the probability that each time instant corresponds to the maximum (or minimum) contaminant concentration and using those probabilities to perform a weighted average over the concentrations used to train the GP, similar to what was proposed by D'Eramo et al. [40]…”

Section: Methods (mentioning)

confidence: 99%

Automatic optimization of temporal monitoring schemes dealing with daily water contaminant concentration patterns

Gabrielli

Trovò

Antonelli

2022

Environ. Sci.: Water Res. Technol.
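
The Monte Carlo weighting described in the citation statement above can be illustrated with a minimal sketch. It is not the citing authors' implementation: it replaces the GP realizations with independent Gaussian draws around each sample mean, and the function name `weighted_max_estimate` and its parameters are hypothetical. Each draw votes for the index that attains the maximum, the vote frequencies become weights, and the estimate is the weighted average of the observed means.

```python
import random


def weighted_max_estimate(means, std_errs, n_draws=100, seed=0):
    """Monte Carlo sketch of a weighted estimate of max_i E[X_i].

    Assumption: each mean is perturbed by independent Gaussian noise,
    a simplification of drawing full GP realizations.
    """
    rng = random.Random(seed)
    counts = [0] * len(means)
    for _ in range(n_draws):
        # One simulated realization of every candidate value.
        draws = [rng.gauss(m, s) for m, s in zip(means, std_errs)]
        # Record which candidate is the maximum in this realization.
        best = max(range(len(draws)), key=draws.__getitem__)
        counts[best] += 1
    # Empirical probability that each candidate is the maximum.
    weights = [c / n_draws for c in counts]
    # Weighted average of the observed means.
    estimate = sum(w * m for w, m in zip(weights, means))
    return estimate, weights


means = [0.0, 1.0, 0.5]
estimate, weights = weighted_max_estimate(means, [0.3, 0.3, 0.3])
```

Because the weights are non-negative and sum to one, the estimate always lies between the smallest and largest observed mean; with zero noise it reduces to the plain maximum.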


“…Since underestimation bias is not preferable (Hasselt, 2010; Lan et al., 2020), Weighted Q-learning proposes (D'Eramo et al., 2016; Zhang et al., 2017) the weighted estimator for the maximal action value based on a weighted average of estimated action values. However, the weights computation is only practical in a tabular setting (D'Eramo et al., 2017). Our work differs from the foregoing in that it proposes a new estimator which could be generalized into the deep Q-learning network setting.…”

Section: Related Work (mentioning)

confidence: 98%

Anti-Overestimation Dialogue Policy Learning for Task-Completion Dialogue System

Tian

Yin

Moens

2022

Findings of the Association for Computational Linguistics: NAACL 2022


A dialogue policy module is an essential part of task-completion dialogue systems. Recently, increasing interest has focused on reinforcement learning (RL)-based dialogue policy. Its favorable performance and wise action decisions rely on an accurate estimation of action values. The overestimation problem is a widely known issue of RL since its estimate of the maximum action value is larger than the ground truth, which results in an unstable learning process and suboptimal policy. This problem is detrimental to RL-based dialogue policy learning. To mitigate this problem, this paper proposes a dynamic partial average estimator (DPAV) of the ground truth maximum action value. DPAV calculates the partial average between the predicted maximum action value and minimum action value, where the weights are dynamically adaptive and problem-dependent. We incorporate DPAV into a deep Q-network as the dialogue policy. Our method can achieve better or comparable results compared to top baselines on three dialogue datasets of different domains with a lower computational load. In addition, we also theoretically prove the convergence and derive the upper and lower bounds of the bias compared with those of other methods.
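
The partial average described in this abstract can be sketched in a few lines. This is an illustration only, not the paper's method: the abstract says the weights are dynamically adaptive and problem-dependent, whereas the sketch takes a fixed, hypothetical weight `beta` supplied by the caller, and the function name `partial_average` is likewise an assumption.

```python
def partial_average(q_values, beta):
    """Blend the largest and smallest predicted action values.

    Assumption: `beta` in [0, 1] is a fixed weight chosen by the caller;
    the abstract describes adaptive weights that are not reproduced here.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    # beta = 1 recovers the usual max estimator; smaller beta pulls the
    # estimate toward the minimum, counteracting overestimation.
    return beta * max(q_values) + (1.0 - beta) * min(q_values)


q = [1.0, 3.0, 2.0]
blended = partial_average(q, 0.75)  # 0.75 * 3.0 + 0.25 * 1.0
```

Setting `beta` to 1 gives the standard maximum, so the blend can only lower the estimate relative to it.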


“…To this end, several variants of Q-learning have been developed to handle these challenges. Estimation bias is considered in [10], [11], and the estimation variance and training stability are examined in [12], [13]. The convergence rate is improved in [14], and…” [Talha Bozkus and Urbashi Mitra are with the Ming Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, USA.]

Section: Introduction (mentioning)

confidence: 99%

Ensemble Link Learning for Large State Space Multiple Access Communications

Bozkus

Mitra

2022

2022 30th European Signal Processing Conference (EUSIPCO)


Reinforcement learning (RL) is a classical tool to solve network control or policy optimization problems in unknown environments. The original Q-learning suffers from performance and complexity challenges across very large networks. Herein, a novel model-free ensemble reinforcement learning algorithm which adapts the classical Q-learning is proposed to handle these challenges for networks which admit Markov decision process (MDP) models. Multiple Q-learning algorithms are run on multiple, distinct, synthetically created and structurally related Markovian environments in parallel; the outputs are fused using an adaptive weighting mechanism based on the Jensen-Shannon divergence (JSD) to obtain an approximately optimal policy with low complexity. The theoretical justification of the algorithm, including the convergence of key statistics and Q-functions, is provided. Numerical results across several network models show that the proposed algorithm can achieve up to 55% less average policy error with up to 50% less runtime complexity than the state-of-the-art Q-learning algorithms. Numerical results validate assumptions made in the theoretical analysis.
