Regret Analysis of the Anytime Optimally Confident UCB Algorithm

Lattimore, Tor

doi:10.48550/arxiv.1603.08661

Cited by 2 publications

(2 citation statements)

References 4 publications


“…Theoretically, the epistemic uncertainty enables to converge to zero in tabular [193] and linear MDPs [194], [195] according to the theoretical results. In general MDPs, as the agent learns more about the environment, the uncertainty that encourages exploration gradually decreases to zero, then the confidence set of the MDP posterior will contain the true MDP with a high probability [196], [197].…”

Section: B Open Problems (mentioning)

confidence: 99%

Exploration in Deep Reinforcement Learning: A Comprehensive Survey

Yang, Tang, Bai et al. 2021 (preprint)


Deep Reinforcement Learning (DRL) and Deep Multi-agent Reinforcement Learning (MARL) have achieved significant success across a wide range of domains, including game AI, autonomous vehicles, robotics, finance, healthcare, transportation and so on. However, DRL and deep MARL agents are widely known to be sample-inefficient and millions of interactions are usually needed even for relatively simple game settings, thus preventing the wide application and deployment in real-industry scenarios. One bottleneck challenge behind is the well-known exploration problem, i.e., how to efficiently explore the unknown environments and collect informative experiences that could benefit the policy learning most towards optimal ones.


“…MOSS (Audibert and Bubeck, 2010) makes the confidence bound depend on the number of plays for each bandit by replacing log(t) with log(t/N_i(t)) in Eq. 4, and policies similar to MOSS include OCUCB (Lattimore, 2016) and UCB* (Garivier et al, 2016). UCB† (Lattimore, 2018) improves upon the previous ones significantly by designing a more advanced log function component.…”

Section: Related Work (mentioning)

confidence: 99%

Tuning Confidence Bound for Stochastic Bandits with Bandit Distance

Zhang, Das, Kreutz-Delgado 2021 (preprint)


We propose a novel modification of the standard upper confidence bound (UCB) method for the stochastic multi-armed bandit (MAB) problem which tunes the confidence bound of a given bandit based on its distance to others. Our UCB distance tuning (UCB-DT) formulation enables improved performance as measured by expected regret by preventing the MAB algorithm from focusing on non-optimal bandits which is a well-known deficiency of standard UCB. "Distance tuning" of the standard UCB is done using a proposed distance measure, which we call bandit distance, that is parameterizable and which therefore can be optimized to control the transition rate from exploration to exploitation based on problem requirements. We empirically demonstrate increased performance of UCB-DT versus many existing state-of-the-art methods which use the UCB formulation for the MAB problem. Our contribution also includes the development of a conceptual tool called the Exploration Bargain Point which gives insights into the tradeoffs between exploration and exploitation. We argue that the Exploration Bargain Point provides an intuitive perspective that is useful for comparatively analyzing the performance of UCB-based methods.
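The citation statement quoted for this entry describes MOSS-like policies as replacing log(t) with log(t/N_i(t)) in the confidence bound. A minimal Python sketch of that substitution follows. The listing does not reproduce Eq. 4, so the baseline form (empirical mean plus sqrt(2 log(t) / N_i(t))), the constant 2, the clipping at zero, and both function names are assumptions made for illustration; this is not the exact index from any of the cited papers.

```python
import math


def ucb1_index(mean: float, pulls: int, t: int) -> float:
    """Assumed baseline: empirical mean plus sqrt(2 * log(t) / pulls)."""
    return mean + math.sqrt(2.0 * math.log(t) / pulls)


def moss_style_index(mean: float, pulls: int, t: int) -> float:
    """Same form with log(t) replaced by log(t / pulls).

    The log term is clipped at zero (an assumption) so the square root
    stays defined even if pulls were to exceed t.
    """
    log_term = max(math.log(t / pulls), 0.0)
    return mean + math.sqrt(2.0 * log_term / pulls)


if __name__ == "__main__":
    # Hypothetical arm: empirical mean 0.5, pulled 10 times by round 100.
    print(ucb1_index(0.5, 10, 100))
    print(moss_style_index(0.5, 10, 100))
```

Under these assumptions the exploration bonus of a frequently played arm shrinks faster than under the baseline, because log(t/N_i(t)) is smaller than log(t) whenever the arm has been played more than once.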

