2003
DOI: 10.1007/978-3-540-45167-9_31
Lower Bounds on the Sample Complexity of Exploration in the Multi-armed Bandit Problem

Cited by 162 publications (268 citation statements)
References 8 publications
“…Upper confidence bounds are also central to the design of multi-armed bandit problems in the PAC setting [EDMM06,MT04], where the algorithm's objective is to identify an arm that is ε-optimal with probability at least 1 − δ. Our work adopts a very different feedback model (pairwise comparisons rather than direct observation of payoffs) and a different objective (regret minimization rather than the PAC objective) but there are clear similarities between our IF1 and IF2 algorithms and the Successive Elimination and Median Elimination algorithms developed for the PAC setting in [EDMM06].…”
Section: Related Work
confidence: 99%
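To make the (ε, δ)-PAC objective mentioned in the quotation concrete, here is a minimal, illustrative sketch of a Successive Elimination style procedure for best-arm identification. The function name, the confidence-radius constant, and the stopping rule are assumptions chosen for illustration; they are not taken verbatim from [EDMM06].

```python
import math
import random

def successive_elimination(arms, epsilon, delta):
    """Sketch of Successive Elimination for (epsilon, delta)-PAC best-arm identification.

    arms: list of zero-argument callables returning rewards in [0, 1].
    Returns the index of an arm intended to be epsilon-optimal with
    probability at least 1 - delta (constants are illustrative, not tuned).
    """
    n = len(arms)
    active = list(range(n))      # arms still in contention
    means = [0.0] * n            # empirical mean reward per arm
    t = 0                        # pulls per active arm so far

    while len(active) > 1:
        t += 1
        # Pull every active arm once this round and update its running mean.
        for i in active:
            reward = arms[i]()
            means[i] += (reward - means[i]) / t

        # Hoeffding-style confidence radius (illustrative union-bound constant).
        radius = math.sqrt(math.log(4.0 * n * t * t / delta) / (2.0 * t))
        best = max(means[i] for i in active)

        # Drop arms whose upper confidence bound falls below the leader's
        # lower confidence bound.
        active = [i for i in active if means[i] + radius >= best - radius]

        # Stop once the surviving arms are within epsilon of each other.
        if 2.0 * radius <= epsilon:
            break

    return max(active, key=lambda i: means[i])


# Illustrative usage with three Bernoulli arms (true means 0.5, 0.6, 0.8).
arms = [lambda p=p: 1.0 if random.random() < p else 0.0 for p in (0.5, 0.6, 0.8)]
print(successive_elimination(arms, epsilon=0.1, delta=0.05))
```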
“…The regret of established bandit algorithms such as UCB1 (Auer et al, 2002) is logarithmic in the number of steps, but grows linearly with the number of arms. This is also best possible (Mannor and Tsitsiklis, 2004).…”
Section: Colored Bandits
confidence: 94%
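For reference, below is a minimal sketch of the UCB1 index policy discussed in the quoted passage, assuming rewards in [0, 1]. The exploration constant and bookkeeping follow the standard textbook form; this is not code from Auer et al. (2002).

```python
import math
import random

def ucb1(arms, horizon):
    """Sketch of the UCB1 index policy for stochastic bandits.

    arms: list of zero-argument callables returning rewards in [0, 1].
    Regret grows logarithmically in `horizon` but linearly in the number
    of arms, the dependence the quoted lower bound shows is unavoidable.
    """
    n = len(arms)
    counts = [0] * n             # pulls per arm
    means = [0.0] * n            # empirical mean reward per arm

    for t in range(1, horizon + 1):
        if t <= n:
            arm = t - 1          # initialization: pull each arm once
        else:
            # Choose the arm maximizing the upper confidence index.
            arm = max(range(n),
                      key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]))
        reward = arms[arm]()
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]

    return means, counts


# Illustrative usage with the same Bernoulli arms as above.
arms = [lambda p=p: 1.0 if random.random() < p else 0.0 for p in (0.5, 0.6, 0.8)]
print(ucb1(arms, horizon=5000)[1])   # pull counts should concentrate on the best arm
```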
“…This suggests a PAC-MDP algorithm can be used to learn the bandit with p(a) := p ⊕ 1,a . We then make use of a theorem of Mannor and Tsitsiklis on bandit sample-complexity [MT04] to show that with high probability the number of times a* is not selected is at least…”
Section: Fig 1 Hard MDP
confidence: 99%
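For context on the truncated quotation above (the elided bound is left as in the source): the lower bound of Mannor and Tsitsiklis [MT04] that such arguments invoke states, roughly, that any (ε, δ)-PAC algorithm for an n-armed bandit must draw Ω((n/ε²) log(1/δ)) samples in expectation on some instance; the precise constants and the hard instance family are given in the paper itself.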