2010
DOI: 10.1007/s10998-010-3055-6

UCB revisited: Improved regret bounds for the stochastic multi-armed bandit problem

Abstract: In the stochastic multi-armed bandit problem we consider a modification of the UCB algorithm of Auer et al. [4]. For this modified algorithm we give an improved bound on the regret with respect to the optimal reward. While for the original UCB algorithm the regret in K-armed bandits after T trials is bounded by const · (K log T)/∆, where ∆ measures the distance between a suboptimal arm and the optimal arm, for the modified UCB algorithm we show an upper bound on the regret of const · (K log(T∆²))/∆.
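For context, the baseline referred to in the abstract is the UCB1 index policy of Auer et al. [4], which plays the arm maximizing the empirical mean plus an exploration bonus of sqrt(2 ln t / n_i). Below is a minimal illustrative sketch of that baseline on Bernoulli arms; the function names, the Bernoulli reward model, and the horizon are assumptions made for the example, and the modified algorithm analysed in this paper (which tightens the confidence bounds) is not reproduced here.

```python
import math
import random

def ucb1_select(counts, means, t):
    """Pick the arm with the highest UCB1 index at trial t.

    counts[i] -- number of times arm i has been pulled so far
    means[i]  -- empirical mean reward of arm i
    """
    # Play each arm once before using the index.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # UCB1 index: empirical mean plus an exploration bonus.
    return max(
        range(len(counts)),
        key=lambda i: means[i] + math.sqrt(2.0 * math.log(t) / counts[i]),
    )

def run_ucb1(arm_probs, horizon=10_000, seed=0):
    """Run UCB1 on Bernoulli arms with success probabilities arm_probs."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts, means, total = [0] * k, [0.0] * k, 0.0
    for t in range(1, horizon + 1):
        i = ucb1_select(counts, means, t)
        reward = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
        total += reward
    return total, counts

if __name__ == "__main__":
    reward, pulls = run_ucb1([0.5, 0.55, 0.6])
    print(reward, pulls)
```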

Cited by 238 publications (222 citation statements) · References 9 publications
“…Given K ≥ 2 routes and sequences r_{i,1}, r_{i,2}, … of unknown rewards associated with each route i = 1, …, K, at each trial t = 1, …, n, players select a route I_t and receive the associated reward r_{I_t,t}. Let r*_{i,t} be the best reward possible from route i on trial t (Auer & Ortner, 2010). The regret after n plays I_1, …, I_n is defined by…”
Section: Methods 1: Absolute Regret
confidence: 99%
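The excerpt above stops short of the formula. A standard way to complete it in this notation (this is the usual definition of absolute regret against the best single route in hindsight, not necessarily the citing paper's exact wording) is:

```latex
R_n \;=\; \max_{i \in \{1,\dots,K\}} \sum_{t=1}^{n} r_{i,t} \;-\; \sum_{t=1}^{n} r_{I_t,t}
```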
“…The player is simultaneously attempting to improve their estimate of the options and to exploit this knowledge to maximize their score. Understanding the theory and leveraging published work on N-arm bandits gives insight into available methods for understanding military decision-making (Audibert et al., 2007; Auer et al., 2002; Auer & Ortner, 2010).…”
Section: N-Arm Bandit Problem
confidence: 99%
“…All these models propose selection policies to minimize the agent's regret: the difference between the reward it obtained and how much it could have won if it had always pulled the best arm. All these policies, such as UCB, Poker, and ε-greedy [19,20], strike a compromise between pulling the arm with the best expected reward and pulling another arm in order to increase the agent's knowledge of the reward distributions (known as the exploration-exploitation compromise). In this paper, we propose to draw an analogy between both problems: the selection of agents evaluated by a reputation value and the selection of arms evaluated by an estimated reward function.…”
Section: Related Work
confidence: 99%
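To make the exploration-exploitation compromise described in this excerpt concrete, here is a minimal illustrative sketch of ε-greedy selection, one of the policies it names. The function names, the fixed ε parameter, and the Bernoulli reward model are assumptions made for the example; UCB-style index policies replace the random exploration step with a confidence bonus, as in the sketch after the abstract above.

```python
import random

def epsilon_greedy_select(counts, means, epsilon, rng):
    """With probability epsilon explore a uniformly random arm;
    otherwise exploit the arm with the highest empirical mean."""
    if rng.random() < epsilon or not any(counts):
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda i: means[i])

def run_epsilon_greedy(arm_probs, horizon=10_000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on Bernoulli arms with success probabilities arm_probs."""
    rng = random.Random(seed)
    k = len(arm_probs)
    counts, means, total = [0] * k, [0.0] * k, 0.0
    for _ in range(horizon):
        i = epsilon_greedy_select(counts, means, epsilon, rng)
        reward = 1.0 if rng.random() < arm_probs[i] else 0.0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
        total += reward
    return total, counts

if __name__ == "__main__":
    reward, pulls = run_epsilon_greedy([0.5, 0.55, 0.6])
    print(reward, pulls)
```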
“…However, the learning setting is beyond the scope of this dissertation. Please refer to (Auer and Ortner, 2010; Kuleshov and Precup, 2014) for an overview of BP learning algorithms, and to (Drugan and Nowé, 2013; Yahyaa et al., 2014) for MOBP algorithms.…”
confidence: 99%