Generally, the multi-armed bandit problem has been studied under the setting in which, at each time step over an infinite horizon, a controller chooses to activate a single process or bandit out of a finite collection of independent processes (statistical experiments, populations, etc.) for a single period, receiving a reward that is a function of the activated process and thereby advancing the chosen process. Classically, rewards are discounted by a constant factor β ∈ (0, 1) per round. In this paper, we present a solution to the problem, with potentially non-Markovian, uncountable-state-space reward processes, under a framework in which, first, the discount factors may be non-uniform and vary over time, and second, the periods of activation of each bandit need not be fixed or uniform, being subject instead to a possibly stochastic duration of activation before a change to a different bandit is allowed. The solution is based on generalized restart-in-state indices, and it utilizes a view of the problem not as "decisions over state space" but rather as "decisions over time".
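As a rough illustration (the notation below is ours, not the paper's), the classical objective discounts rewards uniformly, whereas the framework described above replaces the constant discount with time-varying factors and constrains switches between bandits to occur only after a possibly stochastic activation duration:

```latex
% Classical objective: rewards discounted by a constant factor beta per round.
\max_{\pi}\ \mathbb{E}^{\pi}\!\Big[\sum_{t=0}^{\infty} \beta^{t} R_{t}\Big],
\qquad \beta\in(0,1).

% Hedged sketch of the generalized objective described above (notation assumed):
% beta_s are time-varying discount factors, and the active bandit may change
% only at epochs T_0 < T_1 < ... separated by possibly stochastic durations tau_k.
\max_{\pi}\ \mathbb{E}^{\pi}\!\Big[\sum_{t=0}^{\infty}
  \Big(\prod_{s=0}^{t-1}\beta_{s}\Big) R_{t}\Big],
\qquad T_{k+1}=T_{k}+\tau_{k}.
```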
Consider the problem of a controller sampling sequentially from a finite number of N ≥ 2 populations, specified by random variables X^i_k, i = 1, . . . , N and k = 1, 2, . . ., where X^i_k denotes the outcome from population i the k-th time it is sampled. It is assumed that for each fixed i, {X^i_k}_{k≥1} is a sequence of i.i.d. uniform random variables over some interval [a_i, b_i], with the support (i.e., a_i, b_i) unknown to the controller. The objective is to have a policy π for deciding, based on available data, from which of the N populations to sample at any time n = 1, 2, . . . so as to maximize the expected sum of outcomes of n samples or, equivalently, to minimize the regret due to the lack of information about the parameters {a_i} and {b_i}. In this paper, we present a simple UCB-type policy that is asymptotically optimal. Additionally, finite-horizon regret bounds are given.
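The following is a minimal sketch of a generic UCB-style allocation rule of the kind referred to above, assuming an index of the form sample mean plus a standard sqrt(2 log n / n_i) inflation term; the specific asymptotically optimal index proposed in the paper for uniform rewards with unknown support may differ.

```python
import math
import random

class UCBController:
    """Generic UCB-style allocation sketch for N populations.

    The inflation term sqrt(2*log(t)/n_i) is an assumption made for
    illustration, not necessarily the index analyzed in the paper.
    """

    def __init__(self, n_arms):
        self.counts = [0] * n_arms      # times each population was sampled
        self.means = [0.0] * n_arms     # running sample means
        self.t = 0                      # total samples drawn so far

    def select(self):
        # Sample each population once before using the index.
        for i, c in enumerate(self.counts):
            if c == 0:
                return i
        ucb = [m + math.sqrt(2.0 * math.log(self.t) / c)
               for m, c in zip(self.means, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, reward):
        self.t += 1
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Hypothetical usage: two uniform populations with (unknown) supports.
arms = [(0.0, 1.0), (0.2, 0.9)]
ctrl = UCBController(len(arms))
for _ in range(1000):
    i = ctrl.select()
    a, b = arms[i]
    ctrl.update(i, random.uniform(a, b))
```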
The purpose of this paper is to provide further understanding of the structure of the sequential allocation ("stochastic multi-armed bandit", or MAB) problem by establishing probability-one finite-horizon bounds and convergence rates for the sample (or "pseudo") regret associated with two simple classes of allocation policies π. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing and the g-Inflated Sample Mean policies) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bounded from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the "exploration" of the classical "exploration/exploitation" tradeoff.
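One plausible reading of a g-Forcing rule, sketched below under our own assumptions, is that the slowly increasing function g sets a floor on how often each arm must have been sampled: if some arm's count falls below g(n), it is sampled (forced exploration); otherwise the arm with the largest sample mean is played. The threshold and reward model here are illustrative, not the exact policies or analysis of the paper.

```python
import math
import random

def g_forcing_step(g, means, counts, n):
    """One step of a hedged g-Forcing sketch.

    Forces a sample from any population sampled fewer than g(n) times;
    otherwise exploits the largest current sample mean.  The rule
    "counts[i] < g(n)" is an illustrative assumption.
    """
    for i, c in enumerate(counts):
        if c < g(n):
            return i                                   # forced exploration
    return max(range(len(means)), key=means.__getitem__)  # exploitation


# Example run with a slowly increasing g(n) = log(n + 1).
g = lambda n: math.log(n + 1)
true_means = [0.4, 0.6]                 # illustrative Bernoulli reward means
counts = [0, 0]
means = [0.0, 0.0]
for n in range(1, 2001):
    i = g_forcing_step(g, means, counts, n)
    r = 1.0 if random.random() < true_means[i] else 0.0
    counts[i] += 1
    means[i] += (r - means[i]) / counts[i]
```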