Corralling Stochastic Bandit Algorithms

Arora, Raman; Marinov, Teodor V.; Mohri, Mehryar

doi:10.48550/arxiv.2006.09255

Cited by 5 publications

(10 citation statements)

References 2 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Thus if A J outperforms this bound and obtains logarithmic regret, our combiner algorithm will also obtain logarithmic regret, which is not obviously possible using techniques based on the Corral algorithm [8]. Note that this result also appears to improve upon [14] (Theorem 4.2) by removing a log(T ) factor, but this is because we have assumed knowledge of the time horizon T in order to set C i .…”

Section: Gap-dependent Regret Boundsmentioning

confidence: 89%

“…where the third inequality is from equation (13), and the last inequality follows from Lemma 8 Now, t|it=i 2β a t (M i T (i,t)−1 ) −1 a t ≤ C i T (i, T ) α i , where the inequality follows from the second condition for being in I t (14).…”

Section: End If End Formentioning

confidence: 97%

“…As another example of this Theorem in action, let us consider the setting studied by [14]. Specifically, each A i has a putative regret bound of t τ =1 r − r i τ ≤ k i log(t)t for all t ≤ T for some k i , and A J in fact obtains its bound.…”

Section: Gap-dependent Regret Boundsmentioning

confidence: 99%

“…However, the method of [8], even when considered in the stochastic setting as in [9], seems unable to obtain such bounds. Instead, [14] has recently provided a method based on UCB that can achieve such results. Their algorithm is similar to ours, but we devise a somewhat more intricate method that is able to not only obtain the results outlined previously, but also match the logarithmic regret bounds provided by [14].…”

mentioning

confidence: 99%

See 3 more Smart Citations

Upper Confidence Bounds for Combining Stochastic Bandits

Cutkosky,

Das,

Purohit

2020

Preprint

View full text Add to dashboard Cite

We provide a simple method to combine stochastic bandit algorithms. Our approach is based on a "meta-UCB" procedure that treats each of N individual bandit algorithms as arms in a higher-level N -armed bandit problem that we solve with a variant of the classic UCB algorithm. Our final regret depends only on the regret of the base algorithm with the best regret in hindsight. This approach provides an easy and intuitive alternative strategy to the CORRAL algorithm for adversarial bandits, without requiring the stability conditions imposed by CORRAL on the base algorithms. Our results match lower bounds in several settings, and we provide empirical validation of our algorithm on misspecified linear bandit and model selection problems.

show abstract

Section: Gap-dependent Regret Boundsmentioning

confidence: 89%

Section: End If End Formentioning

confidence: 97%

Section: Gap-dependent Regret Boundsmentioning

confidence: 99%

mentioning

confidence: 99%

See 2 more Smart Citations

Upper Confidence Bounds for Combining Stochastic Bandits

Cutkosky,

Das,

Purohit

2020

Preprint

View full text Add to dashboard Cite

show abstract

“…The problem of online model selection for bandit algorithms has received a lot of recent attention, as witnessed by a flurry of recent works (e.g., Agarwal et al [2017], Foster et al [2019], Chatterji et al [2020], , Arora et al [2020], , Foster et al [2020], Lee et al [2020], Bibaut et al [2020], Ghosh et al [2020]).…”

Section: Related Work and Our Contributionmentioning

confidence: 99%

Regret Bound Balancing and Elimination for Model Selection in Bandits and RL

Pacchiano¹,

Dann²,

Gentile³

2020

Preprint

View full text Add to dashboard Cite

We propose a simple model selection approach for algorithms in stochastic bandit and reinforcement learning problems. As opposed to prior work that (implicitly) assumes knowledge of the optimal regret, we only require that each base algorithm comes with a candidate regret bound that may or may not hold during all rounds. In each round, our approach plays a base algorithm to keep the candidate regret bounds of all remaining base algorithms balanced, and eliminates algorithms that violate their candidate bound. We prove that the total regret of this approach is bounded by the best valid candidate regret bound times a multiplicative factor. This factor is reasonably small in several applications, including linear bandits and MDPs with nested function classes, linear bandits with unknown misspecification, and LinUCB applied to linear bandits with different confidence parameters. We further show that, under a suitable gap-assumption, this factor only scales with the number of base algorithms and not their complexity when the number of rounds is large enough. Finally, unlike recent efforts in model selection for linear stochastic bandits, our approach is versatile enough to also cover cases where the context information is generated by an adversarial environment, rather than a stochastic one.

show abstract

Multitask Bandit Learning Through Heterogeneous Feedback Aggregation

Wang,

Zhang,

Singh

et al. 2020

Preprint

View full text Add to dashboard Cite

In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the -multiplayer multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg( ), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise similarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise similarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure.

show abstract

Corralling Stochastic Bandit Algorithms

Cited by 5 publications

References 2 publications

Upper Confidence Bounds for Combining Stochastic Bandits

Upper Confidence Bounds for Combining Stochastic Bandits

Regret Bound Balancing and Elimination for Model Selection in Bandits and RL

Multitask Bandit Learning Through Heterogeneous Feedback Aggregation

Contact Info

Product

Resources

About