Multi-Player Bandits Revisited

Besson, Lilian; Kaufmann, Emilie

doi:10.48550/arxiv.1711.02317

Cited by 4 publications

(8 citation statements)

References 20 publications

(46 reference statements)

Supporting

Mentioning

Contrasting

Order By: Relevance

“…In our system, as the agents have uniform rank across all arms there exists a unique stable matching (which is not true when agents are ranked non-uniformly across arms.) Indeed, in any stable match agent j must match with it's most preferred arm which is not matched with any agent with rank (j − 1) or higher 5 .…”

Section: Problem Settingmentioning

confidence: 99%

“…Thus, agents cannot infer if its actions cause a collision to other higher The arm-means for sub-optimal arms for each agent are chosen i.i.d. uniformly over [0, 0.8], while the arm-mean of i ∈ [5] for agent i was set to 0.9. The rewards are binary.…”

Section: Comparison With Regret Bounds For Related Modelsmentioning

confidence: 99%

“…Figure3: Simulations on a system with 5 agents and 7 arms. For each agent i ∈[5], a permutation over the arms σ i was chosen, and the arm-means are equally spaced among the 7 arms from 0.1 to 0.9 in the increasing order of permutation. This is thus not a OSB instance.…”

mentioning

confidence: 99%

“…This is done for convenience. Our analysis can be easily adapted to any sub-gaussian reward 5. Agents ranked 1 through j − 1…”

mentioning

confidence: 99%

See 3 more Smart Citations

Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship

Sankararaman,

Basu,

Sankararaman

2020

Preprint

View full text Add to dashboard Cite

We study regret minimization problems in a two-sided matching market where uniformly valued demand side agents (a.k.a. agents) continuously compete for getting matched with supply side agents (a.k.a. arms) with unknown and heterogeneous valuations. Such markets abstract online matching platforms (for e.g. UpWork, TaskRabbit) and falls within the purview of matching bandit models introduced in Liu et al. [24]. The uniform valuation in the demand side admits a unique stable matching equilibrium in the system. We design the first decentralized algorithm -UCB with Decentralized Dominant-arm Deletion (UCB-D3), for matching bandits under uniform valuation that does not require any knowledge of reward gaps or time horizon, and thus partially resolves an open question in [24]. UCB-D3 works in phases of exponentially increasing length. In each phase i, an agent first deletes dominated arms -the arms preferred by agents ranked higher than itself. Deletion follows dynamic explore-exploit using UCB algorithm on the remaining arms for 2 i rounds. Finally, the preferred arm is broadcast in a decentralized fashion to other agents through pure exploitation in (N − 1)K rounds with N agents and K arms. Comparing the obtained reward with respect to the unique stable matching, we show that UCB-D3 achieves O(log(T )/∆ 2 ) regret in T rounds, where ∆ is the minimum gap across all agents and arms. We provide a (orderwise) matching regret lower-bound.

show abstract

Section: Problem Settingmentioning

confidence: 99%

Section: Comparison With Regret Bounds For Related Modelsmentioning

confidence: 99%

mentioning

confidence: 99%

“…This is done for convenience. Our analysis can be easily adapted to any sub-gaussian reward 5. Agents ranked 1 through j − 1…”

mentioning

confidence: 99%

See 2 more Smart Citations

Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship

Sankararaman,

Basu,

Sankararaman

2020

Preprint

View full text Add to dashboard Cite

show abstract

“…Due to decentralized and simultaneous policy selection, collisions have to be taken into consideration when more than one player happen to choose the same arm. For this situation, a number of studies [21]- [23] assume that no player receives any reward, while some other studies [20] assume that the colliding players split the reward over the single arm in an arbitrary way. Such a model is frequently used to describe the user-channel matching problem in a CRN, where the channel condition is modeled to be stochastic, partially due to the unpredictable activities of the primary users [16], [25].…”

Section: B Mp-mab For Resource Allocation In Wireless Networkmentioning

confidence: 99%

Decentralized Learning for Channel Allocation in IoT Networks over Unlicensed Bandwidth as a Contextual Multi-player Multi-armed Bandit Game

Wang,

Leshem,

Niyato

et al. 2020

Preprint

View full text Add to dashboard Cite

We study a decentralized channel allocation problem in an ad-hoc Internet of Things (IoT) network underlaying on a spectrum licensed to an existing wireless network. In the considered IoT network, the impoverished computation capability and the limited antenna number on the IoT devices make them difficult to acquire the Channel State Information (CSI) for the multi-channels over the shared spectrum. In addition, in practice, the unknown patterns of the licensed users' transmission activities and the time-varying CSI due to fast fading or mobility of the IoT devices can also cause stochastic changes in the channel quality. Therefore, decentralized IoT links are expected to learn their channel statistics online based on the partial observations, while acquiring no information about the channels that they are not operating on. Meanwhile, they also have to reach an efficient, collision-free solution of channel allocation on the basis of limited coordination or message exchange. Our study maps this problem into a contextual multi-player, multi-arm bandit game, for which we propose a purely decentralized, three-stage policy learning algorithm through trial-and-error. Our theoretical analysis shows that the proposed learning algorithm guarantees the IoT devices to jointly converge to the social-optimal channel allocation with a sub-linear (i.e., polylogarithmic) regret with respect to the operational time. Simulation results demonstrate that the proposed algorithm strikes a good balance between efficient channel allocation and network scalability when compared with the other state-of-the-art distributed multiarmed bandit algorithms.

show abstract

Decentralized Heterogeneous Multi-Player Multi-Armed Bandits With Non-Zero Rewards on Collisions

Magesh

Veeravalli

2022

IEEE Trans. Inform. Theory

View full text Add to dashboard Cite

We consider a fully decentralized multi-player stochastic multi-armed bandit setting where the players cannot communicate with each other and can observe only their own actions and rewards. The environment may appear differently to different players, i.e., the reward distributions for a given arm are heterogeneous across players. In the case of a collision (when more than one player plays the same arm), we allow for the colliding players to receive non-zero rewards. The time-horizon T for which the arms are played is not known to the players. Within this setup, where the number of players is allowed to be greater than the number of arms, we present a policy that achieves near order-optimal expected regret of order O(log 1+δ T ) for δ > 0 (however small) over a time-horizon of duration T .

show abstract

Multi-Player Bandits Revisited

Cited by 4 publications

References 20 publications

Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship

Dominate or Delete: Decentralized Competing Bandits in Serial Dictatorship

Decentralized Learning for Channel Allocation in IoT Networks over Unlicensed Bandwidth as a Contextual Multi-player Multi-armed Bandit Game

Decentralized Heterogeneous Multi-Player Multi-Armed Bandits With Non-Zero Rewards on Collisions

Contact Info

Product

Resources

About