On No-Sensing Adversarial Multi-Player Multi-Armed Bandits With Collision Communications

Shi, Chengshuai; Shen, Cong

doi:10.1109/jsait.2021.3076027

Cited by 9 publications

(7 citation statements)

References 23 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Decentralized multi-player multi-armed bandit (Dec-MPMAB) problems [24] extend the traditional multi-armed bandit (MAB) framework to encompass situations where multiple players interact with a shared set of arms or actions. In the Dec-MPMAB framework, multiple players engage in decision making simultaneously, and they may either compete or cooperate in the allocation of limited resources.…”

Section: Problem Formulationmentioning

confidence: 99%

Distributed Data-Driven Learning-Based Optimal Dynamic Resource Allocation for Multi-RIS-Assisted Multi-User Ad-Hoc Network

Zhang,

2024

Algorithms

View full text Add to dashboard Cite

This study investigates the problem of decentralized dynamic resource allocation optimization for ad-hoc network communication with the support of reconfigurable intelligent surfaces (RIS), leveraging a reinforcement learning framework. In the present context of cellular networks, device-to-device (D2D) communication stands out as a promising technique to enhance the spectrum efficiency. Simultaneously, RIS have gained considerable attention due to their ability to enhance the quality of dynamic wireless networks by maximizing the spectrum efficiency without increasing the power consumption. However, prevalent centralized D2D transmission schemes require global information, leading to a significant signaling overhead. Conversely, existing distributed schemes, while avoiding the need for global information, often demand frequent information exchange among D2D users, falling short of achieving global optimization. This paper introduces a framework comprising an outer loop and inner loop. In the outer loop, decentralized dynamic resource allocation optimization has been developed for self-organizing network communication aided by RIS. This is accomplished through the application of a multi-player multi-armed bandit approach, completing strategies for RIS and resource block selection. Notably, these strategies operate without requiring signal interaction during execution. Meanwhile, in the inner loop, the Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm has been adopted for cooperative learning with neural networks (NNs) to obtain optimal transmit power control and RIS phase shift control for multiple users, with a specified RIS and resource block selection policy from the outer loop. Through the utilization of optimization theory, distributed optimal resource allocation can be attained as the outer and inner reinforcement learning algorithms converge over time. Finally, a series of numerical simulations are presented to validate and illustrate the effectiveness of the proposed scheme.

show abstract

Section: Problem Formulationmentioning

confidence: 99%

Distributed Data-Driven Learning-Based Optimal Dynamic Resource Allocation for Multi-RIS-Assisted Multi-User Ad-Hoc Network

Zhang,

2024

Algorithms

View full text Add to dashboard Cite

show abstract

“…The first class allows no information sharing among players, where players sense the presence of other players through experienced collisions (Anandkumar et al 2011). The other class allows information sharing among players, e.g., directly sharing estimated mean rewards of arms (Liu and Zhao 2010b;Kalathil, Nayyar, and Jain 2014;Rosenski, Shamir, and Szlak 2016;Bistritz and Leshem 2018;Besson and Kaufmann 2018;Boursier and Perchet 2019;Mehrabian et al 2020;Wang et al 2020;Bubeck et al 2020;Lugosi and Mehrabian 2021;Hanawal and Darak 2021;Pacchiano, Bartlett, and Jordan 2021;Shi et al 2020). In particular, the regret guarantees for MPMAB were significantly improved in (Boursier and Perchet 2019) compared to the non-information sharing case.…”

Section: Related Workmentioning

confidence: 99%

“…Specifically, we study a "networked information sharing" setting, where all players are arranged in a network G := {N , E}, and each player has limited capacity for sharing information, e.g., their estimates of the arms' mean rewards with her neighbors in G, as inspired by the original idea of utilizing collisions to share sampled arm rewards in MPMAB settings (Boursier and Perchet 2019; Shi et al 2020). To tackle the new dilemma in the presence of walking arms, we present a decentralized algorithm called MPMAB-WA-UCB, which is able to avoid collisions after sufficient exploration, in a decentralized manner, i.e., each player decides which arm to pull independently based on the local available information: the past observed rewards and collisions, along with the received information from neighbor players.…”

Section: Introductionmentioning

confidence: 99%

Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Xiong

2023

AAAI

View full text Add to dashboard Cite

Multi-player multi-armed bandit is an increasingly relevant decision-making problem, motivated by applications to cognitive radio systems. Most research for this problem focuses exclusively on the settings that players have full access to all arms and receive no reward when pulling the same arm. Hence all players solve the same bandit problem with the goal of maximizing their cumulative reward. However, these settings neglect several important factors in many real-world applications, where players have limited access to a dynamic local subset of arms (i.e., an arm could sometimes be ``walking'' and not accessible to the player). To this end, this paper proposes a multi-player multi-armed walking bandits model, aiming to address aforementioned modeling issues. The goal now is to maximize the reward, however, players can only pull arms from the local subset and only collect a full reward if no other players pull the same arm. We adopt Upper Confidence Bound (UCB) to deal with the exploration-exploitation tradeoff and employ distributed optimization techniques to properly handle collisions. By carefully integrating these two techniques, we propose a decentralized algorithm with near-optimal guarantee on the regret, and can be easily implemented to obtain competitive empirical performance.

show abstract

“…Second, other than the stochastic reward setting, other MAB variants are also worth exploring in the case of multiple players. For example, adversarial rewards are studied in both collision-sensing [26] and no-sensing [42], [43] setting, which is an interesting future direction for the collision-dependent reward model. Third, it is interesting but also challenging to remove Assumption 1 in general.…”

Section: G Other Extensionsmentioning

confidence: 99%

Multi-player Multi-armed Bandits with Collision-Dependent Reward Distributions

Shi,

Shen

2021

Preprint

Self Cite

View full text Add to dashboard Cite

We study a new stochastic multi-player multi-armed bandits (MP-MAB) problem, where the reward distribution changes if a collision occurs on the arm. Existing literature always assumes a zero reward for involved players if collision happens, but for applications such as cognitive radio, the more realistic scenario is that collision reduces the mean reward but not necessarily to zero. We focus on the more practical no-sensing setting where players do not perceive collisions directly, and propose the Error-Correction Collision Communication (EC3) algorithm that models implicit communication as a reliable communication over noisy channel problem, for which random coding error exponent is used to establish the optimal regret that no communication protocol can beat. Finally, optimizing the tradeoff between code length and decoding error rate leads to a regret that approaches the centralized MP-MAB regret, which represents a natural lower bound. Experiments with practical error-correction codes on both synthetic and real-world datasets demonstrate the superiority of EC3. In particular, the results show that the choice of coding schemes has a profound impact on the regret performance.

show abstract

On No-Sensing Adversarial Multi-Player Multi-Armed Bandits With Collision Communications

Cited by 9 publications

References 23 publications

Distributed Data-Driven Learning-Based Optimal Dynamic Resource Allocation for Multi-RIS-Assisted Multi-User Ad-Hoc Network

Distributed Data-Driven Learning-Based Optimal Dynamic Resource Allocation for Multi-RIS-Assisted Multi-User Ad-Hoc Network

Decentralized Stochastic Multi-Player Multi-Armed Walking Bandits

Multi-player Multi-armed Bandits with Collision-Dependent Reward Distributions

Contact Info

Product

Resources

About