Cooperative Stochastic Multi-agent Multi-armed Bandits Robust to Adversarial Corruptions

Liu, Junyan; Li, Shuai; Li, Dapeng

doi:10.48550/arxiv.2106.04207

Cited by 2 publications

(2 citation statements)

References 19 publications


“…Most of their paper involves a different communication model where the agents/clients collaborate via a central server; Section 6 studies a "peer-to-peer" model which is closer to ours but requires additional assumptions on the number of malicious neighbors. A different line of work considers the case where an adversary can corrupt the observed rewards (see, e.g., [11,12,25,26,29,33,39,40,43], and the references therein), which is distinct from the role that malicious agents play in our setting.…”

Section: Other Related Work (mentioning)

confidence: 99%

Robust Multi-Agent Bandits Over Undirected Graphs

Vial, Shakkottai, Srikant

2022

Preprint


We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) log(T)/∆) regret in this setting, where K is the number of arms and ∆ is the arm gap. For m ≪ K, this improves over the single-agent baseline regret of O(K log(T)/∆). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n. In light of this negative result, we propose a new algorithm for which the i-th agent has regret O((d_mal(i) + K/n) log(T)/∆) on any connected and undirected graph, where d_mal(i) is the number of i's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where d_mal(i) = m), and show the effect of malicious agents is entirely local (in the sense that only the d_mal(i) malicious agents directly connected to i affect its long-term regret).
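To make the comparison in the abstract above concrete, the following minimal sketch evaluates the three order-level regret expressions it quotes. It is only an illustration: the hidden constants in the O(·) notation are assumed to be 1, and the function names and the sample values of K, n, m, T, and ∆ are hypothetical, not taken from the paper.

```python
import math

# Order-level regret expressions from the abstract.
# ASSUMPTION: every hidden constant in the O(.) notation is set to 1,
# so only the ratios between these values are meaningful.

def single_agent_bound(K, T, gap):
    """Single-agent baseline: K * log(T) / gap."""
    return K * math.log(T) / gap

def complete_graph_bound(m, K, n, T, gap):
    """Complete graph with m malicious agents: (m + K/n) * log(T) / gap."""
    return (m + K / n) * math.log(T) / gap

def local_bound(d_mal_i, K, n, T, gap):
    """Agent i on a connected undirected graph: (d_mal(i) + K/n) * log(T) / gap."""
    return (d_mal_i + K / n) * math.log(T) / gap

# Hypothetical example values chosen so that m is much smaller than K.
K, n, m, T, gap = 100, 20, 3, 10**6, 0.1

# Fraction of the single-agent bound that remains on the complete graph.
ratio = complete_graph_bound(m, K, n, T, gap) / single_agent_bound(K, T, gap)

# On the complete graph every malicious agent is a neighbor, so d_mal(i) = m
# and the local bound coincides with the complete-graph bound.
same = local_bound(m, K, n, T, gap) == complete_graph_bound(m, K, n, T, gap)
```

Under these assumed values the ratio reduces to (m + K/n)/K, so the log(T)/∆ factor cancels and the improvement depends only on how small m and K/n are relative to K.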


“…IV). Other forms of attacks are also studied [28]-[31], including the "weak attack" model [32]-[34] where attacks are performed before observing actions. Note that the attackers in all these works have no desire to explore the environment, while the reward-teaching server has to actively learn the global model.…”

Section: Related Work (mentioning)

confidence: 99%

An Attackability Perspective on No-Sensing Adversarial Multi-player Multi-armed Bandits

Shi, Shen

2021

2021 IEEE International Symposium on Information Theory (ISIT)


Most of the existing federated multi-armed bandits (FMAB) designs are based on the presumption that clients will implement the specified design to collaborate with the server. In reality, however, it may not be possible to modify the client's existing protocols. To address this challenge, this work focuses on clients who always maximize their individual cumulative rewards, and introduces a novel idea of "reward teaching", where the server guides the clients towards global optimality through implicit local reward adjustments. Under this framework, the server faces two tightly coupled tasks of bandit learning and target teaching, whose combination is non-trivial and challenging. A phased approach, called Teaching-After-Learning (TAL), is first designed to encourage and discourage clients' explorations separately. General performance analyses of TAL are established when the clients' strategies satisfy certain mild requirements. With novel technical approaches developed to analyze the warmstart behaviors of bandit algorithms, particularized guarantees of TAL with clients running UCB or ε-greedy strategies are then obtained. These results demonstrate that TAL achieves logarithmic regrets while only incurring logarithmic adjustment costs, which is order-optimal w.r.t. a natural lower bound. As a further extension, the Teaching-While-Learning (TWL) algorithm is developed with the idea of successive arm elimination to break the non-adaptive phase separation in TAL. Rigorous analyses demonstrate that when facing clients with UCB1, TWL outperforms TAL in terms of the dependencies on sub-optimality gaps thanks to its adaptive design. Experimental results demonstrate the effectiveness and generality of the proposed algorithms.
