2021
DOI: 10.1609/aaai.v35i11.17224

Adaptive Algorithms for Multi-armed Bandit with Composite and Anonymous Feedback

Abstract: We study the multi-armed bandit (MAB) problem with composite and anonymous feedback. In this model, the reward of pulling an arm spreads over a period of time (we call this period the reward interval), and the player successively receives partial rewards of the action, convoluted with rewards from pulling other arms. Existing results on this model require prior knowledge of the reward interval size as an input to their algorithms. In this paper, we propose adaptive algorithms for both the stochastic and the …
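To make the feedback model concrete, below is a minimal simulation sketch, assuming a fixed reward-interval size d: the reward of each pull is split across the next d rounds, and at each round the player observes only the anonymous sum of all parts landing on that round. The class name CompositeAnonymousBandit, the uniform split of the reward, and the Bernoulli arm means are illustrative assumptions, not the paper's algorithm or notation.

```python
import random

class CompositeAnonymousBandit:
    """Toy simulator for MAB with composite and anonymous feedback.

    Assumption (not from the paper): each pull's reward is spread
    uniformly over the next d rounds; the player observes only the
    per-round sum, with no information about which pull produced it.
    """

    def __init__(self, arm_means, d, horizon):
        self.arm_means = arm_means            # expected reward of each arm
        self.d = d                            # reward-interval size
        self.pending = [0.0] * (horizon + d)  # future per-round payouts
        self.t = 0

    def pull(self, arm):
        # Draw the (hidden) reward of this pull.
        reward = 1.0 if random.random() < self.arm_means[arm] else 0.0
        # Spread it uniformly over the next d rounds (an assumption;
        # the model allows other splits within the reward interval).
        for k in range(self.d):
            self.pending[self.t + k] += reward / self.d
        # The player sees only the anonymous aggregate for round t:
        # contributions from this and earlier pulls are mixed together.
        observed = self.pending[self.t]
        self.t += 1
        return observed

# Example: the observed feedback at each round mixes several past pulls.
bandit = CompositeAnonymousBandit(arm_means=[0.2, 0.8], d=3, horizon=10)
for t in range(10):
    obs = bandit.pull(arm=t % 2)
    print(f"round {t}: observed {obs:.2f}")
```

The key difficulty the paper addresses is visible here: a learner who does not know d in advance cannot directly attribute the observed sums to individual pulls, which is why the proposed algorithms must adapt to the reward-interval size.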

Cited by 3 publications (2 citation statements) · References 15 publications
“…Cesa-Bianchi et al (2018) generalized this setting to a case where the reward generated by an action is not simply revealed to the agent at a single instant in the future, but rather spreads over multiple rounds. Recent work along this line is also found in Garg and Akash (2019), Zhang et al (2022), and Wang et al (2021).…”
Section: Related Work (mentioning)
Confidence: 68%
“…Cesa-Bianchi et al [2018] generalized this setting to a case where the reward generated by an action is not simply revealed to the agent at a single instant in the future, but rather spreads over multiple rounds. Recent work along this line is also found in Garg and Akash [2019], Zhang et al [2022], and Wang et al [2021]. In this paper, we consider a contextual setting, which is different from the above ones and poses new challenges since each arm no longer has a fixed reward distribution.…”
Section: Related Work (mentioning)
Confidence: 97%