2022
DOI: 10.48550/arxiv.2203.10214
Preprint

Thompson Sampling on Asymmetric $α$-Stable Bandits

Abstract: In algorithm optimization for reinforcement learning, how to deal with the exploration-exploitation dilemma is particularly important. The multi-armed bandit problem can optimize proposed solutions by changing the reward distribution to realize a dynamic balance between exploration and exploitation. Thompson Sampling is a common method for solving the multi-armed bandit problem and has been used to explore data that conform to various laws. In this paper, we consider the Thompson Sampling approach for multi-armed…
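To make the setting concrete, here is a minimal sketch (not the paper's algorithm) of Thompson Sampling run against arms whose rewards follow asymmetric α-stable distributions. The Gaussian posterior surrogate, the clipping of extreme rewards, and all numeric parameters are assumptions made for illustration; rewards are drawn with scipy's levy_stable.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)

# Illustrative arm parameters (assumptions, not taken from the paper):
# each arm pays an asymmetric alpha-stable reward with a different location.
alpha, beta = 1.8, 0.5            # stability and skewness (asymmetric when beta != 0)
arm_locs = [0.0, 0.3, 0.6]        # location parameter of each arm
K, T = len(arm_locs), 2000

# Gaussian posterior surrogate over each arm's mean (a common TS heuristic;
# the paper's posterior for alpha-stable rewards may be different).
post_mean = np.zeros(K)
post_count = np.ones(K)

for t in range(T):
    # Thompson Sampling: sample a mean for every arm from its posterior, play the argmax.
    samples = rng.normal(post_mean, 1.0 / np.sqrt(post_count))
    k = int(np.argmax(samples))

    # Draw an asymmetric alpha-stable reward for the chosen arm.
    r = levy_stable.rvs(alpha, beta, loc=arm_locs[k], scale=1.0, random_state=rng)
    # Clip extreme draws so the running mean stays stable in practice (a heuristic).
    r = float(np.clip(r, -10.0, 10.0))

    # Incremental update of the posterior mean for the played arm.
    post_count[k] += 1
    post_mean[k] += (r - post_mean[k]) / post_count[k]

print("estimated arm means:", np.round(post_mean, 3))
```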

Cited by 1 publication (3 citation statements)
References 10 publications

“…In UCB-type algorithms, the confidence level δ is set to 4/T², maintaining consistency. The prior parameters and tuning parameters for both TS-type algorithms are configured in accordance with the recommendations provided in [23,39] for the MOTS algorithm and CMS generation. The simulation results with different sizes of p = max_{k ∈ [K]} p_k are shown in Figure 1, Figure 2, and Figure 3.…”
Section: Simulation Results (mentioning)
confidence: 99%
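For context on the quoted setup, the short sketch below shows how a confidence level δ = 4/T² typically enters a UCB-style index. The Hoeffding-style bonus is an illustrative choice; the citing paper's exact index (e.g. a robust, heavy-tailed variant) may differ.

```python
import numpy as np

def ucb_index(mean_hat, n_pulls, T):
    """UCB index using confidence level delta = 4 / T**2, as in the quoted setup.
    The Hoeffding-style exploration bonus is an assumption for illustration."""
    delta = 4.0 / T**2
    bonus = np.sqrt(np.log(1.0 / delta) / (2.0 * np.maximum(n_pulls, 1)))
    return mean_hat + bonus

# Example: indices for three arms after some pulls, over a horizon of T = 1000.
print(ucb_index(np.array([0.4, 0.5, 0.2]), np.array([10, 25, 5]), T=1000))
```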
“…where the function g(·, ϵ) is defined in Lemma 2, and U^µ_k = µ_{A_t} + M^{1/(1+ϵ)} (32 log t / c_k(t))^{ϵ/(1+ϵ)}; end. 'Chambers-Mallows-Stuck (CMS) Generation' is used to rescale the non-zero part to a sub-Gaussian tail. Further details on this can be found in [50,17,39]. Diverging from the standard TS algorithm for Gaussian rewards, we use a clipped Gaussian distribution cl N(µ, σ²; ϑ) := max{N(µ, σ²), ϑ} as the posterior for the non-zero sub-Gaussian part X.…”
Section: Thompson Sampling Approach (mentioning)
confidence: 99%
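The Chambers-Mallows-Stuck (CMS) generator and the clipped Gaussian cl N(µ, σ²; ϑ) mentioned above can be sketched as follows. This is an illustrative implementation of the standard CMS recipe (α ≠ 1 case) and of sampling max{N(µ, σ²), ϑ}, not the citing paper's exact procedure; all parameter values are assumptions.

```python
import numpy as np

def cms_stable(alpha, beta, size, rng):
    """Chambers-Mallows-Stuck generator for standard alpha-stable variates
    (alpha != 1 case only, for brevity); a textbook implementation."""
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)      # uniform angle
    W = rng.exponential(1.0, size)                    # unit exponential
    B = np.arctan(beta * np.tan(np.pi * alpha / 2)) / alpha
    S = (1.0 + beta**2 * np.tan(np.pi * alpha / 2) ** 2) ** (1.0 / (2.0 * alpha))
    return (S * np.sin(alpha * (V + B)) / np.cos(V) ** (1.0 / alpha)
            * (np.cos(V - alpha * (V + B)) / W) ** ((1.0 - alpha) / alpha))

def clipped_gaussian(mu, sigma, lower, rng):
    """Sample from cl N(mu, sigma^2; lower) := max(N(mu, sigma^2), lower),
    the clipped-Gaussian posterior mentioned in the quoted statement."""
    return np.maximum(rng.normal(mu, sigma), lower)

rng = np.random.default_rng(1)
print(cms_stable(alpha=1.5, beta=0.7, size=5, rng=rng))        # illustrative parameters
print(clipped_gaussian(mu=0.2, sigma=0.1, lower=0.0, rng=rng))
```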