2020
DOI: 10.48550/arxiv.2006.05902
Preprint
Q-greedyUCB: a New Exploration Policy for Adaptive and Resource-efficient Scheduling

Yu Zhao,
Joohyun Lee,
Wei Chen

Abstract: This paper proposes a learning algorithm to find a scheduling policy that achieves an optimal delay-power trade-off in communication systems. Reinforcement learning (RL) is used to minimize the expected latency for a given energy constraint where the environments such as traffic arrival rates or channel conditions can change over time. For this purpose, this problem is formulated as an infinite-horizon Markov Decision Process (MDP) with constraints. To handle the constrained optimization problem, we adopt the …
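The abstract's constrained formulation (minimize expected latency subject to an energy constraint) is often handled by folding the constraint into the reward with a Lagrange multiplier before applying unconstrained RL. The abstract is truncated before naming the paper's actual technique, so the following is only a minimal sketch of that common approach; the function name, the cost names, and the multiplier value are all illustrative assumptions:

```python
def lagrangian_reward(delay_cost: float, power_cost: float, lam: float) -> float:
    """Scalarize delay and power into a single reward for unconstrained RL.

    Maximizing this reward is equivalent to minimizing
    delay_cost + lam * power_cost, where lam >= 0 sets the
    delay-power trade-off point.
    """
    return -(delay_cost + lam * power_cost)

# Example: a step incurring 2.0 units of queueing delay and 0.5 units
# of transmit power, with multiplier lam = 1.5.
r = lagrangian_reward(2.0, 0.5, 1.5)  # -> -2.75
```

Sweeping `lam` traces out the delay-power trade-off curve: larger values favor energy savings at the cost of latency.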

Cited by 1 publication (1 citation statement). References 13 publications (24 reference statements).
“…This approach avoids the phenomenon of "extreme unfairness" during the exploration phase, reduces algorithm complexity, and improves system throughput. Yu et al [41] introduced a variant of Q-Learning called Q-greedyUCB, which combines the average reward algorithm of Q-Learning and the UCB algorithm to achieve an optimal delay-power trade-off scheduling strategy in communication systems. Simulation results demonstrate that this algorithm is more effective than ε-greedy and standard Q-Learning in terms of cumulative reward and convergence speed.…”
Section: Related Work
confidence: 99%
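The quoted description (average-reward Q-learning combined with UCB-style exploration) can be sketched as follows. This is not the authors' exact algorithm — the update rule, the simplified gain estimate, and the constant `c` are illustrative assumptions about how such a combination is typically structured:

```python
import math
from collections import defaultdict

class QGreedyUCBSketch:
    """Illustrative average-reward Q-learning with a UCB exploration bonus."""

    def __init__(self, actions, alpha=0.1, beta=0.01, c=2.0):
        self.actions = list(actions)
        self.alpha = alpha            # Q-value step size
        self.beta = beta              # average-reward step size
        self.c = c                    # UCB exploration weight
        self.rho = 0.0                # running estimate of the average reward
        self.q = defaultdict(float)   # Q[(state, action)]
        self.n = defaultdict(int)     # visit counts per (state, action)
        self.t = 0                    # total number of decisions

    def select(self, state):
        """Pick the action maximizing Q plus a count-based UCB bonus."""
        self.t += 1

        def score(a):
            n_sa = self.n[(state, a)]
            if n_sa == 0:
                return float("inf")   # try each action at least once
            return self.q[(state, a)] + self.c * math.sqrt(math.log(self.t) / n_sa)

        return max(self.actions, key=score)

    def update(self, state, action, reward, next_state):
        """Average-reward TD update: Q tracks differential value, rho the gain."""
        self.n[(state, action)] += 1
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        td = reward - self.rho + best_next - self.q[(state, action)]
        self.q[(state, action)] += self.alpha * td
        self.rho += self.beta * (reward - self.rho)  # simplified gain estimate
```

The UCB bonus shrinks as a (state, action) pair accumulates visits, so exploration concentrates on under-sampled actions early on instead of choosing uniformly at random as ε-greedy does, which is the mechanism the citing paper credits for the faster convergence.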