2018
DOI: 10.1109/tsp.2018.2841822

Multi-objective Contextual Multi-armed Bandit With a Dominant Objective

Abstract: In this paper, we propose a new multi-objective contextual multi-armed bandit (MAB) problem with two objectives, where one of the objectives dominates the other objective. Unlike single-objective MAB problems in which the learner obtains a random scalar reward for each arm it selects, in the proposed problem, the learner obtains a random reward vector, where each component of the reward vector corresponds to one of the objectives and the distribution of the reward depends on the context that is provided to the…
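As a rough illustration of the setting the abstract describes (and not the authors' algorithm), a minimal environment stub for context-dependent vector rewards with a dominant objective might look as follows; the names `pull` and `W` and the linear-plus-noise reward form are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms, d = 5, 3
# Hypothetical per-arm weights, one row per objective (unknown to the learner).
W = rng.random((n_arms, 2, d))

def pull(arm: int, context: np.ndarray) -> np.ndarray:
    """Return a 2-dimensional random reward vector whose mean depends on
    the context; component 0 is the dominant objective."""
    mean = W[arm] @ context                        # shape (2,)
    return np.clip(mean + 0.1 * rng.standard_normal(2), 0.0, 1.0)

reward_vec = pull(arm=2, context=rng.random(d))    # a length-2 reward vector
```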


Cited by 32 publications (21 citation statements)
References 30 publications
“…There is already a significant amount of attention given to supervised and unsupervised learning research, but relatively less progress has been made for reinforcement learning [6,7]. The main goal of our study is to demonstrate that quantum neural networks can be used to solve problems in reinforcement learning, adding a quantum solution to the rich collection of classical methods such as ε-greedy, upper confidence bounds (UCB), and Thompson sampling [22][23][24].…”
Section: Contextual Multi-armed Bandit Problem (mentioning)
confidence: 99%
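For concreteness, the ε-greedy baseline named in this excerpt reduces to a few lines; this is the generic textbook rule for a finite-armed bandit (ignoring context), not the quantum method the citing paper proposes:

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_greedy_choice(counts: np.ndarray, sums: np.ndarray, eps: float = 0.1) -> int:
    """Try each arm once, then explore uniformly with probability eps,
    otherwise exploit the arm with the best empirical mean."""
    if counts.min() == 0:
        return int(np.argmin(counts))            # untried arm first
    if rng.random() < eps:
        return int(rng.integers(len(counts)))    # explore
    return int(np.argmax(sums / counts))         # exploit

counts, sums = np.zeros(5), np.zeros(5)  # pulls and cumulative reward per arm
arm = eps_greedy_choice(counts, sums)
# after observing reward r: counts[arm] += 1; sums[arm] += r
```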
“…CMAB is widely used in information services to address the cold-start problem. Existing works on CMAB can be divided into three categories according to the content of the context and the relation between the context and the arm reward [27].…”
Section: B Contextual Multi-armed Bandit (mentioning)
confidence: 99%
“…By using the conditional probability formula, we can derive equation (27). According to Lemma 2, we have $\Pr\{T(j) = t \mid \bar{Q}^j_t \ge \mu_j + \epsilon_j\} \le e^{-2\epsilon_j^2 t}$, so the above equation can be further bounded as:…”
Section: Regret Analysis (mentioning)
confidence: 99%
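The displayed bound has the shape of Hoeffding's inequality. Assuming the rewards $X_s$ of objective $j$ are i.i.d. on $[0,1]$ with mean $\mu_j$, and $\bar{Q}^j_t$ is their empirical mean after $t$ samples, the step presumably being invoked is:

```latex
\Pr\bigl\{\bar{Q}^j_t \ge \mu_j + \epsilon_j\bigr\}
  = \Pr\Bigl\{\tfrac{1}{t}\textstyle\sum_{s=1}^{t} X_s - \mu_j \ge \epsilon_j\Bigr\}
  \le e^{-2\epsilon_j^2 t}.
```

Summed over $t$, the right-hand side yields a convergent series, which is how terms of this form are typically absorbed into a finite regret bound.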
“…Specifically, we adopt the upper confidence bound (UCB) algorithm [21] to enable an MTD to learn the matching preferences and maximize long-term performance while maintaining a well-balanced tradeoff between exploitation and exploration. UCB was originally developed to solve the multi-armed bandit (MAB) problem [22], which involves sequential decision making based on only local information. It was designed for the single-player scenario and thereby inevitably leads to selection conflicts in the multi-player scenario, where multiple MTDs are prone to select the same channel [23].…”
Section: Introduction (mentioning)
confidence: 99%
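For reference, the single-player UCB1 index this excerpt starts from is the empirical mean plus an exploration bonus; a minimal sketch of the standard rule (not the multi-MTD matching scheme the citing paper develops):

```python
import numpy as np

def ucb1_choice(counts: np.ndarray, sums: np.ndarray, t: int) -> int:
    """UCB1: play every arm once, then pick the arm maximizing
    empirical mean + sqrt(2 ln t / n_pulls)."""
    if counts.min() == 0:
        return int(np.argmin(counts))            # initialization phase
    bonus = np.sqrt(2.0 * np.log(t) / counts)    # exploration term
    return int(np.argmax(sums / counts + bonus))
```

The bonus shrinks as an arm's pull count grows, which produces the exploitation-exploration tradeoff the excerpt describes; running this rule independently at each player is exactly what causes the selection conflicts noted in [23].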