2017
DOI: 10.1016/j.neucom.2016.09.141

Softmax exploration strategies for multiobjective reinforcement learning

Abstract: Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely-used approaches to exploration from the single-objective reinforcement learning literature, and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator…
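As a rough illustration of the idea described in the abstract, the sketch below applies softmax (Boltzmann) action selection to a vector-valued Q-table that has first been reduced to scalars by a linear scalarization. The function name, the use of linear weights, and the temperature value are assumptions for illustration only; the paper's actual multiobjective softmax extensions are not visible in the truncated abstract.

```python
import numpy as np

# Hypothetical helper: softmax (Boltzmann) exploration over a vector-valued Q-table
# reduced to scalars by a linear scalarization. Names and weights are illustrative.
def softmax_scalarized_action(q_vectors, weights, temperature, rng):
    """q_vectors: array of shape (n_actions, n_objectives); weights: (n_objectives,)."""
    scalar_q = np.asarray(q_vectors, dtype=float) @ np.asarray(weights, dtype=float)
    prefs = scalar_q / temperature
    prefs -= prefs.max()              # subtract the max for numerical stability
    probs = np.exp(prefs)
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

rng = np.random.default_rng(0)
q_vectors = [[1.0, -3.0], [0.5, -1.0], [2.0, -5.0]]   # per-action (objective 1, objective 2)
print(softmax_scalarized_action(q_vectors, weights=[0.7, 0.3], temperature=0.5, rng=rng))
```

Lower temperatures concentrate the probability mass on the highest scalarized Q-value, while higher temperatures approach uniform random exploration.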

Cited by 49 publications (38 citation statements)
References 36 publications
“…In order to address the issues with a deterministic policy function, a causal entropy regularization method has been utilized [6]-[10]. This is mainly due to the fact that the optimal solution of an MDP with causal entropy regularization becomes a softmax distribution of state-action values Q(s, a), i.e., π(a|s) = exp(Q(s, a)) / Σ_{a'} exp(Q(s, a')), which is often referred to as a soft MDP [11].…”
Section: Introduction
mentioning
confidence: 99%
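The softmax policy and the associated log-sum-exp ("soft") value backup quoted above can be written out directly. The sketch below is a minimal version with an explicit temperature tau (the quoted formula corresponds to tau = 1); the function names are illustrative.

```python
import numpy as np

def soft_policy(q_values, tau=1.0):
    """pi(a|s) = exp(Q(s,a)/tau) / sum_a' exp(Q(s,a')/tau); tau = 1 matches the quoted form."""
    q = np.asarray(q_values, dtype=float) / tau
    q -= q.max()                      # shift for stability; the shift cancels in the ratio
    p = np.exp(q)
    return p / p.sum()

def soft_value(q_values, tau=1.0):
    """Soft state value V(s) = tau * log sum_a exp(Q(s,a)/tau) (log-sum-exp backup)."""
    q = np.asarray(q_values, dtype=float) / tau
    m = q.max()
    return tau * (m + np.log(np.exp(q - m).sum()))

q = [1.0, 2.0, 0.5]
print(soft_policy(q), soft_value(q))
```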
“…The results in Section 7 indicated this did not occur when the target was balanced with regard to both objectives, which suggests the cause is that an unbalanced target leads to some base policies being active only infrequently, resulting in their learning being slowed by more frequent clearing of the eligibility traces. Future work should investigate whether alternative approaches to exploration (such as optimistic initialisation, or exploratory selection at the policy rather than the action level) may ameliorate this issue [33].…”
Section: Discussion
mentioning
confidence: 99%
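As a hedged illustration of one of the alternatives mentioned above, the sketch below shows optimistic initialisation for tabular Q-learning: every entry starts at an assumed upper bound so untried actions remain attractive until they have been sampled. The constants and helper names are placeholders, not values from the paper.

```python
import numpy as np

N_STATES, N_ACTIONS = 25, 4
OPTIMISTIC_VALUE = 100.0              # assumed upper bound on the achievable return

# Every state-action pair starts optimistically high.
Q = np.full((N_STATES, N_ACTIONS), OPTIMISTIC_VALUE)

def greedy_action(state):
    """Greedy selection explores early on, because unvisited actions keep the optimistic value."""
    return int(np.argmax(Q[state]))

def update(state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """One-step Q-learning update; visited pairs are gradually pulled down to realistic values."""
    td_target = reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (td_target - Q[state, action])
```

Because exploration here comes from the initial values rather than from a stochastic policy, it does not interact with the clearing of eligibility traces in the same way.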
“…The Deep Sea Treasure task [2,3,6], which has often been used for testing MORL algorithms, is a bi-objective environment consisting of ten Pareto-optimal states. The Bonus World used in [7] is an original three-objective environment. Another bi-objective environment that has been used to evaluate a novel multi-objective RL algorithm is the Linked Rings problem [3].…”
mentioning
confidence: 99%
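To make the reward-vector structure of such benchmarks concrete, here is a minimal skeleton of a Deep-Sea-Treasure-style environment in which each step returns a two-element reward (treasure value, time penalty of -1). The grid layout, treasure positions, and treasure values are placeholders and do not reproduce the canonical benchmark map.

```python
import numpy as np

# Placeholder treasure map: (row, col) -> value. Not the canonical Deep Sea Treasure layout.
TREASURES = {(1, 0): 1.0, (2, 1): 8.0, (3, 2): 24.0}

class DeepSeaLikeEnv:
    def __init__(self):
        self.pos = (0, 0)

    def step(self, move):                          # move: (d_row, d_col)
        row, col = self.pos
        self.pos = (max(row + move[0], 0), max(col + move[1], 0))
        treasure = TREASURES.get(self.pos, 0.0)
        reward = np.array([treasure, -1.0])        # objective 1: treasure, objective 2: -1 per step
        done = treasure > 0.0                      # episode ends when a treasure is collected
        return self.pos, reward, done
```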
“…The methodological approach. Many of the proposed MORL algorithms use variants of the Q-learning algorithm [2][3][4][5][6][7]. In [5], multi-objectivization is used to create additional objectives alongside the primary goal in order to improve empirical efficiency.…”
mentioning
confidence: 99%
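As a sketch of what a Q-learning variant with vector-valued rewards can look like, the update below keeps a reward vector per state-action pair and uses a fixed linear scalarization (an assumption, only one of several possibilities in the cited works) solely to choose the greedy bootstrap action. The function name and array shapes are illustrative.

```python
import numpy as np

def mo_q_update(Q, s, a, reward_vec, s_next, weights, alpha=0.1, gamma=0.95):
    """Q: array (n_states, n_actions, n_objectives); reward_vec: (n_objectives,)."""
    greedy_next = int(np.argmax(Q[s_next] @ weights))          # action maximizing the scalarized value
    td_target = reward_vec + gamma * Q[s_next, greedy_next]    # vector-valued TD target
    Q[s, a] += alpha * (td_target - Q[s, a])                   # element-wise vector update
    return Q

Q = np.zeros((10, 4, 2))
Q = mo_q_update(Q, s=0, a=1, reward_vec=np.array([1.0, -1.0]), s_next=3,
                weights=np.array([0.7, 0.3]))
```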