Abstract: For reinforcement learning tasks with multiple objectives, it may be advantageous to learn stochastic or non-stationary policies. This paper investigates two novel algorithms for learning non-stationary policies which produce Pareto-optimal behaviour (w-steering and Q-steering), by extending prior work based on the concept of geometric steering. Empirical results demonstrate that both new algorithms offer substantial performance improvements over stationary deterministic policies, while Q-steering significantly…
“…Otherwise an action is selected randomly. This has been the predominant exploration approach adopted in the MORL literature so far [12,15,16,19,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45].…”
Section: Exploration In Multiobjective RL
Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely-used approaches to exploration from the single-objective reinforcement learning literature, and examines their incorporation into multiobjective Q-learning. In particular, this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks which impact the performance of the exploration strategies are identified. It is shown that of the techniques considered, the combination of the novel softmax-epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.
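A minimal sketch of how such exploration strategies can operate over vector-valued Q-estimates is given below, assuming linear scalarisation with a fixed weight vector. The exact softmax-epsilon formulation evaluated in the paper may differ, and the function names (`select_action_eps_greedy`, `select_action_softmax`) are illustrative only.

```python
import numpy as np

def scalarise(q_vec, weights):
    """Linear scalarisation of a vector-valued Q-estimate (an assumed choice;
    other scalarisation functions, e.g. Chebyshev, are also used in MORL)."""
    return np.dot(q_vec, weights)

def select_action_eps_greedy(Q, state, weights, epsilon, rng):
    """Epsilon-greedy over scalarised Q-vectors: greedy w.r.t. the scalarised
    value with probability 1 - epsilon, otherwise a uniformly random action."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    scalar_q = np.array([scalarise(Q[state, a], weights) for a in range(n_actions)])
    return int(np.argmax(scalar_q))

def select_action_softmax(Q, state, weights, temperature, rng):
    """Softmax (Boltzmann) selection over scalarised Q-vectors: action
    probabilities proportional to exp(scalarised Q / temperature)."""
    n_actions = Q.shape[1]
    scalar_q = np.array([scalarise(Q[state, a], weights) for a in range(n_actions)])
    prefs = (scalar_q - scalar_q.max()) / temperature   # shift for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(n_actions, p=probs))
```

Optimistic initialisation, as evaluated in the paper, would then correspond to initialising the entries of `Q` above the attainable returns, so that unvisited actions remain attractive under either selection rule.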
“…In this issue, all the papers use benchmark environments with two or three objectives. The Deep Sea Treasure task [2,3,6] is a bi-objective environment consisting of ten Pareto-optimal states, which has often been used for testing MORL algorithms. The Bonus World used in [7] is an original three-objective environment.…”
“…The Bonus World used in [7] is an original three-objective environment. Another bi-objective environment that has been used to evaluate a novel multi-objective RL algorithm is the Linked Rings problem [3]. Some of the environments used have continuous state variables.…”
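For readers unfamiliar with the structure of these benchmarks, a minimal sketch of the bi-objective reward signal typically used in Deep Sea Treasure follows. The treasure values shown are taken from one widely used version of the benchmark and should be treated as illustrative assumptions, since layouts and values vary across papers.

```python
# Deep Sea Treasure: every step yields a two-element reward vector
# (treasure value, time penalty). The agent trades off reaching a more
# valuable treasure against taking more steps, giving one Pareto-optimal
# policy per treasure location.
TREASURE_VALUES = [1, 2, 3, 5, 8, 16, 24, 50, 74, 124]  # illustrative; versions differ

def step_reward(treasure_collected=0.0):
    """Reward vector for one step: objective 0 is the treasure collected on
    this step (zero except on the terminal step), objective 1 is -1 per step."""
    return (treasure_collected, -1.0)
```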
“…The methodological approach. Many of the proposed MORL algorithms use variants of the Q-learning algorithm [2][3][4][5][6][7]. In [5], multi-objectivization is used to create additional objectives alongside the primary goal in order to improve empirical efficiency.…”
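As a point of reference for these Q-learning variants, here is a minimal sketch of a single tabular multiobjective Q-learning update with vector-valued estimates, assuming linear scalarisation for the greedy bootstrap action; the cited algorithms differ in how they choose the bootstrap action and in the scalarisation used.

```python
import numpy as np

def mo_q_update(Q, s, a, reward_vec, s_next, weights, alpha=0.1, gamma=0.95):
    """One tabular multiobjective Q-learning update.
    Q has shape (n_states, n_actions, n_objectives); reward_vec has one entry
    per objective. The greedy next action is chosen by linearly scalarising
    the next-state Q-vectors (an assumed choice)."""
    scalar_next = Q[s_next] @ weights            # shape (n_actions,)
    a_star = int(np.argmax(scalar_next))         # greedy action under the scalarisation
    td_target = np.asarray(reward_vec) + gamma * Q[s_next, a_star]
    Q[s, a] += alpha * (td_target - Q[s, a])     # element-wise update of the Q-vector
    return Q
```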
“…The empirical performance is improved using multiple importance sampling estimators. In [3], the authors use a variant of geometric steering for multi-objective stochastic games with scalarized reward vectors. The MORL algorithm in [4] is an interesting mixture of on-line learning for the first objective and off-line learning for two independently found secondary objectives.…”
A common approach to addressing multiobjective problems with reinforcement learning is to extend model-free, value-based algorithms such as Q-learning to use a vector of Q-values in combination with an appropriate action selection mechanism, often based on scalarisation. Most prior empirical evaluation of these approaches has focused on deterministic environments. This study examines the impact of stochasticity in rewards and state transitions on the behaviour of multi-objective Q-learning. It shows that the nature of the optimal solution depends on these environmental characteristics, and also on whether we desire to maximise the Expected Scalarised Return (ESR) or the Scalarised Expected Return (SER). We also identify a novel aim which may arise in some applications, namely maximising SER subject to constraints on the variation in return, and show that this may require different solutions from either ESR or conventional SER.

The analysis of the interaction between environmental stochasticity and multiobjective Q-learning is supported by empirical evaluations on several simple multiobjective Markov Decision Processes with varying characteristics. This includes a demonstration of a novel approach to learning deterministic SER-optimal policies for environments with stochastic rewards. In addition, we report a previously unidentified issue with model-free, value-based approaches to multiobjective reinforcement learning in environments with stochastic state transitions. Having highlighted the limitations of value-based model-free MORL methods, we discuss several alternative methods that may be more suitable for maximising SER in MOMDPs with stochastic transitions.
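The ESR/SER distinction can be made concrete with a small numerical example. The sketch below uses the standard definitions (ESR maximises the expectation of the scalarised return, E[f(R)]; SER scalarises the expected return, f(E[R])) together with a hypothetical nonlinear scalarisation function; the return vectors and probabilities are illustrative and not taken from the paper.

```python
import numpy as np

# Hypothetical nonlinear scalarisation: product of the two objectives.
# Any nonlinear f can make ESR and SER diverge; with a linear f they coincide.
def f(ret_vec):
    return ret_vec[..., 0] * ret_vec[..., 1]

# Suppose a stochastic policy/environment yields these episode return vectors,
# each with probability 0.5.
returns = np.array([[10.0, 0.0],
                    [0.0, 10.0]])
probs = np.array([0.5, 0.5])

esr = np.sum(probs * f(returns))                 # E[f(R)] = 0.5*0 + 0.5*0 = 0
ser = f(np.sum(probs[:, None] * returns, 0))     # f(E[R]) = f([5, 5])    = 25
print(esr, ser)
```

This mirrors the paper's point that, once rewards or transitions are stochastic, which policy is optimal depends on whether ESR or SER is the criterion being maximised.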