Neurocomputing

Volume 263, 8 November 2017, Pages 74-86

Softmax exploration strategies for multiobjective reinforcement learning

https://doi.org/10.1016/j.neucom.2016.09.141

Abstract

Despite growing interest over recent years in applying reinforcement learning to multiobjective problems, there has been little research into the applicability and effectiveness of exploration strategies within the multiobjective context. This work considers several widely-used approaches to exploration from the single-objective reinforcement learning literature, and examines their incorporation into multiobjective Q-learning. In particular this paper proposes two novel approaches which extend the softmax operator to work with vector-valued rewards. The performance of these exploration strategies is evaluated across a set of benchmark environments. Issues arising from the multiobjective formulation of these benchmarks which impact on the performance of the exploration strategies are identified. It is shown that of the techniques considered, the combination of the novel softmax–epsilon exploration with optimistic initialisation provides the most effective trade-off between exploration and exploitation.

Introduction

Most reinforcement learning (RL) algorithms consider only a single objective, encoded in a scalar reward. However, many sequential decision-making tasks naturally have multiple conflicting objectives, such as minimising both distance and risk of congestion in path-finding [1], [2], optimising both energy efficiency and quality of communication in wireless networks [3], and control tasks with multiple performance indices [4]. Therefore recent years have seen multiobjective reinforcement learning (MORL) emerge as a growing area of research [5]. One issue which has yet to be investigated to any significant extent is the role of exploration in a multiobjective context. Balancing the exploitation of the agent’s current learning against the potential to improve the current policy through exploratory actions is critical to the performance of an RL agent, particularly in online learning [6]. While there has recently been extensive examination of this trade-off for multiobjective multi-armed bandits (e.g. [7], [8]), there has been no work yet addressing exploration in general multiobjective environments with multiple states.

This paper considers three approaches to exploration which have been widely used in the single-objective reinforcement learning literature (ϵ-greedy exploration, softmax exploration and optimistic initialisation), and examines how they can be applied in the context of multiobjective reinforcement learning. The three methods are incorporated into a multiobjective formulation of the Q(λ) learning algorithm, which in some cases requires modifications to aspects of the exploration algorithm, and then evaluated across three benchmark multiobjective environments. The results of these empirical evaluations demonstrate that exploration in a multiobjective environment differs from exploration in single-objective reinforcement learning, and provide insight into the correct choice of exploration strategy and settings.
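
To make the algorithmic setting concrete, the sketch below shows the core vector-valued backup that multiobjective Q-learning performs, with eligibility traces omitted for brevity; the dictionary-based table, function names and parameter values are illustrative assumptions rather than the paper's exact implementation.

    import numpy as np

    def mo_q_update(Q, s, a, reward_vec, s_next, a_greedy, alpha=0.1, gamma=1.0):
        # One vector-valued Q-learning backup (eligibility traces omitted).
        # Q maps (state, action) to an np.ndarray holding one value per objective;
        # reward_vec is the per-objective reward for this transition; a_greedy is
        # the action judged greedy in s_next (e.g. by TLO).
        target = reward_vec + gamma * Q[(s_next, a_greedy)]
        Q[(s, a)] = Q[(s, a)] + alpha * (target - Q[(s, a)])
        return Q

Each objective's value is updated componentwise towards its own target, so the learned Q-values remain vectors and the choice of exploration strategy only affects how actions are selected from them.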

Section snippets

Background

This section of the paper provides the necessary background information on multiobjective reinforcement learning (and specifically value-based approaches to MORL) and on single-objective exploration methods. Section 3 will build on this background to examine what modifications, if any, are required to apply these exploration methods to multiobjective RL.
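
For reference, the single-objective softmax (Boltzmann) strategy that the later sections extend selects each action with probability proportional to exp(Q(s, a)/τ), where the temperature τ controls how close the distribution is to greedy selection. The sketch below is a generic illustration of that standard rule, not code taken from the paper.

    import numpy as np

    def softmax_action(q_values, temperature=1.0, rng=None):
        # Boltzmann/softmax exploration over scalar Q-values: high temperature
        # approaches uniform random selection, low temperature approaches greedy.
        rng = rng or np.random.default_rng()
        prefs = np.asarray(q_values, dtype=float) / temperature
        prefs -= prefs.max()              # shift for numerical stability
        probs = np.exp(prefs)
        probs /= probs.sum()
        return int(rng.choice(len(probs), p=probs))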

Exploration in multiobjective RL

In this section we will consider how each of the exploration strategies discussed in Section 2.2 can be applied in the context of multiple objectives.

Extending ϵ-greedy exploration to multiple objectives is straightforward. When an action is to be selected greedily, the appropriate multiobjective action-selection operation is used (TLO in our case); otherwise an action is selected randomly. This has been the predominant exploration approach adopted in the MORL literature so far [12], [15], [16]…
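
A minimal sketch of this scheme is shown below, assuming the standard thresholded lexicographic ordering (TLO), in which every objective except the last is clipped at its threshold and the resulting tuples are compared lexicographically; the function names and data layout are illustrative.

    import numpy as np

    def tlo_key(q_vec, thresholds):
        # Thresholded lexicographic ordering: values above a threshold count only
        # as the threshold itself; the final objective is left unconstrained.
        clipped = tuple(min(q, t) for q, t in zip(q_vec[:-1], thresholds))
        return clipped + (q_vec[-1],)

    def epsilon_greedy_tlo(q_vectors, thresholds, epsilon, rng=None):
        # q_vectors holds one value vector per action (e.g. a list of np.ndarrays).
        rng = rng or np.random.default_rng()
        if rng.random() < epsilon:
            return int(rng.integers(len(q_vectors)))        # explore: uniform random
        keys = [tlo_key(q, thresholds) for q in q_vectors]
        return max(range(len(keys)), key=keys.__getitem__)  # exploit: TLO-greedy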

Experimental methodology

The remainder of this paper presents an empirical comparison of the multiobjective exploration strategies described in Section 3. To evaluate the effectiveness of each method in balancing the trade-off between exploration and exploitation, it is necessary to consider both the agent’s performance during learning (its online performance) and the quality of the final greedy policy learnt by the agent (its offline performance). Ideally an agent should perform well on both of these measures.
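
One straightforward way to obtain both measures, sketched here with placeholder agent and environment objects (run_episode and its explore/learn flags are assumptions, not the paper's API), is to interleave learning episodes with periodic greedy evaluation episodes whose experience is not used for learning.

    def learning_curve(env, agent, n_episodes, eval_runs=10):
        # Record the return of each learning episode (online performance) and the
        # mean return of greedy, non-learning episodes (offline performance).
        online, offline = [], []
        for _ in range(n_episodes):
            online.append(agent.run_episode(env, explore=True, learn=True))
            greedy = [agent.run_episode(env, explore=False, learn=False)
                      for _ in range(eval_runs)]
            offline.append(sum(greedy) / eval_runs)
        return online, offline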

ϵ-greedy exploration results

ϵ-greedy exploration is both the simplest form of exploration and the most widely used in the MORL literature, so the results achieved using this approach are presented first as a baseline for comparison with the other approaches. Fig. 6 illustrates both the online and offline error achieved when applying ϵ-greedy exploration to each of the benchmarks with a range of different settings for the initial value of ϵ. It can be seen that for the DST and Bonus World tasks this…

Conclusion and future work

This work has established several benchmarks for evaluating exploration in MORL, and has used these to highlight issues which may arise when applying the simple exploration methods commonly used in single-objective RL in a multiobjective context. In particular it has been shown that the simple combination of optimistic initialisation and greedy action-selection which has been widely used in single-objective RL performs extremely poorly when applied to multiobjective problems in combination with…

References (48)

  • P.C. Fishburn, Utility theory for decision making, Technical Report (1970)
  • C. Liu et al., Multiobjective reinforcement learning: a comprehensive overview, IEEE Trans. Syst., Man, Cybern.: Syst. (2015)
  • Y. Guo et al., A reinforcement learning approach to setting multi-objective goals for energy demand management, Int. J. Agent Technol. Syst. (2009)
  • J. Perez et al., Responsive elastic computing, International Conference on Autonomic Computing (2009)
  • P. Vamplew et al., On the limitations of scalarisation for multi-objective reinforcement learning of Pareto fronts, AI’08: The 21st Australasian Joint Conference on Artificial Intelligence (2008)
  • Z. Gabor et al., Multi-criteria reinforcement learning, The Fifteenth International Conference on Machine Learning (1998)
  • K. Van Moffaert et al., Hypervolume-based multi-objective reinforcement learning, Evolutionary Multi-Criterion Optimization (2013)
  • K. Van Moffaert et al., Scalarized multi-objective reinforcement learning: novel design techniques, Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (2013)
  • P. Geibel, Reinforcement learning for MDPs with constraints, European Conference on Machine Learning (2006)
  • P. Geibel et al., Risk-sensitive reinforcement learning applied to control under constraints, J. Artif. Intell. Res. (2005)
  • R. Issabekov et al., An empirical comparison of two common multiobjective reinforcement learning algorithms, AI 2012: The 25th Australasian Joint Conference on Artificial Intelligence (2012)
  • M.A. Wiering, Explorations in efficient reinforcement learning (1999)
  • J. Asmuth et al., A Bayesian sampling approach to exploration in reinforcement learning, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (2009)
  • M. Grześ, D. Kudenko, Improving optimistic exploration in model-free reinforcement learning, Adaptive and Natural...

Dr Peter Vamplew is an Associate Professor in Information Technology at Federation University Australia. His research interests lie primarily in the field of artificial intelligence, particularly reinforcement learning. For the last decade he has pioneered the extension of reinforcement learning algorithms to problems with multiple objectives.

Dr Richard Dazeley is the Head of Discipline (Information Technology) and a Senior Lecturer at Federation University Australia. He is recognised internationally as a leading researcher in multiobjective reinforcement learning where he has published a number of significant papers. He is also a nationally and internationally leading researcher in the field of Ripple Down Rules (RDR). Richard also has industry experience in the implementation of production rule technology on large government software systems.

Dr Cameron Foale completed his PhD studies in 2010 in the field of acoustics for virtual environments, and subsequently entered the IT industry as a web and games developer in the digital education sector. He returned to teaching and research at Federation University Australia in 2014, with research interests including interactive systems, eHealth and reinforcement learning.
