Softmax exploration strategies for multiobjective reinforcement learning
Introduction
Most reinforcement learning (RL) algorithms consider only a single objective, encoded in a scalar reward. However, many sequential decision-making tasks naturally have multiple conflicting objectives, such as minimising both distance and risk of congestion in path-finding [1], [2], optimising both energy efficiency and quality of communication in wireless networks [3], and control tasks with multiple performance indices [4]. Therefore, recent years have seen multiobjective reinforcement learning (MORL) emerge as a growing area of research [5]. One issue which has yet to be investigated to any significant extent is the role of exploration in a multiobjective context. Balancing the exploitation of the agent’s current learning against the potential to improve the current policy through exploratory actions is critical to the performance of an RL agent, particularly in online learning [6]. While there has recently been extensive examination of this trade-off for multiobjective multi-armed bandits (e.g. [7], [8]), there has been no work yet addressing exploration in general multiobjective environments with multiple states.
This paper considers three approaches to exploration which have been widely used in the single-objective reinforcement learning literature (ϵ-greedy exploration, softmax exploration and optimistic initialisation), and examines how they can be applied in the context of multiobjective reinforcement learning. The three methods are incorporated into a multiobjective formulation of the Q(λ) learning algorithm, which in some cases requires modifications to aspects of the exploration algorithm, and then evaluated across three benchmark multiobjective environments. The results of these empirical evaluations demonstrate that exploration in a multiobjective environment differs from exploration in single-objective reinforcement learning, and provide insight into the correct choice of exploration strategy and settings.
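The multiobjective formulation of Q(λ) referred to above can be illustrated with a minimal tabular sketch. Here Q-values are vectors with one component per objective, the TD error is likewise a vector, greedy actions are chosen by thresholded lexicographic ordering (TLO), and eligibility traces are cut after exploratory actions in the Watkins style. The class and function names, default parameters, and the exact trace scheme are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tlo_best(q_vectors, thresholds):
    """Greedy action under thresholded lexicographic ordering (TLO):
    each objective is clipped at its threshold, then the clipped
    vectors are compared lexicographically (objective 0 first)."""
    clipped = np.minimum(q_vectors, thresholds)
    order = np.lexsort(clipped[:, ::-1].T)  # last key = objective 0 (primary)
    return int(order[-1])

class MOQLambda:
    """Minimal tabular sketch of a multiobjective Watkins' Q(lambda)."""

    def __init__(self, n_states, n_actions, thresholds,
                 alpha=0.1, gamma=0.95, lam=0.9, epsilon=0.1):
        n_objectives = len(thresholds)
        self.Q = np.zeros((n_states, n_actions, n_objectives))
        self.e = np.zeros((n_states, n_actions))   # eligibility traces
        self.thresholds = np.asarray(thresholds, float)
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon

    def select_action(self, s, rng):
        """Epsilon-greedy over the TLO ordering; returns (action, was_greedy)."""
        greedy = tlo_best(self.Q[s], self.thresholds)
        a = greedy
        if rng.random() < self.epsilon:
            a = int(rng.integers(self.Q.shape[1]))
        return a, a == greedy

    def update(self, s, a, reward_vec, s_next, was_greedy, done=False):
        a_star = tlo_best(self.Q[s_next], self.thresholds)
        target = reward_vec if done else reward_vec + self.gamma * self.Q[s_next, a_star]
        delta = target - self.Q[s, a]                      # vector TD error
        self.e[s, a] = 1.0                                 # replacing traces
        self.Q += self.alpha * self.e[:, :, None] * delta  # broadcast over objectives
        if was_greedy and not done:
            self.e *= self.gamma * self.lam
        else:
            self.e[:] = 0.0  # Watkins' cut after exploratory action or episode end
```

Note that the cut of the trace vector after a non-greedy action is what distinguishes Watkins' Q(λ) from Peng's variant; either choice could be combined with the exploration strategies evaluated here.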
Section snippets
Background
This section of the paper provides the necessary background information on multiobjective reinforcement learning (and specifically value-based approaches to MORL), and single-objective exploration methods. Section 3 will build on this background to explore what modifications, if any, are required to adapt these exploration methods to multiobjective RL.
Exploration in multiobjective RL
In this section we will consider how each of the exploration strategies discussed in Section 2.2 can be applied in the context of multiple objectives.
Extending ϵ-greedy exploration to multiple objectives is straightforward. When an action is to be selected greedily, the appropriate multi-objective action-selection operation is used (TLO in our case); otherwise an action is selected randomly. This has been the predominant exploration approach adopted in the MORL literature so far [12], [15], [16].
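Softmax exploration transfers less directly, because a Boltzmann distribution needs a scalar preference per action while the Q-values here are vectors. A minimal sketch is shown below, assuming one illustrative scalarisation — the sum of the threshold-clipped components of each Q-vector — which is an assumption for the sake of the example, not necessarily the scheme used in the paper.

```python
import numpy as np

def mo_softmax_action(q_vectors, thresholds, temperature, rng):
    """Boltzmann (softmax) action selection over vector-valued Q.

    Each Q-vector is collapsed to a scalar score (here: sum of its
    threshold-clipped components, an illustrative choice), then a
    softmax with the given temperature gives selection probabilities.
    """
    scores = np.minimum(q_vectors, thresholds).sum(axis=1)
    scores = scores - scores.max()        # stabilise the exponential
    probs = np.exp(scores / temperature)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs)), probs
```

As in the single-objective case, a high temperature approaches uniform random selection and a low temperature approaches greedy selection under the chosen scalarisation.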
Experimental methodology
The remainder of this paper presents an empirical comparison of the multiobjective exploration strategies described in Section 3. To evaluate the effectiveness of each method in balancing the trade-off between exploration and exploitation it is necessary to consider the agent’s performance both during learning (its online performance) and also the quality of the final greedy policy learnt by the agent (its offline performance). Ideally an agent should perform well on both of these measures.
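These two measures can be collected with a simple experiment harness, sketched below. The callables `train_episode` and `greedy_return`, and the value `optimal_return`, are hypothetical placeholders for whatever learning loop and benchmark optimum are in use; they are not an API from the paper.

```python
import numpy as np

def run_experiment(train_episode, greedy_return, optimal_return,
                   n_episodes=200, eval_every=20):
    """Track online and offline performance during learning.

    - online error: optimal return minus the return actually collected
      while learning (exploratory actions included), per episode;
    - offline error: optimal return minus the return of the current
      greedy policy, sampled every `eval_every` episodes.
    """
    online_err, offline_err = [], []
    for ep in range(n_episodes):
        online_err.append(optimal_return - train_episode())
        if (ep + 1) % eval_every == 0:
            offline_err.append(optimal_return - greedy_return())
    return np.asarray(online_err), np.asarray(offline_err)
```

An agent that explores aggressively will typically show higher online error but may converge to lower offline error, which is exactly the trade-off the experiments below examine.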
ϵ-greedy exploration results
ϵ-greedy exploration is both the simplest form of exploration, and also the most widely-used in the MORL literature. Therefore the results achieved using this approach will be presented first as a baseline for comparison with the other approaches. Fig. 6 illustrates both the online and offline error achieved when applying ϵ-greedy exploration to each of the benchmarks with a range of different settings for the initial value of ϵ. It can be seen that for the DST and Bonus World tasks this
Conclusion and future work
This work has established several benchmarks for evaluating exploration in MORL, and has used these to highlight issues which may arise when applying the simple exploration methods commonly used in single-objective RL in a multiobjective context. In particular it has been shown that the simple combination of optimistic initialisation and greedy action-selection which has been widely used in single-objective RL performs extremely poorly when applied to multiobjective problems in combination with
Dr Peter Vamplew is an Associate Professor in Information Technology at Federation University Australia. His research interests lie primarily in the field of artificial intelligence, particularly reinforcement learning. For the last decade he has pioneered the extension of reinforcement learning algorithms to problems with multiple objectives.
References (48)
- et al., Find multi-objective paths in stochastic networks via chaotic immune PSO, Expert Syst. Appl. (2010)
- et al., Reinforcement learning optimization for base station sleeping strategy in coordinated multipoint (CoMP) communications, Neurocomputing (2015)
- et al., Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions, Neurocomputing (2009)
- et al., Control of exploitation–exploration meta-parameter in reinforcement learning, Neural Netw. (2002)
- et al., Multi-agent multi-objective learning using heuristically accelerated reinforcement learning, 2012 Brazilian Robotics Symposium and Latin American Robotics Symposium (SBR-LARS) (2012)
- et al., Multi-objective path finding in stochastic networks using a biogeography-based optimization method, Simulation (2016)
- et al., A survey of multi-objective sequential decision-making, J. Artif. Intell. Res. (2013)
- Efficient Exploration in Reinforcement Learning, Technical Report CMU-CS-92-102 (1992)
- et al., Exploration versus exploitation trade-off in infinite horizon Pareto multi-armed bandits algorithms, The Seventh Conference of Agents and Artificial Intelligence (2015)
- et al., Designing multi-objective multi-armed bandits algorithms: a study, International Joint Conference on Neural Networks (2013)
- Utility theory for decision making, Technical Report
- Multiobjective reinforcement learning: a comprehensive overview, IEEE Trans. Syst., Man, Cybern.: Syst.
- A reinforcement learning approach to setting multi-objective goals for energy demand management, Int. J. Agent Technol. Syst.
- Responsive elastic computing, International Conference on Autonomic Computing
- On the limitations of scalarisation for multi-objective reinforcement learning of Pareto fronts, AI’08: The 21st Australasian Joint Conference on Artificial Intelligence
- Multi-criteria reinforcement learning, The Fifteenth International Conference on Machine Learning
- Hypervolume-based multi-objective reinforcement learning, Evolutionary Multi-Criterion Optimization
- Scalarized multi-objective reinforcement learning: novel design techniques, Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, IEEE
- Reinforcement learning for MDPs with constraints, European Conference on Machine Learning
- Risk-sensitive reinforcement learning applied to control under constraints, J. Artif. Intell. Res.
- An empirical comparison of two common multiobjective reinforcement learning algorithms, AI2012: The 25th Australasian Joint Conference on Artificial Intelligence
- Explorations in efficient reinforcement learning
- A Bayesian sampling approach to exploration in reinforcement learning, Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence
Dr Richard Dazeley is the Head of Discipline (Information Technology) and a Senior Lecturer at Federation University Australia. He is recognised internationally as a leading researcher in multiobjective reinforcement learning where he has published a number of significant papers. He is also a nationally and internationally leading researcher in the field of Ripple Down Rules (RDR). Richard also has industry experience in the implementation of production rule technology on large government software systems.
Dr Cameron Foale completed his PhD studies in 2010 in the field of acoustics for virtual environments, and subsequently entered the IT industry as a web and games developer in the digital education sector. He returned to teaching and research at Federation University Australia in 2014, with research interests including interactive systems, eHealth and reinforcement learning.