
Neurocomputing

Volume 359, 24 September 2019, Pages 58-68

Multi-agent behavioral control system using deep reinforcement learning

https://doi.org/10.1016/j.neucom.2019.05.062

Abstract

Deep reinforcement learning (DRL) has emerged as the dominant approach to achieving successive advancements in the creation of human-like agents. By leveraging neural networks as decision-making controllers, DRL extends traditional reinforcement learning methods to address the curse of dimensionality in complicated tasks. However, agents in complicated environments are likely to get stuck in sub-optimal solutions. In such cases, the agent inadvertently turns into a “zombie” owing to its short-term vision and harmful behaviors. In this study, we use human learning strategies to adjust agent behaviors in high-dimensional environments. As a result, the agent behaves predictably and succeeds in attaining its designated goal. The contribution of this study is two-fold. First, we introduce a lightweight workflow that enables a nonexpert to preserve a certain level of safety in AI systems. Specifically, the workflow involves a novel concept of a target map and a multi-agent behavioral control system named Multi-Policy Control System (MPCS). MPCS successfully controls agent behaviors in real time without the burden of human feedback. Second, we develop a multi-agent game named Tank Battle that provides a configurable environment to examine agent behaviors and human-agent interactions in DRL. Finally, simulation results show that agents guided by MPCS outperform agents without MPCS in terms of mean total reward and human-like behavior in complicated environments such as Seaquest and Tank Battle.

Introduction

First introduced in the late 1980s, reinforcement learning (RL) has guided research on robotics and autonomous systems with significant success [1], [2]. Since its inception, RL methods have been gaining popularity because an RL agent is capable of mimicking human learning behaviors while it interacts with the environment. However, traditional RL methods cease to function in high-dimensional environments, where the computational cost of action prediction grows drastically with the number of dimensions, leading to the curse of dimensionality. This problem limits the use of RL methods in complicated environments. To overcome this shortcoming, one solution involves using a deep neural network as the function approximator for action prediction, i.e., deep RL. Deep RL yields remarkable results in complicated problems, such as backgammon [3], [4], IBM WATSON’s “Daily-Double” wagering [5], the Atari domain [6], the game of Go [7], and complex behaviors in unmanned vehicles [8], [9].

It is undeniable that sophisticated AI systems are increasingly beneficial for humanity. However, it is also critical to consider risk factors in designing an AI system, especially in human-machine systems [10], [11]. Furthermore, Google Brain’s safety team [12] indicates that various unintended factors, such as wrong objective functions, poorly curated training data, and an insufficiently expressive model, produce accidents and unexpected agent behaviors in machine learning systems. In such situations, the training agent is likely to pursue a sub-optimal solution. In extreme cases, the agent is unable to determine its actual goal and turns into a “zombie” because of its harmful behaviors. Therefore, ensuring safety in AI systems is a challenging task and practically requires an expert’s presence. In this study, we consider a safety level in the sense of how well the agent behaves when there is human interaction. We also introduce a lightweight workflow that enables a certain level of safety in AI systems using deep RL. The solution is able to

  1. adjust agent behaviors to follow a designated goal,

  2. enable nonexperts to train agents,

  3. ease cooperation between humans and agents and among agents in complicated environments,

  4. control agent behaviors with minimal effort, and

  5. scale to large systems.

The Arcade Learning Environment (ALE) [13] is a normative testbed for deep RL in the Atari domain. However, the ALE is a ROM-based single-agent emulator that lacks customizable features, which restricts our examination of agent behaviors and human-agent interactions in multi-agent settings. We therefore develop a multi-agent environment named Tank Battle, which we describe in detail in Appendix A.

In this study, we examine our proposed schemes in two complicated environments: the well-studied game Seaquest in the Atari series [14] and our proposed Tank Battle game. The schemes can also be applied to other environments in a similar manner. In Atari Seaquest, the player controls a submarine (yellow) that is an underwater shooter, as illustrated in Fig. 1. The goal of the game is to destroy sharks (pink) and enemy submarines (beige) and to rescue divers (dark gray) in the sea. The player’s submarine can hold up to eight divers. When the submarine holds the maximum number of divers, it can release all divers by resurfacing to obtain a bonus reward. The submarine has to resurface frequently to avoid oxygen depletion. However, the player loses a life if the submarine resurfaces without any rescued divers. The player obtains a reward of 20 by shooting an enemy, but the player does not get any reward for rescuing divers. This biased information misleads the agent into maximizing the accumulated reward by only shooting enemies until oxygen depletion. Therefore, the agent gets stuck in a sub-optimal solution irrespective of the length of training. The experimental results in [15] indicate that an agent trained to play Atari Seaquest with Asynchronous Advantage Actor-Critic (A3C) for 3 days attains no improvement over an agent trained for 1 day. Because the agent gets stuck in a sub-optimal solution, it exhibits a performance that is 75% lower than human-level performance [6], [15]. We can observe this phenomenon in the online video at https://youtu.be/-008vWYZGTE. In this study, by following human learning strategies, the agent does not get stuck in a sub-optimal solution and is thus capable of performing on par with competent human-level play in only 8 hours of training. To realize this solution, we introduce the concept of a target map and a behavioral control system named Multi-Policy Control System (MPCS). The target map and MPCS are the key factors that realize a lightweight workflow towards building a predictable AI system. We conduct a performance evaluation of our proposed schemes in Atari Seaquest in Section 4.2.
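A minimal sketch (not the authors’ code) of how the biased Seaquest reward signal can be inspected through the OpenAI Gym Atari bindings; the environment id and the classic pre-0.26 Gym step API used here are assumptions about the installed toolkit.

    # Sketch: every positive reward in Seaquest comes from shooting an enemy,
    # so a purely reward-maximising agent ignores divers and oxygen.
    import gym

    env = gym.make("SeaquestDeterministic-v4")
    obs = env.reset()

    total_reward, shooting_reward = 0.0, 0.0
    done = False
    while not done:
        action = env.action_space.sample()            # placeholder policy
        obs, reward, done, info = env.step(action)    # classic 4-tuple API
        total_reward += reward
        if reward > 0:                                # +20 per destroyed enemy
            shooting_reward += reward
    env.close()

    # total_reward == shooting_reward: rescuing divers is never rewarded directly.
    print(total_reward, shooting_reward)

Because every positive reward comes from destroying an enemy, a purely reward-maximizing agent has no incentive to rescue divers or to resurface in time, which is precisely the sub-optimal behavior described above.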

The present study deals with complicated problems by adjusting agent behaviors to follow human learning strategies. Firstly, humans retain essential training information and combine it with intrinsic preferences to seek a suitable strategy for achieving the actual goal. In Atari Seaquest, for example, a desired agent controls the submarine to shoot enemies and accumulate rewards, but it should resurface when the oxygen level is low. The player thereby prolongs the submarine’s lifetime and achieves a higher reward in the long run. Secondly, humans typically divide a complicated problem into simpler tasks and then conquer each task by adopting a suitable strategy. Therefore, humans are able to achieve the designated goals without being misled by biased information. From a technical perspective, we represent an observed environment as a target map (e.g., a navigation map for drivers). We then divide the target map into various spotty regions based on the problem domain. In each region, we define a set of targets, which change dynamically over time. Finally, we train the agent to learn the target map. By redefining the targets in the target map, we are able to adjust the agent’s behaviors without retraining. Fig. 2 illustrates our approach to a complicated environment using a target map and human strategies. We discuss the target map in detail in Sections 3.1 (Target map), 3.2 (Human knowledge integration), 3.3 (Target mask), and 3.4 (Training process).
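As an illustration only, the following sketch shows one plausible way to encode a target map as a per-region mask over the preprocessed game screen; the region boundaries, target labels, and mask values are assumptions made for this example, not the paper’s exact construction (see Sections 3.1–3.4).

    # Hypothetical target map: rectangular regions paired with the targets the
    # agent should pursue while those targets are active.
    import numpy as np

    SCREEN_H, SCREEN_W = 84, 84          # standard Atari preprocessing size

    REGIONS = {
        "surface":  ((0, 0, 20, 84),  {"resurface"}),
        "deep_sea": ((20, 0, 84, 84), {"shoot_enemy", "rescue_diver"}),
    }

    def build_target_mask(active_targets):
        """Highlight regions whose targets are currently active."""
        mask = np.zeros((SCREEN_H, SCREEN_W), dtype=np.float32)
        for (top, left, bottom, right), targets in REGIONS.values():
            if targets & active_targets:
                mask[top:bottom, left:right] = 1.0
        return mask

    # Low-oxygen situation: redefine the active target to "resurface" so the
    # agent's behavior changes without retraining the policy network.
    mask = build_target_mask({"resurface"})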

To facilitate the concept of a target map, we design MPCS to resolve different problem domains. Specifically, MPCS uses a switching gate to control agent behaviors in real time. Furthermore, MPCS can operate in two different settings: a single-agent setting and a multi-agent setting. In the former, MPCS controls the agent by switching among various regional policy networks. In the latter, MPCS enables collaboration among agents by scheduling each agent to follow a regional policy network. We describe MPCS in further detail in Section 3.5.
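The sketch below illustrates the switching-gate idea in the spirit of MPCS; the gate rule and the policy interface (an act method per regional policy network) are assumptions made for illustration, and the paper’s actual design is given in Section 3.5.

    # Hypothetical switching-gate controller over regional policy networks.
    class SwitchingGateController:
        def __init__(self, regional_policies, gate_fn):
            # regional_policies: dict mapping a region name to a trained policy
            #                    exposing act(observation) -> action.
            # gate_fn:           maps the current observation to a region name,
            #                    e.g. "surface" when the oxygen level is low.
            self.policies = regional_policies
            self.gate_fn = gate_fn

        def act(self, observation):
            """Single-agent setting: switch to the regional policy for this step."""
            region = self.gate_fn(observation)
            return self.policies[region].act(observation)

        def act_multi(self, observations, schedule):
            """Multi-agent setting: schedule assigns one region per agent."""
            return {
                agent_id: self.policies[schedule[agent_id]].act(obs)
                for agent_id, obs in observations.items()
            }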

In summary, this study makes the following contributions:

  • We develop Tank Battle as a multi-agent environment to analyze agent behaviors and human-agent interactions in deep RL. We intentionally design Tank Battle in the spirit of an Atari game, so any Atari-related deep RL algorithm can be reused without changing the existing parameter settings (see the interface sketch after this list). The source code of Tank Battle and its sample code are available online at https://git.io/fNKtB.

  • The concept of a target map and the design of MPCS offer broad potential in real-world applications. Firstly, we can combine a target map with any deep RL algorithm to address different problem domains. Secondly, MPCS enables cooperation between humans and agents and among heterogeneous agents in large-scale systems. Finally, MPCS controls agent behaviors in real time without the burden of human feedback and enables nonexperts to preserve a certain level of safety in AI systems.
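To make the “spirit of an Atari game” claim concrete, the following is a hypothetical sketch of an Atari-compatible interface that a custom environment such as Tank Battle could expose; the class and method names are assumptions chosen to mirror the classic Gym contract, and the released environment at https://git.io/fNKtB is the authoritative API.

    # Hypothetical Atari-style wrapper: 84x84 grayscale frames, a discrete
    # action set, and reset/step methods, so existing Atari DQN/A3C code can
    # be reused without changing parameter settings.
    import numpy as np

    class TankBattleEnv:
        NUM_ACTIONS = 6                     # assumed action set size

        def reset(self):
            self._new_episode()
            return self._frame()

        def step(self, action):
            reward, done = self._apply(action)
            return self._frame(), reward, done, {}

        def _frame(self):
            # Grayscale observation matching Atari preprocessing.
            return np.zeros((84, 84), dtype=np.uint8)

        def _new_episode(self):
            pass                            # game-specific state initialisation

        def _apply(self, action):
            return 0.0, False               # game-specific dynamics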

The rest of the paper is organized as follows. Section 2 summarizes related studies and their shortcomings. Section 3 describes the concept of a target map and implementation details of MPCS. Section 4 presents experimental results of our proposed schemes in Atari Seaquest and Tank Battle in two different settings: the single-player setting and the two-player setting. Finally, Section 5 concludes our study.


Related work

Since the advent of the deep Q-network (DQN) [6], there have been various improvements to the DQN structure, such as the double Q-network [16], the dueling network [17], and prioritized experience replay [18]. However, the DQN variants require a lengthy training time to achieve human-level performance in Atari games. To overcome this obstacle, Mnih et al. [15] derive an asynchronous approach based on the actor-critic architecture [19], i.e., A3C, which requires less than 12 hours to surpass 3–4 days of

Proposed scheme

To realize an AI system that satisfies the five conditions listed in Section 1, we design a behavioral control system that is able to adjust agent behaviors to follow arbitrary human learning strategies. Therefore, the design process is simplified to a 1-to-1 workflow, i.e., the input of the workflow involves a problem domain and a set of human goals, and the corresponding output is the expected behaviors of agents. The complete workflow comprises four steps as described in Fig. 3:

  • S1.

Experimental settings

In this section, we perform experiments to evaluate the proposed target map and MPCS in Atari Seaquest and Tank Battle. As explained in Section 2, we use the A3C method as the baseline algorithm for each proposed variant. The network parameters and algorithm settings of the A3C method are the same as in [15] with the following exceptions. We run each A3C variant on an 8-core CPU. The initial learning rate is 0.004. We pass target masks into two convolutional layers: a layer with 16 filters of
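For concreteness, the following PyTorch sketch shows one way a target mask can be fed into an A3C-style network; the filter counts and sizes follow the standard network of [15] (16 filters of 8x8 with stride 4, then 32 filters of 4x4 with stride 2), while concatenating the mask as an extra input channel is an assumption rather than the paper’s exact wiring.

    # Hypothetical A3C network that consumes stacked frames plus a target mask.
    import torch
    import torch.nn as nn

    class A3CWithTargetMask(nn.Module):
        def __init__(self, num_actions, frame_stack=4):
            super().__init__()
            in_channels = frame_stack + 1            # stacked frames + mask channel
            self.conv = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
            )
            self.fc = nn.Sequential(nn.Linear(32 * 9 * 9, 256), nn.ReLU())
            self.policy = nn.Linear(256, num_actions)    # actor head
            self.value = nn.Linear(256, 1)               # critic head

        def forward(self, frames, target_mask):
            # frames: (N, frame_stack, 84, 84); target_mask: (N, 1, 84, 84)
            x = torch.cat([frames, target_mask], dim=1)
            x = self.conv(x).flatten(start_dim=1)
            x = self.fc(x)
            return self.policy(x), self.value(x)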

Conclusions

This paper proposes the novel concept of a target map and the design of MPCS that are able to adjust agent behaviors in real time. By following human learning strategies, agents attain an expected solution in a reasonable training time. Therefore, agents do not get stuck in sub-optimal solutions and act more human-like in complicated environments. The use of a target map and MPCS also eases cooperation between humans and agents and among agents in a large heterogeneous system. Furthermore,

Conflict of interest

The authors have no conflict of interest in publishing this work.

Acknowledgment

The authors wish to thank their colleagues at the Institute for Intelligent Systems Research and Innovation for their comments and helpful discussions.


References (32)

  • N. Lawrence, Discussion of ‘Superintelligence: Paths, dangers, strategies’, 2016. Available from: URL:...
  • D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, D. Mane, Concrete problems in AI safety,...
  • M.G. Bellemare et al., The arcade learning environment: an evaluation platform for general agents, J. Artif. Intell. Res. (2013)
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with deep...
  • V. Mnih et al., Asynchronous methods for deep reinforcement learning, Proceedings of the International Conference on Machine Learning (2016)
  • H.V. Hasselt, Double Q-learning, Neural Inf. Process. Syst. (2010)

    Ngoc Duy Nguyen received the M.S. degree in computer engineering from Sungkyunkwan University, Suwon, South Korea, in 2011. He is currently pursuing a Ph.D. at the Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia. From 2011 to 2016, he was a Project Manager and Researcher at Iritech, Inc., Seoul, South Korea. His work includes research and development of world-leading biometrics and recognition systems. His research interest involves machine learning, optimization problems, and system design. In 2011, Mr. Ngoc Duy Nguyen received the best thesis award funded by the Department of Information and Communication Engineering, Sungkyunkwan University, Suwon, South Korea.

    Thanh Nguyen received the Ph.D. in Mathematics and Statistics from Monash University, Australia in 2013. He is currently a Research Fellow at the Institute for Intelligent Systems Research and Innovation (IISRI), Deakin University, Australia. He has published various peer-reviewed papers in the field of computational and artificial intelligence. His current research interests include applied statistics and machine learning. Dr. Nguyen was a visiting scholar with the Computer Science Department at Stanford University, California, USA in 2015. He is a recipient of an Alfred Deakin Postdoctoral Research Fellowship in 2016.

    Saeid Nahavandi received a Ph.D. from Durham University, U.K. in 1991. He is an Alfred Deakin Professor, Pro Vice-Chancellor (Defence Technologies), Chair of Engineering, and the Director for the Institute for Intelligent Systems Research and Innovation at Deakin University. His research interests include modelling of complex systems, robotics and haptics. He has published over 600 papers in various international journals and conferences. He is a Fellow of Engineers Australia (FIEAust), the Institution of Engineering and Technology (FIET) and Senior member of IEEE (SMIEEE). He is the Co-Editor-in-Chief of the IEEE Systems Journal, Associate Editor of the IEEE/ASME Transactions on Mechatronics, Associate Editor of the IEEE Transactions on Systems, Man and Cybernetics: Systems, and an IEEE Access Editorial Board member.
