Takeshi Shibuya University of Tsukuba shibuya@iit.tsukuba.ac.jp

A fundamental study on the representation of reward for reinforcement learning in dynamic environments + an introduction to rescue simulation

Outline • Reinforcement learning • An interactive learning framework in soft computing • A method to learn in a dynamic environment • RoboCup Rescue: Overview • An application of soft computing Reinforcement learning (theoretical side) Learning in dynamic environment (application side) Rescue simulation

Contents: ・Reinforcement Learning in psychology ・Learning in dynamic environments Reinforcement learning

Reinforcement Learning in psychology Kyoto University If he finishes pushing the numbers in order, he gets a peanut as a reward.

Notable things in Reinforcement Learning • The learner • acquires suitable behavior from the reward alone. • The trainer • does not have to tell the learner how to behave step by step.

What is reinforcement learning (RL)? (Diagram labels: State, reward, Value, Environment, Agent, Actions, Action) • The agent enhances the values of actions that bring rewards. • The agent selects the action whose value is highest.
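The two bullets above can be illustrated with a minimal sketch. It is not the method from this talk: the class name `ValueAgent`, the step size, and the toy environment in `run_demo` are all assumptions made for illustration. The agent keeps one value per (state, action) pair, moves that value toward each reward it receives, and always picks the action with the highest value.

```python
from collections import defaultdict


class ValueAgent:
    """Hypothetical tabular agent: one value per (state, action) pair."""

    def __init__(self, actions, step_size=0.1):
        self.actions = list(actions)
        self.step_size = step_size          # assumed learning rate
        self.values = defaultdict(float)    # unseen pairs start at 0.0

    def select_action(self, state):
        # Greedy choice: the action whose value is highest in this state.
        # Ties are broken by the order of self.actions.
        return max(self.actions, key=lambda a: self.values[(state, a)])

    def reinforce(self, state, action, reward):
        # Move the stored value a small step toward the observed reward.
        key = (state, action)
        self.values[key] += self.step_size * (reward - self.values[key])


def run_demo():
    # Toy environment (an assumption): only "push_in_order" is rewarded.
    rewards = {"push_in_order": 1.0, "push_randomly": 0.0}
    agent = ValueAgent(rewards.keys())
    state = "numbers_shown"
    # Try every action once, then act greedily.
    for action in agent.actions:
        agent.reinforce(state, action, rewards[action])
    for _ in range(20):
        action = agent.select_action(state)
        agent.reinforce(state, action, rewards[action])
    return agent, state
```

Pure greedy selection is used here only because the slide states it; practical agents usually add some exploration so that untried actions are still sampled.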

Research theme: learning in a dynamic environment • How can the agent learn behavior when the suitable action changes? (Figure labels: Action 1, Action 2, Great reward, time)

Research theme: learning in a dynamic environment • Dividing the reward into two parts: • Time-dependent part: to be designed. • Time-independent part: to be learnt.
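One way to picture the split described above is the sketch below. The slide does not say how the two parts are combined, so the additive form (reward = learnt constant part + designed time-dependent part) is an assumption, as are all names (`designed_part`, `SplitRewardLearner`, `true_reward`) and the toy numbers. The learner subtracts the known, designed term from each observed reward and learns only the remaining time-independent part.

```python
def designed_part(action, t, switch_time=50):
    """Known, hand-designed time-dependent term (hypothetical shape)."""
    if action == "action_1":
        return 1.0 if t < switch_time else 0.0
    return 0.0 if t < switch_time else 1.0


class SplitRewardLearner:
    """Learns only the time-independent part of an additively split reward."""

    def __init__(self, actions, step_size=0.2):
        self.actions = list(actions)
        self.step_size = step_size
        self.learned = {a: 0.0 for a in self.actions}

    def observe(self, action, t, reward):
        # Remove the designed term; what is left is the part to be learnt.
        residual = reward - designed_part(action, t)
        self.learned[action] += self.step_size * (residual - self.learned[action])

    def predicted_reward(self, action, t):
        return self.learned[action] + designed_part(action, t)

    def best_action(self, t):
        return max(self.actions, key=lambda a: self.predicted_reward(a, t))


def true_reward(action, t):
    # Toy environment (an assumption): unknown constant plus designed term.
    base = {"action_1": 0.5, "action_2": 0.3}
    return base[action] + designed_part(action, t)


def train_before_switch():
    learner = SplitRewardLearner(["action_1", "action_2"])
    for t in range(50):  # every sample is taken before the switch at t = 50
        action = learner.actions[t % 2]
        learner.observe(action, t, true_reward(action, t))
    return learner
```

In this toy setting, `train_before_switch().best_action(60)` already prefers `action_2`, even though no reward was sampled after the switch, because the time-dependent change is carried entirely by the designed term.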

Research theme: learning in a dynamic environment • The probability of selecting the EAST action increases. • The proposed method enables the agent to adapt to the change of the environment. • The action-selection probability switches after the environment changes.

Contents: ・Overview of RoboCup Rescue ・Demonstration RoboCup Rescue

Leagues in RoboCup Ultimate goal of the RoboCup: "By mid-21st century, a team of fully autonomous humanoid robot soccer players shall win the soccer game, comply with the official rule of the FIFA, against the winner of the most recent World Cup." (from official site) • Soccer • Robot leagues • Simulation leagues • Rescue • Robot leagues • Simulation leagues • 2D • 3D

RoboCup Rescue • The purpose: (1) to develop simulators that form the infrastructure of the simulation system and emulate realistic phenomena predominant in disasters; (2) to develop intelligent agents and robots that are given the capabilities of the main actors in a disaster response scenario. (from official site) Agent simulation Virtual Robots simulation (Powered by USARSim)

RoboCup Rescue: The agent simulation Buildings: fire, collapse. Roads: traffic movement, blocked roads due to rubble, etc. Emergency services: fire brigades, ambulance teams, police forces.

Agent's observation and action

Demonstration/ movie

RoboCup Rescue + RL (Team MRL) • Reinforcement learning is employed for controlling the agents. • The details are not shown in the paper. • Team MRL was the champion of RoboCup 2007 (total: 8 teams). Omid Aghazadeh et al., "Implementing Parametric Reinforcement Learning in Robocup Rescue Simulation", RoboCup 2007: Robot Soccer World Cup XI, Lecture Notes in Computer Science, vol. 5001, pp. 409-416, 2008, DOI: 10.1007/978-3-540-68847-1_42

Summary • The following topics were overviewed: • Reinforcement learning • The framework and some research themes • RoboCup Rescue • Aims of some leagues and demonstrations

Target of learning: an unknown constant quantity; a known time-varying quantity

Reinforcement Learning as an engineering approach State reward Environment Agent (learner) Action

Dividing the reward into two parts: • Time-dependent part: to be designed. • Time-independent part: to be learnt.

Research Theme 1: learning in a partially observable environment • If the agent can observe four state variables (the angle and angular velocity of each joint), it can control the system. • If the agent cannot use the velocity information, it cannot determine the direction in which to apply torque. (Figure labels: Torque, Angular velocity)
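The problem described above can be made concrete with a small sketch. The state layout, the function names, and the numeric values are assumptions chosen for illustration. Two states that share the same joint angles but have opposite angular velocities call for opposite torques, yet they produce the same observation once velocity is hidden.

```python
def observe(state, use_velocity):
    """Return what the agent sees.

    state is assumed to be (angle_1, velocity_1, angle_2, velocity_2).
    """
    angle_1, velocity_1, angle_2, velocity_2 = state
    if use_velocity:
        return (angle_1, velocity_1, angle_2, velocity_2)
    return (angle_1, angle_2)  # velocities are hidden from the agent


# Same joint angles, opposite directions of motion (illustrative numbers).
swinging_left = (0.3, -1.0, 0.1, -0.5)
swinging_right = (0.3, 1.0, 0.1, 0.5)


def distinguishable(state_a, state_b, use_velocity):
    # A policy that maps observations to torques can only treat two states
    # differently if their observations differ.
    return observe(state_a, use_velocity) != observe(state_b, use_velocity)
```

With `use_velocity=False` the two states collapse into one observation, so any policy that looks only at the current observation must apply the same torque in both.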

Research Theme 1: learning in a partially observable environment • Complex-valued reinforcement learning enables the agent to overcome the problem by using the context of its behavior. (Figure: swing-up task; axis from -100% to 100%)

reward function
