Reinforcement Learning Applications in Robotics Gerhard Neumann, Seminar A, SS 2006
Overview • Policy Gradient Algorithms • RL for Quadruped Locomotion • PEGASUS Algorithm • Autonomous Helicopter Flight • High Speed Obstacle Avoidance • RL for Biped Locomotion • Poincare-Map RL • Dynamic Planning • Hierarchical Approach • RL for Acquisition of Robot Stand-Up Behavior
RL for Quadruped Locomotion [Kohl04] • Simple Policy-Gradient Example • Optimize Gait for Sony-Aibo Robot • Use Parameterized Policy • 12 Parameters • Front + rear locus (height, x-pos, y-pos) • Height of the front and the rear of the body • …
Quadruped Locomotion • Policy: No notion of state – open-loop control! • Start with an initial policy θ = (θ_1, …, θ_N) • Generate t = 15 random policies R_i near the current policy • R_i is {θ_1 + Δ_1, …, θ_N + Δ_N}, with each Δ_j chosen randomly from {−ε_j, 0, +ε_j} • Evaluate the value of each policy on the real robot • Estimate the gradient for each parameter • Update the policy in the direction of the gradient
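The loop on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the original implementation: `toy_speed` is a hypothetical stand-in for measuring walking speed on the robot, and the step size `eta`, the perturbation sizes `eps`, and the two-parameter policy are assumptions chosen for illustration. Each parameter's gradient component is estimated by comparing the average score of the candidates that moved it up, down, or not at all.

```python
import math
import random

def estimate_gradient(theta, evaluate, eps, t=15, rng=random):
    """Finite-difference gradient estimate from t randomly perturbed policies."""
    n = len(theta)
    # Each candidate shifts every parameter by -eps_j, 0, or +eps_j.
    deltas = [[rng.choice((-1, 0, 1)) for _ in range(n)] for _ in range(t)]
    scores = [evaluate([theta[j] + d[j] * eps[j] for j in range(n)]) for d in deltas]

    grad = []
    for j in range(n):
        groups = {-1: [], 0: [], 1: []}
        for d, s in zip(deltas, scores):
            groups[d[j]].append(s)
        if not groups[1] or not groups[-1]:
            grad.append(0.0)  # cannot compare both directions for this parameter
            continue
        avg_plus = sum(groups[1]) / len(groups[1])
        avg_minus = sum(groups[-1]) / len(groups[-1])
        avg_zero = sum(groups[0]) / len(groups[0]) if groups[0] else -math.inf
        # Leave the parameter alone if not moving it scored best.
        grad.append(0.0 if avg_zero > max(avg_plus, avg_minus) else avg_plus - avg_minus)
    return grad

def update_policy(theta, grad, eta):
    """Move a fixed distance eta along the normalized gradient estimate."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(theta)
    return [p + eta * g / norm for p, g in zip(theta, grad)]

# Hypothetical stand-in for a walking-speed measurement: best at GOAL.
GOAL = [1.0, -1.0]

def toy_speed(params):
    return -sum((p - g) ** 2 for p, g in zip(params, GOAL))

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(50):
    grad = estimate_gradient(theta, toy_speed, eps=[0.1, 0.1], rng=rng)
    theta = update_policy(theta, grad, eta=0.1)
```

Because the policy is open loop, `evaluate` only needs the parameter vector; on the real robot it would be replaced by a timed walk.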
Quadruped Locomotion • Estimation of the Walking Speed of a policy • Automated process of the Aibos • Each Policy is evaluated 3 times • One Iteration (3 x 15 evaluations) takes 7.5 minutes
Quadruped Gait: Results • Better than the best known gait for AIBO!
PEGASUS [Ng00] • Policy Gradient Algorithms: • Use a finite time horizon and evaluate the value of the policy • Value of a policy in a stochastic environment is hard to estimate • => Stochastic Optimization Process • PEGASUS: • For all policy evaluation trials use a fixed set of start states (scenarios) • Use "fixed randomization" for policy evaluation • Only works for simulations! • The same conditions for each evaluation trial • => Deterministic Optimization Process! • Can be solved by any optimization method • Commonly Used: Gradient Ascent, Random Hill Climbing
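The "fixed randomization" idea can be illustrated with a toy simulator. The sketch below is an assumption-laden example, not the PEGASUS reference code: the 1-D dynamics in `rollout`, the single-gain policy, and all function names are hypothetical. The point is that the start states and noise sequences are drawn once, so the value estimate becomes a deterministic function of the policy parameter and can be handed to a plain optimizer such as random hill climbing.

```python
import random

def make_scenarios(m, horizon, seed=0):
    """Draw m start states and noise sequences once; they are reused for every policy."""
    rng = random.Random(seed)
    return [(rng.uniform(-1.0, 1.0), [rng.gauss(0.0, 1.0) for _ in range(horizon)])
            for _ in range(m)]

def rollout(gain, start, noise):
    """Toy stochastic system: drive x toward 0 with the linear policy u = -gain * x."""
    x, total = start, 0.0
    for w in noise:
        u = -gain * x
        x = x + 0.1 * u + 0.05 * w  # the noise comes from the scenario, not a fresh draw
        total -= x * x              # quadratic penalty, finite horizon
    return total

def pegasus_value(gain, scenarios):
    """Average return over the fixed scenarios: deterministic in `gain`."""
    return sum(rollout(gain, s, n) for s, n in scenarios) / len(scenarios)

def random_hill_climbing(value, gain, steps=200, step_size=0.5, seed=1):
    rng = random.Random(seed)
    best = value(gain)
    for _ in range(steps):
        candidate = gain + rng.uniform(-step_size, step_size)
        v = value(candidate)
        if v > best:  # comparison is meaningful because value() has no sampling noise
            gain, best = candidate, v
    return gain, best

scenarios = make_scenarios(m=20, horizon=50)
evaluate = lambda g: pegasus_value(g, scenarios)
best_gain, best_value = random_hill_climbing(evaluate, gain=0.0)
```

Evaluating the same gain twice returns exactly the same number, which is what turns the stochastic optimization problem into a deterministic one.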
Autonomous Helicopter Flight [Ng04a, Ng04b] • Autonomously learn to fly an unmanned helicopter • A $70,000 helicopter => exploration on the real system can be catastrophic! • Learn the dynamics from observations of a human pilot • Use PEGASUS to: • Learn to hover • Learn to fly complex maneuvers • Inverted helicopter flight
Helicopter Flight: Model Identification • 12-dimensional state space • World Coordinates (Position + Rotation) + Velocities • 4-dimensional actions • 2 rotor-plane pitch • Rotor blade tilt • Tail rotor tilt • Actions are selected every 20 ms
Helicopter Flight: Model Identification • Human pilot flies helicopter, data is logged • 391 s of training data • reduced to 8 dimensions (position can be estimated from velocities) • Learn transition probabilities P(s_{t+1} | s_t, a_t) • supervised learning with locally weighted linear regression • Model Gaussian noise for stochastic model • Implemented a simulator for model validation
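Locally weighted linear regression fits a separate weighted line around every query point. The following is a minimal one-dimensional sketch under stated assumptions: a Gaussian kernel, a hand-picked `bandwidth`, and synthetic data. In the helicopter setting the input would be the state-action pair and the output the next state, which is not reproduced here.

```python
import math

def lwlr_predict(query, xs, ys, bandwidth=0.5):
    """Predict y at `query` from a weighted least-squares line fitted around it."""
    # Gaussian kernel: training points near the query dominate the fit.
    weights = [math.exp(-((x - query) ** 2) / (2.0 * bandwidth ** 2)) for x in xs]
    w_sum = sum(weights)
    x_mean = sum(w * x for w, x in zip(weights, xs)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, ys)) / w_sum
    sxx = sum(w * (x - x_mean) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_mean) * (y - y_mean) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0
    return y_mean + slope * (query - x_mean)

# Synthetic data on the line y = 2x + 1 (illustrative only).
xs = [i / 10.0 for i in range(11)]
ys = [2.0 * x + 1.0 for x in xs]
prediction = lwlr_predict(0.5, xs, ys)
```

The residuals of such a fit give an estimate of the noise variance, which is one way to obtain the Gaussian noise term of a stochastic model.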
Helicopter Flight: Hover Control • Desired hovering position : • Very Simple Policy Class • Edges are obtained from human prior knowledge • Learns more or less linear gains of the controller • Quadratic Reward Function: • penalty for deviation from the desired position and orientation
Helicopter Flight: Hover Control • Results: • Better performance than Human Expert (red)
Helicopter Flight: Flying maneuvers • Fly 3 maneuvers from the most difficult RC helicopter competition class • Trajectory Following: • penalize the distance from the projected point on the trajectory • Additional reward for making progress along the trajectory
Helicopter Flight: Results • Videos: Video 1, Video 2
Helicopter Flight: Inverted Flight • Very difficult for humans • Unstable! • Re-collect data for inverted flight • Use the same methods as before • Learned in 4 days! • from data collection to flight experiment • Stable inverted flight controller • sustained position • Video
High Speed Obstacle Avoidance [Michels05] • Obstacle Avoidance with RC car in unstructured Environments • Estimate depth information from monocular cues • Learn controller with PEGASUS for obstacle avoidance • Graphical Simulation : Does it work in the real environment?
Estimating Depth Information • Supervised Learning • Divide image into 16 horizontal stripes • Use features of each stripe and of its neighboring stripes as input vectors • Target values (shortest distance within a stripe) come either from simulation or from laser range finders • Linear Regression • Output of the vision system • Angle of the stripe with the largest distance • Distance of that stripe
Obstacle Avoidance: Control • Policy: 6 Parameters • Again, a very simple policy is used • Reward: • Deviation from the desired speed, number of crashes
Obstacle Avoidance: Results • Using a graphical simulation to train the vision system also works for outdoor environments • Video
RL for Biped Robots • Often used only for simplified planar models • Poincare-Map based RL [Morimoto04] • Dynamic Planning [Stilman05] • Other Examples for RL in real robots: • Strongly Simplify the Problem: [Zhou03]
Poincare Map-Based RL • Improve walking controllers with RL • Poincare map: Intersection-points of an n-dimensional trajectory with an (n-1) dimensional Hyperplane • Predict the state of the biped a half cycle ahead at the phases :
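A Poincaré map can be illustrated on a system much simpler than a biped. The sketch below is a toy example with assumed dynamics (a planar spiral that rotates by a fixed angle and shrinks slightly each step); the section y = 0, the function name, and all constants are hypothetical. It records where the 2-dimensional trajectory crosses the 1-dimensional section, so the continuous motion is reduced to a discrete map from one crossing to the next.

```python
import math

def poincare_crossings(x, y, steps=300, dtheta=2.0 * math.pi / 100.0, shrink=0.999):
    """Return the x-coordinates where the trajectory crosses y = 0 going upward."""
    c, s = math.cos(dtheta), math.sin(dtheta)
    hits = []
    for _ in range(steps):
        # One simulation step: rotate counterclockwise and contract slightly.
        nx = shrink * (c * x - s * y)
        ny = shrink * (s * x + c * y)
        if y < 0.0 <= ny:                 # upward crossing of the section
            frac = -y / (ny - y)          # linear interpolation to the crossing
            hits.append(x + frac * (nx - x))
        x, y = nx, ny
    return hits

hits = poincare_crossings(math.cos(0.3), math.sin(0.3))
# The Poincare map here is a simple contraction: each crossing is a fixed
# fraction of the previous one.
ratios = [b / a for a, b in zip(hits, hits[1:])]
```

In the walking controller the same idea is applied to the gait cycle: the learner only has to reason about the state at selected phases instead of the full continuous trajectory.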
Poincare Map • Learn Mapping: • Input Space : x = (d, d') • Distance between stance foot and body • Action Space : • Modulate via-points of the joint trajectories • Function Approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid
Via Points • Nominal Trajectories from human walking patterns • Control output is used to modulate via points with a circle • Hand selected via-points • Increment via-points of one joint by the same amount
Learning the Value function • Reward Function: • 0.1 if height of the robot > 0.35m • -0.1 else • Standard SemiMDP update rules • Only need to learn the value function for and • Model-Based Actor-Critic Approach • A … Actor • Update Rule:
Results: • Stable walking performance after 80 trials • Beginning of Learning • End of Learning
Dynamic Programming for Biped Locomotion [Stilman05] • 4-link planar robot • Dynamic Programming for Reduced Dimensional Spaces • Manual temporal decomposition of the problem into phases of single and double support • Use intuitive reductions of the state space for both phases
State-Increment Dynamic Programming • 8-dimensional state space: • Discretize the state space with a coarse grid • Use Dynamic Programming: • Interval ε is defined as the minimum time interval required for any state index to change
State Space Considerations • Decompose into 2 state space components (DS + SS) • Important distinctions between the dynamics of DS and SS • Periodic System: • DP cannot be applied separately to state space components • Establish mapping between the components for the DS and SS transition
State Space Reduction • Double Support: • Constant step length (df) • Cannot change during DS • Can change after robot completes SS • Equivalent to 5-bar linkage model • Entire state space can be described by 2 DoF (use k1 and k2) • 5-D state space • 10x16x16x12x12 grid => 368640 states
State Space Reduction • Single Support • Compass 2-link Model • Assume k1 and k2 are constant • Stance knee angle k1 has small range in human walking • Swing knee k2 has strong effect on df, but can be prescribed in accordance with h2 with little effect on the robot's CoM • 4-D state space • 35x35x18x18 grid => 396900 states
State-Space Reduction • Phase Transitions • DS to SS transition occurs when the rear foot leaves the ground • Mapping: • SS to DS transition occurs when the swing leg makes contact • Mapping:
Action Space, Rewards • Use discretized torques • DS: hip and both knee joints can accelerate the CoM • Fix hip action to zero to gain better resolution for the knee joints • Discretize 2-D action space from ±5.4 Nm into 7x7 intervals • SS: Only choose hip torque • 17 intervals in the range of ±1.8 Nm • States x Actions • 368640x49 + 396900x17 = 24810660 cells (!!) • Reward:
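As a quick check of the table size, the state and action counts can be recomputed from the grid resolutions given on the State Space Reduction slides (10x16x16x12x12 for double support, 35x35x18x18 for single support). The variable names below are illustrative only.

```python
from math import prod

ds_grid = (10, 16, 16, 12, 12)   # double-support grid, 5-D
ss_grid = (35, 35, 18, 18)       # single-support grid, 4-D

ds_states = prod(ds_grid)        # 368640
ss_states = prod(ss_grid)        # 396900

ds_actions = 7 * 7               # two knee torques, 7 intervals each
ss_actions = 17                  # hip torque only

cells = ds_states * ds_actions + ss_states * ss_actions
print(ds_states, ss_states, cells)  # 368640 396900 24810660
```

With these grid sizes the state-action table has roughly 25 million cells, about 73% of them in the double-support phase.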
Results • 11 hours of computation • The computed policy locates a limit cycle through the space.
Performance under error • Alter different properties of the robot in simulation • Do not relearn the policy • A wide range of disturbances is accepted • Even if the used model of the dynamics is incorrect! • Wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle
Learning of a Stand-up Behavior [Morimoto00] • Learning to stand up with a 3-link planar robot • 6-D state space • Angles + Velocities • Hierarchical Reinforcement Learning • Task decomposition by sub-goals • Decompose task into: • Nonlinear problem in a lower-dimensional space • Nearly linear problem in a high-dimensional space
Upper-level Learning • Coarse Discretization of postures • No speed information in the state space (3-d state space): • Actions: Select sub-goals • New Sub-goal
Upper-Level Learning • Reward Function: • Reward success of stand-up • Reward also for the success of a sub-goal • Sub-goals that are easier to reach from the current state are preferred • Use Q(lambda) learning to learn the sequence of sub-goals
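For readers unfamiliar with Q(lambda), here is a minimal tabular sketch in the Watkins style with replacing eligibility traces. It is not the stand-up task: the environment is a hypothetical chain of five coarse "postures" where action 1 moves one step toward the goal and action 0 moves one step back, and all hyperparameters are assumptions.

```python
import random

def choose(q, s, epsilon, rng):
    """Epsilon-greedy action selection with random tie-breaking."""
    if rng.random() < epsilon:
        return rng.randrange(2)
    best = max(q[s])
    return rng.choice([a for a in range(2) if q[s][a] == best])

def q_lambda_chain(n_states=5, episodes=200, alpha=0.2, gamma=0.9,
                   lam=0.8, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    moves = (-1, +1)                      # action 0: back, action 1: forward
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        trace = [[0.0, 0.0] for _ in range(n_states)]
        s = 0
        a = choose(q, s, epsilon, rng)
        for _ in range(100):              # cap on episode length
            s2 = min(max(s + moves[a], 0), goal)
            r = 1.0 if s2 == goal else 0.0
            a2 = choose(q, s2, epsilon, rng)
            next_is_greedy = q[s2][a2] == max(q[s2])
            target = r if s2 == goal else r + gamma * max(q[s2])
            delta = target - q[s][a]
            trace[s][a] = 1.0             # replacing trace
            for i in range(n_states):
                for j in range(2):
                    q[i][j] += alpha * delta * trace[i][j]
                    # Watkins's rule: traces survive only while acting greedily.
                    trace[i][j] = gamma * lam * trace[i][j] if next_is_greedy else 0.0
            if s2 == goal:
                break
            s, a = s2, a2
    return q

q = q_lambda_chain()
greedy_policy = [max(range(2), key=lambda a: q[s][a]) for s in range(4)]
```

The traces let a single success propagate back along the whole sequence of choices, which is useful when reward arrives only at the end of a long sequence of sub-goals.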
Lower-level learning • Lower level is free to choose at which speed to reach sub-goal (desired posture) • 6-D state space • Use Incremental Normalized Gaussian networks (ING-net) as function approximator • RBF network with rule for allocating new RBF-centers • Action Space: • Torque-Vector:
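The core of a normalized Gaussian network can be sketched as follows. This is a minimal example under assumptions: diagonal widths, hand-placed centers, and a 2-D input instead of the 6-D state. The incremental rule for allocating new centers is not shown. Each Gaussian activation is divided by the sum of all activations, so the basis functions form a partition of unity and the output is a smoothly interpolated weighted average.

```python
import math

def normalized_gaussian_basis(x, centers, widths):
    """Gaussian activations divided by their sum, so the basis values add up to 1."""
    acts = [
        math.exp(-0.5 * sum(((xi - ci) / wi) ** 2 for xi, ci, wi in zip(x, c, w)))
        for c, w in zip(centers, widths)
    ]
    total = sum(acts)
    return [a / total for a in acts]

def network_output(x, centers, widths, weights):
    """Linear combination of the normalized basis functions."""
    basis = normalized_gaussian_basis(x, centers, widths)
    return sum(wk * bk for wk, bk in zip(weights, basis))

# Illustrative 2-D example with three hand-placed units.
centers = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
widths = [[0.5, 0.5]] * 3
basis = normalized_gaussian_basis([0.2, 0.1], centers, widths)
value = network_output([0.2, 0.1], centers, widths, [2.0, 2.0, 2.0])
```

Because of the normalization, the network extrapolates with the value of the nearest unit instead of decaying to zero far from all centers.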
Lower-level learning • Reward: • -1.5 if the robot falls down • Continuous time actor critic learning [Doya99] • Actor and Critic are learnt with ING-nets. • Control Output: • Combination of linear servo controller and non-linear feedback controller
Results: • Simulation Results • Hierarchical architecture 2x faster than plain architecture • Real Robot • Before Learning • During Learning • After Learning • Learned on average in 749 trials (7/10 learning runs) • Used on average 4.3 subgoals
The end • For People who are interested in using RL: • RL-Toolbox • www.igi.tu-graz.ac.at/ril-toolbox • Thank you
Literature • [Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2005 • [Ng00] PEGASUS: A Policy Search Method for Large MDPs and POMDPs, A. Ng and M. Jordan, 2000 • [Ng04a] Autonomous Inverted Helicopter Flight via Reinforcement Learning, A. Ng et al., 2004 • [Ng04b] Autonomous Helicopter Flight via Reinforcement Learning, A. Ng et al., 2004 • [Michels05] High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005 • [Morimoto04] A Simple Reinforcement Learning Algorithm For Biped Walking, J. Morimoto and C. Atkeson, 2004
Literature • [Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005 • [Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000 • [Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998 • [Zhou03] Dynamic Balance of a Biped Robot using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003 • [Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999