Test Recap

Learning Locomotion: Extreme Learning For Extreme Terrain • CMU: Chris Atkeson, Drew Bagnell, James Kuffner, Martin Stolle, Hanns Tappeiner, Nathan Ratliff, Joel Chestnutt, Michael Dille, Andrew Maas • CMU Robotics Institute

Test Recap • Test 1: Everything went well. • Test 2: Straight approach worked well. Side approach: bad plan, bug. • Test 3: Dependence on power supply. Big step up. • Test 4: Need to handle dog variations.

Test 0: Establish Learning Context

Test 1: 5 trials (x10 speedup)

Reinforcement Learning: Reward

Hierarchical Approach • Footstep planner • Body and leg trajectory planner • Execution

Footstep Planner in Action [Figure panels: terrain, cost map, footstep plan]

Global Footstep Path Planning • Use A* to plan a safe sequence of footsteps from the current robot configuration to the goal. • Try to stay as close to that plan as possible during the trial; replan when (a) we measure that we have deviated from the planned path by a certain amount, or (b) we tip over while taking a step.
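
The two replanning triggers above can be written as a single predicate. This is only a minimal sketch: the function name, the 2D position arguments, and the default threshold are hypothetical, since the slides say only "a certain amount".

```python
import math

def should_replan(actual_xy, planned_xy, tipped_over, max_deviation=0.03):
    """Return True when either replanning trigger from the slide fires.

    max_deviation is a placeholder value (assumed to be in metres);
    the slides do not state the actual threshold.
    """
    deviation = math.dist(actual_xy, planned_xy)
    return tipped_over or deviation > max_deviation
```

In use, the predicate would be evaluated after each step, with the planner rerun from the current configuration whenever it returns True.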

A* Details • Cost for each foot location and orientation is pre-computed at startup (usually while the robot is calibrating). • Cost includes angle of terrain, flatness of terrain, distance to any drop-offs, and a heuristic measure of whether the knee will hit any terrain at that position and orientation. • Heuristic is currently Euclidean distance to the goal.
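
As a rough illustration of A* over a precomputed cost map with a Euclidean heuristic, here is a minimal grid sketch. It is not the team's planner: the 4-connected grid, the unit step cost, and the use of infinity for unsafe cells are all simplifying assumptions, and a real footstep search would run over foot placements and orientations rather than grid cells.

```python
import heapq
import math

def astar(cost, start, goal):
    """A* over a 2D grid of precomputed, non-negative per-cell costs.

    Cells with infinite cost are treated as unsafe. Each move costs
    1 + cost[cell], so Euclidean distance never overestimates.
    """
    rows, cols = len(cost), len(cost[0])

    def h(cell):
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1])

    frontier = [(h(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if g > best[cell]:
            continue  # stale queue entry
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if not (0 <= nr < rows and 0 <= nc < cols):
                continue
            if not math.isfinite(cost[nr][nc]):
                continue  # unsafe cell
            ng = g + 1.0 + cost[nr][nc]
            if ng < best.get(nxt, math.inf):
                best[nxt] = ng
                parent[nxt] = cell
                heapq.heappush(frontier, (ng + h(nxt), ng, nxt))
    return None  # no safe path exists
```

Because the cost grid is computed once up front, each search only performs lookups, which matches the slide's point about precomputing costs during calibration.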

Action Model • A base foot location is based on the body position and orientation. • From that base location, a reference set of actions is applied. • 8 actions for each front foot, 1 action for each rear foot. • The front feet “lead” the robot, with the rear feet just following along. [Figure labels: robot body (from above), base foot location, reference actions]

Adapting the Reference Actions • A local search is performed for each action to find a safe location that is near the reference action and still within the reachability of the robot. [Figure label: reference actions]
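
One way to picture this local search is a scan of the cells around the reference action that keeps the nearest safe, reachable one. The sketch below assumes a grid cost map, a hypothetical `is_reachable` callback standing in for the robot's kinematic limits, and hypothetical `radius` and `max_cost` parameters; none of these come from the slides.

```python
import math

def adapt_action(reference, cost, is_reachable, radius=2, max_cost=1.0):
    """Return the safe, reachable cell nearest to `reference`, or None.

    A cell counts as safe when its cost is at most max_cost. Ties in
    distance are broken by lower cost.
    """
    rows, cols = len(cost), len(cost[0])
    r0, c0 = reference
    candidates = []
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            if cost[r][c] <= max_cost and is_reachable((r, c)):
                dist = math.hypot(r - r0, c - c0)
                candidates.append((dist, cost[r][c], (r, c)))
    return min(candidates)[2] if candidates else None
```

If the reference cell itself is safe and reachable it is returned unchanged, so the adaptation only moves a footstep when it has to.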

Decreasing Safety Margins [Figure panels: 2 cm margin, 0 cm margin]

Effect on Paths [Figure panels: 2 cm margin, 0 cm margin]

Foot and Body Trajectories • Foot trajectory based on convex hull of intervening terrain. • Body trajectory is newly created at each step, based on the next two steps in the path, and has two stages: (1) move into the triangle of support before lifting the foot; (2) stay within the polygon of support and move forward while the foot is in flight.
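
The first stage depends on knowing whether the body's ground projection lies inside the triangle formed by the three stance feet. A minimal sketch of that check, using the sign of 2D cross products, follows. The function names are hypothetical, and a real controller would likely add a safety margin inside the triangle.

```python
def cross(o, a, b):
    """Z component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def in_support_triangle(point, feet):
    """True if `point` (x, y) lies inside or on the triangle `feet`.

    `feet` holds the three stance-foot positions as (x, y) pairs,
    in either winding order.
    """
    a, b, c = feet
    d1, d2, d3 = cross(a, b, point), cross(b, c, point), cross(c, a, point)
    has_negative = min(d1, d2, d3) < 0
    has_positive = max(d1, d2, d3) > 0
    # Inside means the point is on the same side of all three edges.
    return not (has_negative and has_positive)
```

For example, with stance feet at (0, 0), (4, 0) and (0, 4), the point (1, 1) is inside the triangle and (3, 3) is not.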

Interface

Foot Contact Detection • Foot sensor (not reliable). • Predicted foot below terrain. • Z velocity approximately zero and Z acceleration positive. • Compliance? • IMU signals? Not for us. • Motor torque?
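
Two of the cues listed above (predicted foot below terrain; Z velocity near zero with positive Z acceleration) can be combined into a simple detector. The sketch below assumes the cues are OR-ed together and uses a hypothetical velocity threshold; the slides do not say how the cues were actually combined or tuned.

```python
def foot_in_contact(foot_z, terrain_z, z_vel, z_acc, vel_eps=0.01):
    """Heuristic contact test built from two cues on the slide.

    vel_eps is a placeholder threshold for "approximately zero".
    """
    # Cue 1: the predicted foot position is at or below the terrain.
    below_terrain = foot_z <= terrain_z
    # Cue 2: the foot has stopped descending and is being pushed up.
    stopped = abs(z_vel) < vel_eps and z_acc > 0.0
    return below_terrain or stopped
```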

Test 2: 5 trials (x10 speedup)

Why did we fail?

[Plots: IMU rx, ry; fl_rx; fr_rx; hr_rx. Blue = actual, red = desired.]

Why did we fail?

Reinforcement Learning: Punishment

Test 3 • Software was set for external power rather than battery. • Initial step up was higher than expected (initial board level).

Varying Speed [Plot: front left hip ry. Blue = actual, red = desired.]

Saturation: Front left hip ry [Two plots. First: blue = actual, red = desired. Second: blue = motor, red = Is_Saturated.]

Slow Speed: Front left hip ry [Two plots. First: blue = actual, red = desired. Second: blue = motor, red = saturated?]

Fixes • Manipulate clock (works for static walking). • Bang-bang servo (allows dynamic locomotion).
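
For readers unfamiliar with the term, a bang-bang servo always commands full effort in the direction of the error rather than an effort proportional to it. The sketch below is the generic textbook rule, not the team's controller; the deadband and the command limit are hypothetical parameters.

```python
def bang_bang(desired, actual, max_command=1.0, deadband=0.0):
    """Generic bang-bang rule: full command toward the target.

    Returns 0.0 inside the deadband to avoid chattering around the
    set point; otherwise +max_command or -max_command.
    """
    error = desired - actual
    if abs(error) <= deadband:
        return 0.0
    return max_command if error > 0 else -max_command
```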

Power Supply Axed To Avoid Further Errors (Secondary Reinforcer For Dog)

Test 4

Test 4: What we learned • Need to be robust to vehicle variation: • Fore/aft effective center of mass (tipping back) • Side/side effective center of mass (squatting) • Leg swing

Plans For Future Development • Learn To Make Better Plans • Learn To Plan From Observation • Memory-Based Policy Learning • Dynamic Locomotion

Planning: What can be learned? • Better primitives to plan with • Better robot/environment models • Planning parameters • Better models of robot capabilities • Better terrain and action cost functions • Better failure models and models of risk • Learn how to plan: bias to plan better and faster • How: Policy search, parameter optimization, …

Learn To Make Better Plans • It takes several days to manually tune a planner. • We will use policy search techniques to automate this tuning process. • The challenge is to do it efficiently.

Learn To Plan From Observation • Key issue: Do we learn cost functions or value functions?

Learn Cost Functions: Maximum Margin Planning (MMP) Algorithm • Assumption: cost function is formed as a linear combination of feature maps • Training examples: Run current planner through a number of terrains and take resulting body trajectories as example paths

Linear combination of features [Figure: feature maps (tree detector, open space, smoothed trees, slope) combined with weights w1, w2, w3, w4]

MMP Algorithm • Until convergence do: • 1. Compute cost maps as a linear combination of features. • 2. Technical step: slightly increase the cost of cells you want the planner to plan through. This makes it more difficult for the planner to be right; training the planner on harder problems helps ensure good test performance. • 3. Plan through these cost maps using D*. • 4. Update based on mistakes: if the planned path doesn't match the example, raise the cost of features found along the planned path and lower the cost of features found along the example path.
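
The loop above can be sketched in a few lines if the planner is replaced by a stand-in. In this toy version the "planner" simply picks the cheapest path from an explicit candidate list (an assumption made to keep the example self-contained; the slides use D*), and the learning rate, margin, initial weights, and non-negativity clamp are all hypothetical choices.

```python
def path_features(path, features):
    """Sum the per-cell feature vectors along a path."""
    n = len(next(iter(features.values())))
    total = [0.0] * n
    for cell in path:
        for i, f in enumerate(features[cell]):
            total[i] += f
    return total

def mmp_train(features, candidates, example, lr=0.1, margin=0.5, iters=100):
    """Toy MMP-style training on one example path.

    features:   dict mapping cell -> feature vector
    candidates: list of paths the stand-in planner may choose from
    example:    the demonstrated path (must be one of candidates)
    """
    n = len(next(iter(features.values())))
    w = [1.0] * n
    on_example = set(example)
    for _ in range(iters):
        def augmented_cost(path):
            base = sum(wi * fi
                       for wi, fi in zip(w, path_features(path, features)))
            # Loss augmentation: example cells cost slightly more.
            return base + margin * sum(1 for c in path if c in on_example)

        planned = min(candidates, key=augmented_cost)
        if planned == example:
            break  # planner reproduces the example with a margin
        fp = path_features(planned, features)
        fe = path_features(example, features)
        # Raise weights of features on the planned path, lower those
        # on the example path; clamp to keep costs non-negative.
        w = [max(0.0, wi + lr * (p - e)) for wi, p, e in zip(w, fp, fe)]
    return w
```

With two features (say "grass" and "rock") and an example path that crosses only rock cells, training lowers the rock weight and raises the grass weight until the stand-in planner prefers the rock path even with the augmented costs.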

MMP Algorithm Properties • Algorithm equivalent to a convex optimization problem => no local minima • Rapid (linear) convergence • Maximum margin interpretation (via the "loss-augmentation" step 2) • Machine learning theoretic guarantees in online and offline settings • Can use boosting to solve feature selection

Learned Cost Maps

Key Issue: Two Approaches to Control • Learn the value function • Build a planner

Why Values? • Captures the entire cost-to-go: follow the value-function with just one-step look ahead for optimality (no planning necessary) • Learnable in principle: use regression techniques to approximate cost-to-go
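
The "one-step look ahead" idea can be made concrete: given a cost-to-go function, the controller picks the action minimizing immediate cost plus the value of the resulting state. The sketch below is a generic illustration; the function names and the callback-style interface are assumptions for the example.

```python
def greedy_action(state, actions, step, step_cost, value):
    """Pick the action minimizing step cost plus cost-to-go."""
    return min(actions,
               key=lambda a: step_cost(state, a) + value(step(state, a)))

def follow_values(state, goal, actions, step, step_cost, value,
                  max_steps=100):
    """Roll out the greedy policy until the goal or the step limit."""
    path = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        state = step(state, greedy_action(state, actions, step,
                                          step_cost, value))
        path.append(state)
    return path
```

On a 1D line with goal 5, unit step cost, actions -1 and +1, and value |5 - s|, the rollout from 0 walks straight to the goal with no search. With an inexact learned value function, the same rollout carries no such guarantee.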

Why Plans? • In practice, it is very hard to learn useful value functions: high dimensional (curse of dimensionality); value features don’t generalize well to new environments; hard to build in useful domain knowledge. • Instead, can take a planning approach: lots of domain knowledge; costs *do* generalize. • But: computationally hard; the curse of dimensionality strikes back.

Hybrid Algorithm • A new extension of Maximum Margin Planning to do structured regression: predict values with a planner in the loop. [Figure labels: “value” features; learned linear combination; learned space of costs; planner; space of values (high dimensional)]

Proto-results • Demonstrated an earlier algorithm (MMPBoost) on learning a heuristic • Provided orders of magnitude faster planning • Adapting now to the higher dimensional space of footsteps instead of heuristic • Hope to bridge the gap: reinforcement learning/value-function approximation with the key benefits of planning and cost-functions

Memory-Based Policy Learning • 1. Remember plans (all have same goal). • 2. Remember refined plans. • 3. Remember plans (many goals) – need planner to create policy. • 4. Remember experiences – need planner to create policy. • We are currently investigating option 1. We will explore options 2, 3, and 4.

Plan Libraries • Combine local plans to create global policy. • Local planners: decoupled planner, A*, RRT, DDP, DIRCOL. • Remember refined plans, experiences
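
A plan library can act as a policy by returning the action stored with the nearest remembered state. The sketch below assumes states are numeric tuples compared by Euclidean distance and that each plan is a list of states with matching actions; the class and method names are hypothetical.

```python
import math

class PlanLibrary:
    """Nearest-neighbor policy over remembered (state, action) pairs."""

    def __init__(self):
        self.entries = []  # list of (state, action)

    def add_plan(self, states, actions):
        """Remember one plan: the action taken at each visited state."""
        self.entries.extend(zip(states, actions))

    def policy(self, state):
        """Return the action stored with the nearest remembered state."""
        if not self.entries:
            raise ValueError("plan library is empty")
        nearest = min(self.entries, key=lambda e: math.dist(e[0], state))
        return nearest[1]
```

A linear scan is enough for a sketch; a larger library would likely need a spatial index such as a k-d tree.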

Forward Planning To Generate Trajectory Library [Figure panels: trajectory library, single A* search]

A Plan Library For Little Dog

Remembering Refined Plans [Figure panels: commands and errors, before and after]

Future Tests
