
  1. Active Imitation Learning via State Queries. Kshitij Judah, Alan Fern, Tom Dietterich. School of EECS, Oregon State University.

  2. Preliminaries
  • A Markov Decision Process (MDP) is a tuple (S, A, T, R, s0), where
  • S is the set of states
  • A is the set of actions
  • T(s' | s, a) is the transition function, denoting the probability of transitioning to state s' after taking action a in state s
  • R(s) is the reward function, giving the reward received in state s
  • s0 is the initial state
  • A stationary policy is a mapping from states to actions
  • The H-horizon value of a policy is the expected total reward of trajectories that start at s0 and follow the policy for H steps
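To make these definitions concrete, here is a minimal Python sketch of an MDP interface and a Monte-Carlo estimate of the H-horizon value; the class, argument names, and rollout count are illustrative, not from the paper.

```python
class MDP:
    """Minimal MDP interface: states, actions, transition sampling, reward, start state."""
    def __init__(self, states, actions, transition, reward, s0):
        self.states = states          # set of states S
        self.actions = actions        # set of actions A
        self.transition = transition  # transition(s, a) -> next state sampled from T(s' | s, a)
        self.reward = reward          # reward(s) -> immediate reward R(s)
        self.s0 = s0                  # initial state s0

def h_horizon_value(mdp, policy, H, num_rollouts=1000):
    """Estimate the H-horizon value of `policy`: the expected total reward of
    trajectories that start at s0 and follow the policy for H steps."""
    total = 0.0
    for _ in range(num_rollouts):
        s = mdp.s0
        for _ in range(H):
            total += mdp.reward(s)
            s = mdp.transition(s, policy(s))
    return total / num_rollouts
```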

  3. Passive Imitation Learning (diagram: the teacher provides trajectory data, which a supervised learning algorithm turns into a classifier for the learner) GOAL: To learn a policy whose H-horizon value is not much worse than that of the teacher's policy.
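A minimal sketch of this passive setup, assuming teacher trajectories arrive as lists of (state, action) pairs and using scikit-learn's logistic regression as the supervised learner (the talk does not prescribe a particular classifier):

```python
from sklearn.linear_model import LogisticRegression

def passive_imitation_learning(teacher_trajectories, featurize):
    """Fit a classifier (policy) to the (state, action) pairs collected along the
    teacher's own trajectories.  `featurize` maps a state to a feature vector."""
    X = [featurize(s) for traj in teacher_trajectories for (s, a) in traj]
    y = [a for traj in teacher_trajectories for (s, a) in traj]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    # The learned policy predicts the teacher's action for a given state.
    return lambda s: clf.predict([featurize(s)])[0]
```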

  4. Passive Imitation Learning (same diagram) • DRAWBACK: Generating such trajectories can be tedious and may even be impractical, e.g., real-time low-level control of multiple game agents!!

  5. Active Imitation Learning via State Queries (diagram: the learner uses the simulator and its current training data of (s, a) pairs to select the best state query; the teacher responds with the correct action to take in the queried state)

  6. Active Imitation Learning via State Queries (diagram: the teacher may instead respond "This is a bad state which I would never visit!! I choose not to suggest any action", returning a bad-state signal rather than an (s, a) pair). A sketch of this query loop appears below.
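Slides 5 and 6 together describe the active query loop. The following sketch is one plausible rendering of that loop; `select_best_query`, `train_classifier`, and the teacher/simulator interfaces are placeholders for components discussed later in the talk.

```python
def active_imitation_learning(teacher, simulator, select_best_query,
                              train_classifier, num_queries):
    """Active imitation learning via state queries: repeatedly pick a state,
    ask the teacher for the correct action there, and retrain.  The teacher may
    instead flag the state as bad (a state it would never visit), in which case
    no action label is obtained."""
    data = []           # current training data: (state, action) pairs
    bad_states = set()  # states the teacher declared bad
    policy = train_classifier(data)
    for _ in range(num_queries):
        s = select_best_query(policy, simulator, data, bad_states)
        response = teacher.query(s)          # either an action or "bad"
        if response == "bad":
            bad_states.add(s)                # no action label, only the bad-state signal
        else:
            data.append((s, response))
            policy = train_classifier(data)  # retrain on the enlarged data set
    return policy
```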

  7. Bad State Response (example: a Wargus agent queries a state that the Wargus expert considers bad, so the expert returns a bad-state response instead of an action)

  8. Bad State Response (example: a helicopter-flying agent queries a state that the expert pilot considers bad, so the pilot returns a bad-state response instead of an action)

  9. Bad State Response • It is important to minimize bad state queries!! • Challenge: how to combine action uncertainty and bad-state likelihood when selecting the best state query • We provide a principled approach based on noiseless Bayesian active learning

  10. Relation to Passive Imitation Learning • It is possible to simulate passive imitation learning via state queries (diagram: N trajectories of data fed to a supervised learning algorithm); one such simulation is sketched below
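One reading of this slide is that a passive learner can be emulated by always querying the state reached by following the teacher's own answers in the simulator; the sketch below assumes that reading and a `reset`/`step` simulator interface of my own naming.

```python
def simulate_passive_learning(teacher, simulator, H, N):
    """Simulate passive imitation learning with state queries: starting from the
    initial state, query the teacher for an action, step the simulator with that
    action, and query the resulting state, for H steps and N trajectories."""
    data = []
    for _ in range(N):
        s = simulator.reset()        # back to the initial state s0
        for _ in range(H):
            a = teacher.query(s)     # along its own trajectory the teacher
            data.append((s, a))      # always has an action to suggest
            s = simulator.step(s, a)
    return data
```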

  11. Relation to I.I.D. Active Learning (diagram: the same query loop, but i.i.d. active learning assumes a single known target distribution over states)

  12. Relation to I.I.D. Active Learning • Applying i.i.d. active learning uniformly over the entire state space leads to poor performance: queries land in uncertain states that are also bad!!

  13. Noiseless Bayesian Active Learning (BAL) (setup: a set of hypotheses, a set of tests, and the outcome each hypothesis predicts for each test) • Goal: identify the true hypothesis with as few tests as possible • We employ a form of generalized binary search (GBS) in this work; a sketch follows below
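Slide 13's setup (hypotheses, tests, outcomes) can be sketched as greedy generalized binary search: repeatedly run the test that splits the remaining posterior mass most evenly and discard inconsistent hypotheses. This is the textbook greedy form, not necessarily the exact variant used in the paper.

```python
def generalized_binary_search(hypotheses, prior, tests, outcome, run_test):
    """Noiseless Bayesian active learning via greedy generalized binary search.
    `outcome(h, t)` is the outcome hypothesis h predicts for test t;
    `run_test(t)` performs the test and returns the true outcome."""
    version_space = {h: prior[h] for h in hypotheses}
    while len(version_space) > 1:
        def balance(t):
            # Largest outcome mass under test t; smaller means a more even split.
            mass = {}
            for h, p in version_space.items():
                mass[outcome(h, t)] = mass.get(outcome(h, t), 0.0) + p
            return max(mass.values())
        t = min(tests, key=balance)
        observed = run_test(t)
        # Noiseless setting: keep only hypotheses consistent with the observed outcome.
        version_space = {h: p for h, p in version_space.items()
                         if outcome(h, t) == observed}
    return next(iter(version_space))
```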

  14.–26. BAL for Deterministic MDPs (animation sequence of figures: candidate policies correspond to paths from the initial state, state queries play the role of tests, and teacher responses are the test outcomes) GOAL: Determine the path corresponding to the teacher's policy by performing tests (state queries) whose outcomes are the teacher's responses. A sketch of the resulting path-elimination step follows below.
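Under this deterministic-MDP view, a state query prunes the candidate paths: an action response rules out paths that visit the state but act differently there, while a bad-state response rules out every path through the state. A small sketch of that pruning step, with an illustrative path representation (a dict from visited states to actions):

```python
def eliminate_paths(paths, state, response):
    """Update the set of candidate paths after a state query in a deterministic MDP.
    Each path is a dict mapping every state it visits to the action taken there.
    `response` is either an action or the string "bad"."""
    if response == "bad":
        # The teacher would never visit `state`: drop every path passing through it.
        return [p for p in paths if state not in p]
    # The teacher's action at `state` is known: drop paths that visit the state
    # but take a different action there; paths that never visit it stay consistent.
    return [p for p in paths if p.get(state, response) == response]
```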

  27. Imitation Query-by-Committee (IQBC) for Large MDPs (diagram: the labeled (s, a) pairs are bootstrap-resampled into K samples; a supervised learner is trained on each sample, each resulting policy is run in the simulator to produce its path, and generalized binary search is applied over paths 1 through K); a sketch follows below
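A hedged sketch of the committee construction described on this slide, assuming scikit-learn classifiers and the same `reset`/`step` simulator interface as above; the helper names and the choice of logistic regression are mine.

```python
import random
from sklearn.linear_model import LogisticRegression

def build_committee(labeled_data, featurize, simulator, H, K=10):
    """Imitation Query-by-Committee: train K classifiers on bootstrap resamples of
    the labeled (state, action) pairs, then run each one in the simulator for H
    steps to obtain the path it would follow.  Generalized binary search is then
    applied over the committee members' paths to pick the next state query."""
    committee, paths = [], []
    for _ in range(K):
        sample = [random.choice(labeled_data) for _ in range(len(labeled_data))]
        X = [featurize(s) for s, a in sample]
        y = [a for s, a in sample]
        clf = LogisticRegression(max_iter=1000).fit(X, y)
        committee.append(clf)
        # Simulate this member's policy to obtain its path of (state, action) pairs.
        path, s = [], simulator.reset()
        for _ in range(H):
            a = clf.predict([featurize(s)])[0]
            path.append((s, a))
            s = simulator.step(s, a)
        paths.append(path)
    return committee, paths
```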

  28. Action Uncertainty versus Bad-State Trade-off • The query-selection objective can be rewritten in a form whose terms are: the posterior probability mass of hypotheses that go through s (i.e., the posterior probability of the target policy visiting s), the entropy of the multinomial distribution over actions at s (i.e., the uncertainty over action choices at s), and a small bonus term; a hedged sketch of this decomposition follows below
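One plausible reading of this decomposition, treating each committee member's path as an equally weighted hypothesis: score a state by the fraction of members whose paths visit it times the entropy of their action votes there, plus a small bonus. This is a sketch of that reading, not the paper's exact formula (the bonus term is truncated in the transcript and left as a constant placeholder).

```python
import math
from collections import Counter

def query_score(state, paths, bonus=1e-3):
    """Hypothetical query score mirroring the slide-28 decomposition:
    (posterior probability that the target policy visits `state`) times
    (entropy of the committee's action votes at `state`), plus a small
    placeholder bonus term.  Not the paper's exact formula."""
    path_maps = [dict(p) for p in paths]               # (state, action) pairs -> dict
    visiting = [p for p in path_maps if state in p]
    p_visit = len(visiting) / len(path_maps)            # posterior mass of hypotheses through s
    if not visiting:
        return bonus                                    # no committee member visits s
    votes = Counter(p[state] for p in visiting)         # multinomial over actions chosen at s
    total = sum(votes.values())
    entropy = -sum((c / total) * math.log(c / total) for c in votes.values())
    return p_visit * entropy + bonus
```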

  29. Stochastic MDPs • We use a Pegasus-style determinization approach to handle stochastic MDPs (Ng & Jordan, UAI 2000) • Details are in the paper!!
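Pegasus (Ng & Jordan, UAI 2000) determinizes a stochastic simulator by fixing the random numbers it will consume in advance, so repeated runs of a policy trace the same trajectory; a minimal sketch of that idea with an illustrative interface:

```python
import random

class DeterminizedSimulator:
    """Pegasus-style determinization: pre-sample the random draws the stochastic
    simulator would consume, so that re-running the same policy always follows the
    same trajectory and the deterministic-MDP machinery above applies."""
    def __init__(self, stochastic_step, s0, H, seed=0):
        rng = random.Random(seed)
        self.noise = [rng.random() for _ in range(H)]  # one fixed random draw per step
        self.stochastic_step = stochastic_step          # step(s, a, u) with noise input u
        self.s0 = s0
        self.t = 0

    def reset(self):
        self.t = 0
        return self.s0

    def step(self, s, a):
        u = self.noise[self.t]   # reuse the same pre-sampled noise at this time step
        self.t += 1
        return self.stochastic_step(s, a, u)
```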

  30. Experiments
  • We performed experiments in two domains: a grid world with pits, and cart pole
  • We compared IQBC against the following baselines:
  • Random: selects states to query uniformly at random
  • Standard QBC (SQBC): treats all states as i.i.d. and applies standard uncertainty-based QBC
  • Passive imitation learning (Passive): simulates standard passive imitation learning
  • Confidence-based autonomy (CBA) (Chernova & Veloso, JAIR 2009): executes its policy until confidence falls below an automatically adjusted threshold, at which point the learner queries the teacher for an action, updates its policy and threshold, and resumes execution; performance can be quite sensitive to the threshold adjustment

  31. Grid World With Pits (figure: a 30 × 30 grid containing pits and a goal)

  32. Teacher Types • Generous: always responds with an action • Strict: declares states far away from the states visited by the teacher as bad states

  33. Grid World With Pits: Results “Generous” teacher

  34. Grid World With Pits: Results “Strict” teacher

  35. Cart Pole • State = (cart position, cart velocity, pole angle, pole angular velocity) • Actions = left or right • Bounds on the cart position and pole angle are [-2.4, 2.4] and [-90°, 90°] respectively
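For reference, a small check of the failure condition implied by the bounds on this slide, assuming the standard four-dimensional cart-pole state and angle bounds in degrees:

```python
def cart_pole_failed(state):
    """Return True if the cart-pole episode has failed: the cart position has left
    [-2.4, 2.4] or the pole angle (in degrees) has left [-90, 90]."""
    cart_position, cart_velocity, pole_angle, pole_angular_velocity = state
    return abs(cart_position) > 2.4 or abs(pole_angle) > 90.0
```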

  36. Cart Pole: Results “Generous” teacher

  37. Cart Pole: Results “Strict” teacher

  38. Future Work • Develop policy optimization algorithms that take bad-state responses and other forms of teacher input into account • Query short sequences of states rather than single states • Consider more application areas such as structured prediction and other RL domains • Conduct studies with human teachers
