RL via Practice and Critique Advice
Kshitij Judah, Saikat Roy, Alan Fern and Tom Dietterich
Reinforcement Learning
• PROBLEM: RL takes a long time to learn a good policy.
• RESEARCH QUESTION: Can we make RL perform better with some outside help, such as critique advice from a teacher, and if so, how?
• DESIDERATA:
• Non-technical users as teachers
• Natural interaction methods

Earlier Work
• High-level rules as advice for RL: in the form of programming-language constructs (Maclin and Shavlik 1996), rules about action and utility preferences (Maclin et al. 2005), or logical rules derived from a constrained natural language (Kuhlmann et al. 2004).
• Learning by Demonstration (LBD): the user provides full demonstrations of a task that the agent can learn from (Billard et al. 2008). More recent work (Coates, Abbeel, and Ng 2008) adds model learning to improve on demonstrations but does not allow users to provide feedback. Argall, Browning, and Veloso (2007; 2008) combine LBD with human critiques of behavior (similar to our work here), but there is no autonomous practice.
• Real-time feedback from the user: the TAMER framework (Knox and Stone 2009) uses a type of supervised learning to predict the user's reward signal and then selects actions to maximize the predicted reward. Thomaz and Breazeal (2008) instead combine the end-user reward signal with the environmental reward and use Q-learning.

Solution Approach: RL via Practice + Critique Advice
• The agent alternates between practice and critique. Act: the current policy takes actions in the simulator (environment), producing behavior and reward. Get Experience: the resulting trajectory data T is recorded. Get Critique: the user (teacher) watches the behavior through an advice interface and provides critique data C. Estimate Utility and Learn: the policy is updated from T and C by optimization using gradient ascent, and the loop repeats.
• Features:
• Allows feedback and guidance advice
• Allows practice
• A novel approach to learning from critique advice and practice
• Problem: given data sets T and C, how can we update the agent's policy so as to maximize its reward? Our objective combines an expected utility term U(θ,T) estimated from the trajectory data with a critique data loss L(θ,C), weighted by a parameter λ. Two questions must be answered: what are the forms of U and L, and how do we pick the value of λ?

Expected Utility U(θ,T)
• Trajectory data T: trajectories collected during practice, possibly generated by earlier versions of the policy (off-policy data).
• We use likelihood weighting to estimate the utility U(θ,T) of policy π_θ using off-policy trajectories T (Peshkin and Shelton 2002). Let p_θ(τ) be the probability of generating trajectory τ under π_θ, and let θ_τ be the parameters of the policy that generated τ. An unbiased utility estimate is given by
  Û(θ,T) = (1/|T|) Σ_{τ ∈ T} w_θ(τ) R(τ), where w_θ(τ) = p_θ(τ) / p_{θ_τ}(τ)
  and R(τ) is the total reward of trajectory τ. The transition probabilities cancel in the ratio, so w_θ(τ) depends only on the action probabilities of the two policies.
• The gradient of Û(θ,T) has a compact closed form. (A code sketch of this estimator is given at the end of this section.)

Critique Data Loss L(θ,C)
• Critique data C: states in which the user has labeled some actions as good or bad. Let O(s) denote the set of all optimal actions in state s: all actions in O(s) are equally good, and any action not in O(s) is suboptimal.
• Learning goal: find a probabilistic policy that has a high probability of returning an action in O(s) when applied to s. It is not important which action is selected, as long as the probability of selecting an action in O(s) is high.
• We call this problem Any-Label Learning (ALL). ALL likelihood: the probability that the policy selects an action from O(s) at every labeled state,
  Π_{s} Σ_{a ∈ O(s)} π_θ(a|s).
• The Multi-Label Learning problem (Tsoumakas and Katakis 2007) differs in that the goal there is to learn a classifier that outputs all of the labels in each set and no others.

Expected Any-Label Learning
• An ideal teacher would reveal O(s) exactly. Reality: there does not exist an ideal teacher!
• Key idea: define a user model that induces a distribution over ALL problems.
• User model: a distribution over the sets O(s) given the critique data, assuming independence among different states. We introduce two noise parameters (allowing for user labeling errors) and one bias parameter (the probability that an unlabeled action is in O(s)).
• Expected ALL likelihood: the expectation of the ALL likelihood under the user model,
  E_{O ~ P(O | C)} [ Π_{s} Σ_{a ∈ O(s)} π_θ(a|s) ].
• The likelihood has a closed form: by the independence assumption and linearity of expectation it factors as
  Π_{s} Σ_{a} P(a ∈ O(s) | C) π_θ(a|s).
  The critique data loss L(θ,C) penalizes policies with low expected ALL likelihood. (See the sketch after this section.)

Choosing λ using Validation
• How do we pick the value of λ, which trades off the practice data against the critique data? We choose λ using validation over candidate values, keeping the value whose resulting policy performs best. (A sketch of the combined optimization appears below.)
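A minimal sketch of the likelihood-weighting utility estimator described above, assuming a softmax (Gibbs) policy over state-action features and a simple trajectory format; these parameterization choices are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def softmax_policy(theta, phi_sa):
    """Action probabilities of a softmax (Gibbs) policy.
    phi_sa: (num_actions, num_features) feature matrix for one state."""
    scores = phi_sa @ theta
    scores -= scores.max()                     # numerical stability
    p = np.exp(scores)
    return p / p.sum()

def traj_log_prob(theta, trajectory):
    """Log probability of the action choices in a trajectory under pi_theta.
    trajectory: list of (phi_sa, action_index, reward) tuples.
    Transition probabilities cancel in the importance ratio, so only the
    policy's action probabilities are needed."""
    return sum(np.log(softmax_policy(theta, phi)[a]) for phi, a, _ in trajectory)

def estimated_utility(theta, trajectories, behavior_thetas):
    """Likelihood-weighted (importance-sampled) utility estimate U_hat(theta, T).
    behavior_thetas[i] are the parameters of the policy that generated trajectories[i]."""
    total = 0.0
    for tau, theta_b in zip(trajectories, behavior_thetas):
        log_w = traj_log_prob(theta, tau) - traj_log_prob(theta_b, tau)
        R = sum(r for _, _, r in tau)          # total reward of the trajectory
        total += np.exp(log_w) * R
    return total / len(trajectories)
```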
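The closed form of the expected ALL likelihood can be sketched directly, reusing softmax_policy from the previous sketch. The code assumes the user model has already been reduced to per-action membership probabilities q[a] = P(a ∈ O(s) | C), and it takes the critique loss to be the negative expected ALL log-likelihood; that loss form is our reading of the poster, not a verbatim formula.

```python
import numpy as np
# softmax_policy as defined in the previous sketch

def expected_all_log_likelihood(theta, critique_states):
    """Expected Any-Label-Learning log-likelihood.
    critique_states: list of (phi_sa, q) pairs, where q[a] = P(a in O(s) | C)
    comes from the user model (noise and bias parameters already folded in)."""
    ll = 0.0
    for phi_sa, q in critique_states:
        pi = softmax_policy(theta, phi_sa)
        # E[ sum_{a in O(s)} pi(a|s) ] = sum_a P(a in O(s)|C) * pi(a|s)
        ll += np.log(np.dot(q, pi) + 1e-12)
    return ll

def critique_loss(theta, critique_states):
    """Critique data loss L(theta, C): negative expected ALL log-likelihood
    (one plausible instantiation of the loss)."""
    return -expected_all_log_likelihood(theta, critique_states)
```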
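Finally, a sketch of the combined optimization and of choosing λ by validation, building on the two sketches above. The specific combination U − λ·L, the finite-difference gradient, the candidate λ values, and the validation rule (scoring each trained policy by its estimated utility on the trajectory data) are illustrative assumptions; the poster states only that the objective is optimized by gradient ascent (with a compact closed-form gradient) and that λ is chosen using validation.

```python
import numpy as np
# estimated_utility and critique_loss as defined in the previous sketches

def combined_objective(theta, trajectories, behavior_thetas, critique_states, lam):
    """Practice + critique objective: estimated utility minus weighted critique loss."""
    return (estimated_utility(theta, trajectories, behavior_thetas)
            - lam * critique_loss(theta, critique_states))

def numeric_gradient(f, theta, eps=1e-5):
    """Finite-difference gradient (a stand-in for the closed-form gradient)."""
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def learn_policy(theta0, trajectories, behavior_thetas, critique_states,
                 lambdas=(0.1, 1.0, 10.0), steps=200, lr=0.01):
    """Gradient ascent for each candidate lambda; keep the policy whose estimated
    utility on the trajectory data is highest (a simple validation rule)."""
    best_theta, best_score = None, -np.inf
    for lam in lambdas:
        theta = theta0.copy()
        f = lambda th: combined_objective(th, trajectories, behavior_thetas,
                                          critique_states, lam)
        for _ in range(steps):
            theta = theta + lr * numeric_gradient(f, theta)
        score = estimated_utility(theta, trajectories, behavior_thetas)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta
```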
Experimental Setup
• Our domain: RTS tactical micro-management — 5 friendly footmen versus 5 enemy footmen (Wargus AI).
• Difficulty: fast pace and multiple units acting in parallel.
• Our setup: end-users are given an interface that allows them to watch a battle and pause at any moment. The user can then scroll back and forth within the episode and mark any possible action of any agent as good or bad.
• Available actions for each military unit are to attack any of the units on the map (enemy or friendly), giving a total of 9 actions per unit.
• Two battle maps, which differed only in the initial placement of the units.
• We evaluated 3 learning systems:
• Pure RL: only practice
• Pure Supervised: only critique
• Combined System: critique + practice

User Studies
• The user study involved 10 end-users: 6 with CS backgrounds and 4 without a CS background.
• For each user, the study consisted of teaching both the pure supervised and the combined systems, each on a different map, for a fixed amount of time (Supervised: 30 minutes, Combined: 60 minutes).
• These results show that the users were able to significantly outperform pure RL using both the supervised and combined systems.
• The end-users had slightly greater success with the pure supervised system than with the combined system, because of:
• the large delay experienced while waiting for the practice stages to end, and
• the policy returned by practice sometimes being poor and ignoring the advice, which users found frustrating.
• Lesson learned: such behavior is detrimental to the user experience and overall performance.
• Future work: a better-behaving combined system, and studies in which users are not captive during the practice stages.

Advice Patterns of Users
• Positive (or negative) advice is where the user only gives feedback on the action taken by the agent.
• Mixed advice is where the user not only gives feedback on the agent's action but also suggests alternative actions to the agent.
• [Figure: fraction of positive, negative, and mixed advice per user.]

Simulated Experiments
• Goal: test learning capability with varying amounts of critique and practice data.
• A total of 2 users per map. For each user, the critique data was divided into 4 equal-sized segments, creating four data sets per user containing 25%, 50%, 75%, and 100% of that user's critique data.
• We provided the combined system with each of these data sets and allowed it to practice for 100 episodes.

[Figures: advice interface screenshot; results for the supervised and combined systems on Map 1 and Map 2.]