Incorporating Advice into Agents that Learn from Reinforcement Presented by Alp Sardağ
Problem of RL • Reinforcement Learning usually requires a large number of training episodes. HOW TO OVERCOME THIS? • Two approaches: • Implicit representation of the utility function • Allowing the Q-learner to accept advice at any time and in a natural manner.
Input Generalization • To learn to play a game (chess has roughly 10^120 states), it is impossible to visit all of these states. • Implicit representation of the function: a form that allows the output to be calculated for any input, and that is much more compact than a table. Example: U(i) = w1f1(i) + w2f2(i) + ... + wnfn(i)
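A minimal sketch of such an implicit representation, assuming hand-chosen feature functions f1...fn (the feature values and weights below are illustrative, not from the paper):

```python
import numpy as np

def utility(state_features: np.ndarray, weights: np.ndarray) -> float:
    """Implicit utility U(i) = w1*f1(i) + w2*f2(i) + ... + wn*fn(i).

    state_features -- vector [f1(i), ..., fn(i)] computed from state i
    weights        -- learned weight vector [w1, ..., wn]
    """
    return float(np.dot(weights, state_features))

# Illustrative example with three hand-crafted features of a state
features = np.array([0.4, -1.2, 0.7])   # f1(i), f2(i), f3(i)
weights = np.array([2.0, 0.5, -1.0])    # w1, w2, w3
print(utility(features, weights))       # approximately -0.5
```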
Connectionist Q-learning • Since the function to be learned is characterized by a vector of weights w, neural networks are obvious candidates for learning those weights. The new update rule: w ← w + α (r + Uw(j) − Uw(i)) ∇wUw(i) Note: TD-Gammon learned to play better than Neurogammon.
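For a linear representation like the one above, the gradient ∇wUw(i) is just the feature vector of state i, so the update reduces to the sketch below (the learning rate α and the feature vectors are illustrative assumptions):

```python
import numpy as np

def td_update(weights, feat_i, feat_j, reward, alpha=0.1):
    """One update of w <- w + alpha * (r + U_w(j) - U_w(i)) * grad_w U_w(i).

    For a linear U_w, grad_w U_w(i) is simply feat_i, the feature vector of
    the current state i; feat_j describes the successor state j.
    """
    u_i = np.dot(weights, feat_i)     # current estimate U_w(i)
    u_j = np.dot(weights, feat_j)     # estimate for the successor state U_w(j)
    td_error = reward + u_j - u_i     # temporal-difference error
    return weights + alpha * td_error * feat_i
```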
Example of Advice Taking Advice: Don’t go into box canyons when opponents are in sight.
General Structure of the RL Learner A connectionist Q-learner augmented with advice-taking.
Connectionist Q-learning • Q(a,i): the utility function, which maps state-action pairs to numeric values. • Given a perfect version of this function, the optimal plan is simply to choose, in each state that is reached, the action with the maximum utility. • The utility function is implemented as a neural network whose inputs describe the current state and whose outputs give the utility of each action.
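A minimal sketch of this arrangement, assuming one hidden layer with illustrative sizes and randomly initialized weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# State features in, one utility value per action out (sizes are illustrative).
N_INPUTS, N_HIDDEN, N_ACTIONS = 8, 6, 4
W1 = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS))
W2 = rng.normal(scale=0.1, size=(N_ACTIONS, N_HIDDEN))

def action_utilities(state: np.ndarray) -> np.ndarray:
    """Forward pass: the network's estimate of Q(a, i) for every action a."""
    hidden = np.tanh(W1 @ state)
    return W2 @ hidden

def greedy_action(state: np.ndarray) -> int:
    """With a perfect utility function, simply pick the maximum-utility action."""
    return int(np.argmax(action_utilities(state)))

state = rng.normal(size=N_INPUTS)   # stand-in for the agent's sensor readings
print(greedy_action(state))
```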
Step 1 in Advice Taking • Request the advice: instead of having the learner request advice, the external observer provides advice whenever the observer feels it is appropriate. There are two reasons for this: • It places less of a burden on the observer. • It is an open question how to create the best mechanism for having an RL agent recognize its need for advice.
Step 2 in Advice Taking • Convert the advice to an internal representation: because of the complexities of natural language processing, the external observer expresses its advice using a simple programming language and a list of task-specific terms. • Example (see the sketch below):
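A hedged sketch of what the box-canyon advice might look like once expressed with task-specific terms; the term names and rule structure below are illustrative assumptions, not the paper's actual grammar:

```python
# One possible internal representation of the advice as a condition-action rule.
advice_rule = {
    "if": ["OpponentIsInSight", "AheadIsBoxCanyon"],   # task-specific terms
    "then": ("discourage", "MoveForward"),             # lower that action's utility
}

def rule_fires(rule, perceived_terms):
    """The rule applies only when every condition term is currently perceived."""
    return all(term in perceived_terms for term in rule["if"])
```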
Step 3 in Advice Taking • Convert the advice into a usable form: using techniques from knowledge compilation, the learner can convert high-level advice into a collection of directly interpretable statements.
Step 4 in Advice Taking • Use ideas from knowledge-based neural networks: install the operationalized advice into the connectionist representation of the utility function. • A ruleset is converted into a network by mapping the "target concepts" of the ruleset to output units and creating hidden units that represent the intermediate conclusions. • Rules are installed into the network incrementally (see the sketch below).
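A rough sketch of installing a single conjunctive rule into the utility network in the knowledge-based (KBANN-style) manner; the weight magnitude omega and the bias scheme follow the usual KBANN encoding of a conjunction, and the array layout is an assumption of this sketch:

```python
import numpy as np

def install_conjunctive_rule(W_in, W_out, b_hidden, condition_idxs, action_idx, omega=4.0):
    """Add one hidden unit encoding 'IF all conditions THEN boost this action'.

    W_in     -- hidden-layer input weights, shape (n_hidden, n_inputs)
    W_out    -- output-layer weights, shape (n_actions, n_hidden)
    b_hidden -- hidden-layer biases, shape (n_hidden,)

    The new unit gets weight +omega from each antecedent input and a bias of
    -(n_antecedents - 0.5) * omega, so a sigmoid unit only switches on when
    every antecedent is active; a +omega link from the unit then raises the
    advised action's output (its utility).
    """
    n_inputs = W_in.shape[1]
    new_in = np.zeros(n_inputs)
    new_in[condition_idxs] = omega
    W_in = np.vstack([W_in, new_in])
    b_hidden = np.append(b_hidden, -(len(condition_idxs) - 0.5) * omega)

    new_out = np.zeros((W_out.shape[0], 1))
    new_out[action_idx, 0] = omega
    W_out = np.hstack([W_out, new_out])
    return W_in, W_out, b_hidden
```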
Example Cont. With the advice added, note that the inputs and outputs of the network remain unchanged; the advice only changes how the function from states to the utility of actions is calculated.
Example Cont. A multistep plan:
Example Cont. A multistep plan embedded in a REPEAT:
Example Cont. Advice that involves previously defined terms:
Judge the Value of Advice • Once the advice is inserted, the RL agent returns to exploring its environment, thereby integrating and refining the advice. • In some circumstances, such as a game learner that can play against itself, it would be straightforward to evaluate the advice empirically. • It would also be possible to allow the observer to retract or counteract bad advice.
Test Bed Test environment: (a) a sample configuration; (b) a sample division of the environment into sectors; (c) the distances measured by the agent's sensors; (d) a neural network that computes the utility of actions.
Methodology • The agents are trained for a fixed number of episodes in each experiment. • An episode consists of placing the agent into a randomly generated initial environment and allowing it to explore until it is captured or a threshold of 500 steps is reached. • The environment contains a 7x7 grid, 15 obstacles, 3 enemy agents, and 10 rewards. • 3 randomly generated environments are used. • 10 randomly initialized networks are used. • Average total reinforcement is measured by freezing the network and measuring the average reinforcement on a test set (see the sketch below).
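A schematic of this train-and-evaluate protocol; the agent and environment interfaces used here (reset_random, step, choose_action, learn, freeze_weights) are placeholders invented for this sketch, not names from the paper:

```python
STEP_LIMIT = 500   # an episode ends on capture or after 500 steps

def run_training_episode(agent, env):
    """One episode: drop the agent into a random environment and let it explore."""
    state = env.reset_random()                  # randomly generated initial configuration
    for _ in range(STEP_LIMIT):
        action = agent.choose_action(state)
        state, reward, captured = env.step(action)
        agent.learn(reward, state)              # connectionist Q-learning update
        if captured:
            break

def average_test_reinforcement(agent, test_envs):
    """Freeze the network, then measure average total reinforcement on a test set."""
    agent.freeze_weights()
    totals = []
    for env in test_envs:
        state, total = env.reset_random(), 0.0
        for _ in range(STEP_LIMIT):
            state, reward, captured = env.step(agent.choose_action(state))
            total += reward
            if captured:
                break
        totals.append(total)
    return sum(totals) / len(totals)
```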
Test-Set Results The table above shows how well each piece of advice meets its intent.
Related Work Gordon and Subramanian (1994) developed a system similar to this one. Their agent accepts high-level advice of the form IF condition THEN ACHIEVE goal. It operationalizes these rules using its background knowledge about goal achievement. The resulting rules are then refined using genetic algorithms.