Reward Functions for Accelerated Learning Presented by Alp Sardağ
Why RL? • RL is a methodology of choice for learning in a variety of different domains. • Convergence property. • Potential biological relevance. • RL is good in: • Game playing • Simulations
Cause of Failure • The fundamental assumption of RL models is that the agent-environment interaction can be modeled as an MDP: • The agent (A) and the environment (E) are synchronized finite-state automata. • A and E interact in discrete time intervals. • A can sense the state of E and use it to act. • After A acts, E transitions to a new state. • A receives a reward after performing an action.
States vs. Descriptors • Traditional RL depends on accurate state information, whereas in physical robot environments: • Even for the simplest agents, the state space is very large. • Sensor inputs are noisy. • The agent usually perceives only local information.
Transitions vs. Events • World and agent states change asynchronously, in response to events, not all of which are caused by the agent. • The same event can vary in duration under different circumstances and have different consequences. • Nondeterministic and stochastic models are closer to the real world. However, the information needed to establish a stochastic model is usually not available.
Learning Trials • Generating a complete policy requires searching a large part of the state space. • In the real world, the agent cannot choose which states it will transition to, and cannot visit all states. • Convergence in the real world depends on focusing only on the relevant parts of the state space. • The better the problem is formulated, the fewer learning trials are needed.
Reinforcement vs. Feedback • Current RL work uses two types of reward: • Immediate • Delayed • Real world situations tend to fall in between the two popular extremes. • Some immediate rewards • Plenty of intermittent rewards • Few very delayed rewards
Multiple Goals • Traditional RL deals with specialized problems in which the learning task can be specified with a single goal. The problems: • Only a very specific task is learned • It conflicts with any future learning • The extensions: • Sequentially formulated goals, where the state space explicitly encodes which goals have been reached so far. • A separate state space and reward function for each goal. • W-learning: competition among selfish Q-learners.
Goal • Given the complexity and uncertainty of real-world domains, the goal is a learning model that minimizes the state space and maximizes the amount of learning in each trial.
Intermediate Rewards • Intermittent rewards can be introduced by: • Reinforcing multiple goals and using progress estimators. • Heterogeneous reinforcement function: in the real world multiple goals exist, so it is natural to reinforce each one individually rather than a single monolithic goal.
Progress Estimators • Partial internal critics, each associated with a specific goal, that provide a metric of improvement relative to that goal. They are important in noisy worlds: • They decrease the learner's sensitivity to intermittent errors. • They encourage exploration; without them, the agent can thrash, repeatedly attempting inappropriate behaviors.
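As a minimal sketch of what a progress estimator could look like, the function below gives signed feedback depending on whether the robot moved closer to home between two measurements. The function name, the use of a distance measurement, and the sample numbers are assumptions made for illustration; the slides do not specify how the estimators are computed.

```python
def homing_progress(prev_distance, curr_distance, reward=1.0):
    """Hypothetical progress estimator for the goal "get home".

    Returns +reward if the distance to home shrank since the last
    measurement, -reward if it grew, and 0.0 if it did not change.
    """
    if curr_distance < prev_distance:
        return reward
    if curr_distance > prev_distance:
        return -reward
    return 0.0


# Made-up, noisy trace of distances to the home region.
distances = [5.0, 4.2, 4.5, 3.1, 2.0]
feedback = [homing_progress(a, b) for a, b in zip(distances, distances[1:])]
total = sum(feedback)
```

A single noisy reading (the step from 4.2 to 4.5) produces one negative signal, but the summed feedback stays positive, which illustrates the reduced sensitivity to intermittent errors.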
Experimental Design • To validate the proposed approach, experiments were designed to compare the new RL formulation with traditional RL. • Robots • Learning Task • Learning Algorithm • Control Algorithm
Robots • The experiments used four fully autonomous R2 mobile robots, each with: • A differentially steerable base • A gripper for lifting objects • Piezo-electric bump sensors for detecting contact collisions and monitoring the grasping force • A set of IR sensors for obstacle avoidance • Radio transceivers, used for determining absolute position
Robot Algorithm • The robots are programmed in the Behavior Language: • Based on the subsumption architecture. • A parallel control system formed of concurrently active behaviors, some of which gather information, some drive effectors, and some monitor progress and contribute reinforcement.
The Learning Task • The learning task consists of finding a mapping from conditions to behaviors that forms the most efficient policy for group foraging. • Basic behaviors from which to learn behavior selection: • Avoiding • Searching • Resting • Dispersing • Homing
The Learning Task Cont. • The state space can be reduced to the cross-product of the following state variables: • Have-puck? • At-home? • Near-intruder? • Night-time?
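The reduced state space can be enumerated directly. The sketch below assumes each of the four state variables is a boolean predicate (suggested by the question marks, but not stated outright), which gives 2^4 = 16 conditions. The identifier names are illustrative.

```python
from itertools import product

# The four state variables from the slide, assumed to be boolean predicates.
STATE_VARIABLES = ("have_puck", "at_home", "near_intruder", "night_time")


def enumerate_conditions():
    """Return every combination of the state variables as a dict."""
    return [
        dict(zip(STATE_VARIABLES, values))
        for values in product((False, True), repeat=len(STATE_VARIABLES))
    ]


conditions = enumerate_conditions()
print(len(conditions))
```

With only 16 conditions and a handful of behaviors, the whole condition-behavior table is small enough to be explored on a physical robot.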
The Learning Task Cont. • Some behaviors are instinctive (built in) because learning them has a high cost: • As soon as the robot detects a puck between its fingers, it grasps it. • As soon as the robot reaches the home region, it drops the puck if it is carrying one. • Whenever the robot is too near an obstacle, it avoids it.
The Learning Algorithm • The algorithm produces and maintains a matrix that stores the appropriateness of each behavior for each state. • The values in the matrix fluctuate over time based on received reinforcement and are updated asynchronously, whenever a reward is received.
The Learning Algorithm Cont. • The algorithm sums the reinforcement over time. • The influence of the different types of feedback is weighted by the values of the feedback constants.
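The summation can be sketched as a table indexed by condition-behavior pairs. The slide does not show the exact formula, so this sketch assumes the combined reinforcement is a weighted sum of three feedback signals (event-based E, and two progress estimators I and H, the symbols used in the experimental results). The weight names `u`, `v`, `w` and the class name are hypothetical.

```python
from collections import defaultdict


class ReinforcementTable:
    """Running sum of reinforcement for each (condition, behavior) pair."""

    def __init__(self, u=0.5, v=0.25, w=0.25):
        # Hypothetical feedback constants weighting E, I and H.
        self.u, self.v, self.w = u, v, w
        self.values = defaultdict(float)

    def combined(self, event=0.0, intruder=0.0, homing=0.0):
        """Weighted sum of the three feedback signals (assumed form)."""
        return self.u * event + self.v * intruder + self.w * homing

    def update(self, condition, behavior, **feedback):
        """Add the combined reinforcement to the pair's running sum."""
        r = self.combined(**feedback)
        self.values[(condition, behavior)] += r
        return r


table = ReinforcementTable()
table.update("have_puck", "homing", event=1.0, homing=1.0)
table.update("have_puck", "homing", event=-1.0)
print(table.values[("have_puck", "homing")])
```

Because the table only accumulates, values can rise and fall as rewards arrive, matching the fluctuating matrix described on the previous slide.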
The Control Algorithm • Whenever an event is detected, the following control sequence is executed: • Appropriate reinforcement is delivered for the current condition-behavior pair. • The current behavior is terminated. • Another behavior is selected. • Behaviors are selected according to the following rule: • Choose an untried behavior if one is available. • Otherwise, choose the best behavior.
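The control sequence and the selection rule above can be sketched as follows. The data layout (a dict of summed values keyed by condition-behavior pairs and a set of tried pairs) and all function names are assumptions made for the example; terminating the running behavior is left to the caller.

```python
def select_behavior(values, tried, condition, behaviors):
    """Choose an untried behavior if one exists, otherwise the best one."""
    for behavior in behaviors:
        if (condition, behavior) not in tried:
            return behavior
    return max(behaviors, key=lambda b: values.get((condition, b), 0.0))


def on_event(values, tried, condition, behavior, reinforcement,
             next_condition, behaviors):
    """Handle one detected event and return the next behavior to run."""
    key = (condition, behavior)
    # 1. Deliver reinforcement for the current condition-behavior pair.
    values[key] = values.get(key, 0.0) + reinforcement
    tried.add(key)
    # 2. The caller terminates the current behavior (not modeled here).
    # 3. Select the behavior to run in the condition the robot is now in.
    return select_behavior(values, tried, next_condition, behaviors)


values, tried = {}, set()
behaviors = ("searching", "homing")
first = on_event(values, tried, "c0", "searching", 1.0, "c0", behaviors)
second = on_event(values, tried, "c0", "homing", -1.0, "c0", behaviors)
```

In this run `first` is the untried behavior `"homing"`; once both behaviors have been tried, `second` falls back to the one with the highest summed value, `"searching"`.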
Experimental Results • The following three approaches are compared: • A monolithic single-goal (puck delivery to the home region) reward function using Q-learning: R(t)=P(t) • A heterogeneous reinforcement function using multiple goals: R(t)=E(t) • A heterogeneous reinforcement function using multiple goals and two progress estimator functions: R(t)=E(t)+I(t)+H(t)
Experimental Results Cont. • Values are collected twice per minute. • The final learning values are collected after a 15-minute run. • Convergence is defined in terms of the relative ordering of condition-behavior pairs.
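One way to read the ordering-based convergence criterion is that only the ranking of behaviors within each condition matters, not the magnitude of the values. The sketch below compares two value tables on that basis. This interpretation, the function names, and the sample numbers are assumptions, not the presenter's definition.

```python
def behavior_ranking(values, condition, behaviors):
    """Behaviors for one condition, sorted from highest to lowest value."""
    return sorted(
        behaviors,
        key=lambda b: values.get((condition, b), 0.0),
        reverse=True,
    )


def same_ordering(table_a, table_b, conditions, behaviors):
    """True if both tables rank the behaviors identically in every condition."""
    return all(
        behavior_ranking(table_a, c, behaviors)
        == behavior_ranking(table_b, c, behaviors)
        for c in conditions
    )


# Made-up snapshots: magnitudes differ, but the ranking is the same.
early = {("c", "searching"): 1.0, ("c", "homing"): 3.0}
late = {("c", "searching"): 10.0, ("c", "homing"): 20.0}
stable = same_ordering(early, late, ["c"], ["searching", "homing"])
```

Under this reading, a policy can be considered converged even while the summed values keep changing, as long as the ranking no longer changes.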
Evaluation • Given the nondeterminism and noisy sensor inputs, the single goal provides insufficient feedback. It was vulnerable to interference. • The second learning strategy outperforms the first because it detects the achievement of subgoals on the way to the top-level goal of depositing pucks at home. • The complete heterogeneous reinforcement with progress estimators outperforms the others because it uses all available information for every condition and behavior.
Additional Evaluation • Each part of the policy was evaluated separately, according to the following criteria: • Number of trials required • Correctness • Stability • Some condition-behavior pairs proved to be much more difficult to learn than others: • Those without a progress estimator • Those involving rare states
Discussion • Summing reinforcement • Scaling • Transition models
Summing Reinforcement • Allows for oscillations. • In theory, the more reinforcement the faster the learning. In practice noise and error could have the opposite effect. • The experiments described here demonstrate that even with a significant amount of noise, multiple reinforcers and progress estimators significantly accelerate learning.
Scaling • Interference was detrimental to all three approaches. • In terms of the amount of time required, the learned group foraging strategy outperformed hand-coded greedy agent strategies. • Foraging can be improved further by minimizing interference: only one robot moves at a time.
Transition Models • In noisy and uncertain environments, a transition model is not available to aid the learner. • The absence of a model made it difficult to compute discounted future reward. • Future work: applying this approach to problems that involve incomplete and approximate state transition models.