REINFORCEMENT LEARNING Group 11 Ashish Meena 04005006 Rohitashwa Bhotica 04005010 Hansraj Choudhary 04d05005 Piyush Kedia 04d05009
Outline • Introduction • Learning Models • Motivation • Reinforcement Learning Framework • Q-Learning Algorithm • Applications • Summary
Introduction • Machine Learning • Construction of programs that automatically improve with experience. • Types of Learning • Supervised Learning • Unsupervised Learning • Reinforcement Learning
Supervised Learning - Training data: (X,Y). (features, label) - Predict Y, minimizing some loss. - Regression, Classification. • Example • Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet and clinical measurements for that patient. (Logistic Regression)
Unsupervised Learning • Unsupervised Learning - Training data: X. (features only) - Find “similar” points in high-dim X-space. - Clustering. • Example • From the DNA micro-array data, determine which genes are most “similar” in terms of their expression profiles. (Clustering)
Reinforcement Learning • Training data: (S, A, R). (State, Action, Reward) • Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward.
Reinforcement Learning • [Diagram: the Agent, following its Policy, sends an Action to the Environment; the Environment returns the next State and a Reward.]
Motivation • Supervised and unsupervised learning fail in many situations. • Example: Flight control systems • Given the set of all sensor readings at a given time, decide how the flight controls should function. • For supervised learning, the correct labels for many feature combinations are unknown. • By performing trial-and-error interactions with the environment, reinforcement learning is capable of solving such problems.
Learning by Interaction • Learning to ride a bicycle • Actions: • Turn handle bars RIGHT. • Turn handle bars LEFT. • Rewards: • Positive if the cycle is perfectly balanced. • Negative if the angle between the cycle and the ground decreases
Inspiration from Natural Learning • Dopamine • Neurohormone occurring in a wide variety of animals, including both vertebrates and invertebrates • Regulates and controls behavior by inducing pleasurable effects equivalent to rewards • Motivates us to proactively perform certain activities
Markov Decision Processes • A discrete set of environment states S. • A discrete set of agent actions A. • At each discrete time step the agent observes state st ∈ S and chooses action at ∈ A • Receives reward rt • State changes to st+1 • Markov Assumptions • st+1 = δ(st, at) • rt = r(st, at) • δ and r are usually unknown to the learner • δ and r may also be non-deterministic
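To make the framework concrete, here is a minimal Python sketch of a deterministic MDP with the interface described on this slide (states, actions, δ and r). The one-dimensional corridor, its two actions and the reward of 100 at the goal are illustrative assumptions, not taken from the slides.

```python
# A minimal sketch of a deterministic MDP, assuming a toy 1-D corridor
# environment (states 0..n_states-1, goal at the last state). All names
# and values here are illustrative, not from the slides.

class CorridorMDP:
    """Deterministic MDP: states S = {0, ..., n_states-1}, actions A = {-1, +1}."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.goal = n_states - 1
        self.state = 0

    def reset(self, start=0):
        """Start a new episode in the given state and return it."""
        self.state = start
        return self.state

    def step(self, action):
        """Apply delta(s, a) and r(s, a); return (next_state, reward, done)."""
        next_state = min(max(self.state + action, 0), self.n_states - 1)
        reward = 100 if next_state == self.goal else 0   # reward only for reaching the goal
        self.state = next_state
        return next_state, reward, next_state == self.goal
```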
Model-free vs. Model-based Methods • Model-free Methods • Eliminate the need to know the model • Learn a policy without estimating the model. • Example: Q-Learning • Model-based Methods • The model consists of the transition probability function and the reward function. • Estimate the model from interaction and compute the policy from it. • Example: Dyna
Precise Goal • Maximization of long-term reward. • An optimal policy has to be learnt in order to do this. • Long-term reward may have different interpretations according to how the future is taken into account, i.e. different models of optimal behaviour.
Models of Optimal Behaviour • Three main models • Finite Horizon Model • Optimize the expected reward for the next h steps • rt represents the reward for the (t+1)th action performed • h = 1 represents a greedy strategy
Models of Optimal Behaviour (2) • Infinite Horizon Discounted Model • Models future rewards as being less valuable than rewards in the present • Has an infinite horizon, i.e. it considers an infinite sequence of future actions • Makes use of a discount factor γ between 0 and 1 which represents the importance of the future. • This is the model used in the Q-Learning algorithm
Models of Optimal Behaviour (3) • Average-Reward Model • The average reward per action is considered • The infinite future is taken into account • Rewards in the future are equally valuable • (The three objectives are written out below.)
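For reference, the three models can be written out as follows; these are the standard formulations from the Kaelbling, Littman and Moore survey listed in the references, with γ the discount factor and h the horizon.

```latex
% Objectives of the three models of optimal behaviour (standard formulations).
\begin{align*}
\text{Finite horizon:}              &\quad \max \; E\Big[\sum_{t=0}^{h} r_t\Big] \\
\text{Infinite horizon discounted:} &\quad \max \; E\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t\Big], \qquad 0 \le \gamma < 1 \\
\text{Average reward:}              &\quad \max \; \lim_{h \to \infty} E\Big[\frac{1}{h}\sum_{t=0}^{h} r_t\Big]
\end{align*}
```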
The Learning Task • Execute actions, observe results and • Learn a policy π : S → A maximizing Vπ(st) = rt + γ rt+1 + γ² rt+2 + … = Σi≥0 γ^i rt+i • for all states s • The infinite horizon discounted model is being used here
Value Function • For each possible policy π the agent can adopt, the value function is Vπ(s), the discounted cumulative reward rt + γ rt+1 + γ² rt+2 + … obtained by starting in state s and following π. • In terms of the value function, the learning task can be reformulated as learning the optimal policy π* such that π* = argmaxπ Vπ(s), for all s
Learning the Value Function • The value function of the optimal policy is written V*(s) = maxπ Vπ(s) • A one-step look-ahead search can be performed from any state to determine the optimal policy using π*(s) = argmaxa [r(s, a) + γ V*(δ(s, a))] • Selecting actions via V* in this way is only possible when the agent has perfect knowledge of both δ and r. • Thus the need to define the Q function arises
The Q Function • The evaluation function Q(s, a) is defined as Q(s, a) = r(s, a) + γ V*(δ(s, a)) • If the Q function is learnt, the optimal action can be selected without knowing δ and r: π*(s) = argmaxa Q(s, a)
Learning the Q Function • Notice the relationship between Q and V*: V*(s) = maxa' Q(s, a') • Rewriting the value of Q using this relationship: Q(s, a) = r(s, a) + γ maxa' Q(δ(s, a), a')
Learning the Q-value • FOR each <s, a> DO • Initialize table entry: Q̂(s, a) ← 0 • Observe current state s • WHILE (true) DO • Select action a and execute it • Receive immediate reward r • Observe new state s' • Update the table entry for Q̂(s, a) using the training rule: Q̂(s, a) ← r + γ maxa' Q̂(s', a') • Move: record the transition by setting s ← s' • (A code sketch of this loop follows below.)
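Below is a minimal Python sketch of this loop, written against the illustrative CorridorMDP environment sketched earlier. The ε-greedy action selection and the hyper-parameter values are assumptions added to make the example runnable; the slide itself does not prescribe how actions are selected.

```python
import random
from collections import defaultdict

def q_learning(env, n_episodes=200, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning for a deterministic, episodic environment."""
    Q = defaultdict(float)                       # Q-hat table, initialised to 0
    actions = [-1, +1]                           # action set of the corridor sketch
    for _ in range(n_episodes):
        # each episode starts from a random non-goal state (the goal is the last state here)
        s = env.reset(start=random.randrange(env.n_states - 1))
        done = False
        while not done:
            # epsilon-greedy selection based on the current Q estimates (an assumption)
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)        # execute a, observe r and s'
            # deterministic training rule: Q(s, a) <- r + gamma * max_a' Q(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            s = s_next                           # move: s <- s'
    return Q

# Example usage with the corridor sketch: Q = q_learning(CorridorMDP())
```

With γ = 0.9, the learned entries for this toy corridor follow the geometric pattern 100, 90, 81, … as the distance to the goal grows, which is the same kind of pattern that appears in the illustration on the next slides.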
Q Learning Illustration • Consider that an optimal strategy is being learnt for the given set of states and the rewards for actions shown below. • The reward for every action is 0 unless the action moves the agent into the goal state G, in which case it is 100. • [Figure: grid of states and actions; the transitions into the goal state G have immediate reward 100, all other transitions have reward 0.]
Q Learning Illustration (2) • The actual values of V* and Q(s, a), taking γ = 0.9, are: • [Figure: the resulting V*(s) and Q(s, a) values for each state and action.]
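As a worked entry, here is how one Q value is obtained from the definition, assuming (as in the grid example from Mitchell's textbook listed in the references) a state s whose right action leads to the state adjacent to the goal, which has V* = 100.

```latex
% One Q-table entry, computed from Q(s,a) = r(s,a) + gamma * V*(delta(s,a)),
% assuming delta(s, a_right) is the state adjacent to the goal, with V* = 100.
\begin{align*}
Q(s, a_{\text{right}}) &= r(s, a_{\text{right}}) + \gamma \, V^{*}\!\big(\delta(s, a_{\text{right}})\big) \\
                       &= 0 + 0.9 \times 100 \;=\; 90
\end{align*}
```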
Learning Step • Since an absorbing state exists, learning will consist of a series of episodes, each starting from a random state
Uncertain State Transitions • A state transition probability function P(s' | s, a) is used • It denotes the probability of a transition from state s to s' when action a is performed. • The Q function is modified to Q(s, a) = E[r(s, a)] + γ Σs' P(s' | s, a) maxa' Q(s', a')
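Since the transition probabilities are usually unknown, the table cannot be updated with the deterministic rule directly. A common sample-based alternative (the non-deterministic training rule in Mitchell's textbook, listed in the references) blends the old estimate with each new sample using a decaying learning rate; the sketch below is illustrative, and the α schedule is that rule's choice rather than something stated on the slide.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-hat table, initialised to 0
visits = defaultdict(int)     # visit count per (state, action) pair

def stochastic_q_update(s, a, r, s_next, actions, gamma=0.9):
    """One sample-based update for a non-deterministic MDP."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])                           # decaying learning rate
    target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)  # sampled backup value
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target         # blend old estimate and sample
```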
Applications • Cell Phone Channel Allocation • Cobot: A Social Reinforcement Learning Agent • Car Simulation: Using Reinforcement Learning • Network Packet Routing • Elevator Scheduling • Use of RL to improve the performance of natural language question answering systems • Reinforcement Learning Methods for Military Applications
Cell Phone Channel Allocation • Learns channel allocations for cell phones • Channels are limited • Allocations affect adjacent cells • Want to minimize dropped and blocked calls
Cell Phone Channel Allocation Cont… • States • Occupied and unoccupied channels for each cell • Availability: number of free channels for the cell • Actions • Call arrival • Evaluate possible free channels • Assign the one that has the highest value • Call termination • Free the channel • Consider reassigning each ongoing call to the just-released channel • Perform the reassignment (if any) with the highest value • Rewards and Values • Reward is the number of ongoing calls
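The call-arrival decision described above can be sketched as a simple argmax over the free channels. The value_fn and the set-of-occupied-channels encoding below are hypothetical stand-ins for illustration, not the actual representation used in the cited system.

```python
def assign_channel(occupied, free_channels, value_fn):
    """Assign the free channel whose resulting configuration has the highest learned value.

    occupied      -- set of channels currently in use in the cell (hypothetical encoding)
    free_channels -- iterable of channels available for the arriving call
    value_fn      -- learned value function over channel configurations (hypothetical)
    """
    free_channels = list(free_channels)
    if not free_channels:
        return None                                   # no free channel: the call is blocked
    # evaluate the configuration that would result from each possible assignment
    return max(free_channels, key=lambda ch: value_fn(occupied | {ch}))
```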
Performance of FA, BDCL, and RL • [Figure: performance comparison of the three methods.] • FA = fixed assignment method • BDCL = borrowing with directional channel locking
Cobot: A Social Reinforcement Learning Agent • Cobot is a software agent • Applies RL in a complex human online social chat-based environment • The goal is to interact with other members and to become a real part of their social fabric • Takes certain actions on its own initiative • Any user can reward or punish it • Cobot keeps an incremental database of "social statistics", e.g. how frequently and in what ways users interacted with one another, and provided summaries of these statistics as a service
Cobot Cont… • States • One state space corresponding to each user • Each state space contains a number of features containing statistics about that particular user • Actions • Null Action: choose to remain silent for this time period • Topic Starters: introduce a conversational topic • Social Commentary: make a comment describing the current social state of the LivingRoom, such as "It sure is quiet" or "Everyone here is friendly."
Cobot Cont… • The RL Reward Function • Reward verb, e.g. hug • Punish verb, e.g. spank • These verbs give a numerical (positive and negative, respectively) training signal to Cobot
Car Simulation: Using Reinforcement Learning • The drivers do not know the track information beforehand • The car takes appropriate actions to avoid bumping into the wall by learning from past experience and the given rewards • States • The state of the car is represented by two variables: • the distance of the car to the left wall of the track • the car's velocity towards the right wall
Car Simulation Cont… • Action • Turn left or right to steer in the right direction • Reward • Negative if the car bumps into the wall or goes backwards • Positive if the car goes in the correct direction
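A hedged sketch of the reward signal described above; the numeric magnitudes are illustrative choices, as the slides only state the sign of the reward.

```python
def car_reward(bumped_wall, going_backwards, correct_direction):
    """Reward for the car simulation as described above (magnitudes are assumptions)."""
    if bumped_wall or going_backwards:
        return -1.0    # negative reward for hitting the wall or going backwards
    if correct_direction:
        return +1.0    # positive reward for progressing in the correct direction
    return 0.0         # otherwise neutral (an assumption; not specified on the slide)
```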
Conclusion • The basic reinforcement learning model consists of: • a set of environment states S • a set of actions A and • a set of scalar "rewards" • The learner takes actions in an environment so as to maximize its long-term reward • Any problem domain that can be cast as a Markov decision process can potentially benefit from this technique. • Unlike supervised learning, reinforcement learning systems do not require explicit input-output pairs for training
References • Smart, W.: "Reinforcement Learning: A User's Guide". Department of Computer Science and Engineering, Washington University in St. Louis, ICAC 2005. http://www.cse.wustl.edu/~wds/ • Mitchell, T.: Machine Learning. McGraw Hill, 1997. • Harmon, M., Harmon, S.: "Reinforcement Learning: A Tutorial". Wright State University, 2000. • Sutton, R.: "Reinforcement Learning: Tutorial". AT&T Labs. http://www.cs.ualberta.ca/~sutton/Talks/RL-Tutorial/RL-Tutorial.ppt • Kaelbling, L.P., Littman, M.L., and Moore, A.W.: "Reinforcement Learning: A Survey". Journal of Artificial Intelligence Research, 4, 1996. • http://en.wikipedia.org/wiki/Reinforcement_learning • Isbell, C., Shelton, C., Kearns, M., Singh, S., and Stone, P.: "A Social Reinforcement Learning Agent". In Proceedings of the Fifth International Conference on Autonomous Agents (AGENTS), pages 377-384, 2001. Winner of the Best Paper Award.