Hybrid Agent-Based Modeling: Architectures, Analyses and Applications (Stage One) Li, Hailin
Outline Introduction Least-Squares Method for Reinforcement Learning Evolutionary Algorithms For RL Problem (in progress) Technical Analysis based upon hybrid agent-based architecture (in progress) Conclusion (Stage One)
Introduction • Learning From Interaction • Interact with the environment • Consequences of actions to achieve goals • No explicit teacher, but experience • Examples • A chess player in a game • Someone preparing food • The actions of a gazelle calf after it is born
Introduction • Characteristics • Decision making in uncertain environment • Actions • Affect the future situation • Effects cannot be fully predicted • Goals are explicit • Use experience to improve performance
Introduction • What is to be learned • A mapping from situations to actions • Maximizes a scalar reward or reinforcement signal • Learning • Does not need to be told which actions to take • Must discover which actions yield the most reward by trying them
Introduction • Challenge • Action may affect not only immediate reward but also the next situation, and consequently all subsequent rewards • Trial and error search • Delayed reward
Introduction • Exploration and exploitation • Exploit what it already knows in order to obtain reward • Explore in order to make better action selections in the future • Neither can be pursued exclusively without failing at the task • Trade-off
Introduction • Components of an agent • Policy • Decision-making function • Reward (total reward, average reward, discounted reward) • Good and bad events for the agent • Value • Rewards in the long run • Model of environment • Behavior of the environment
Introduction • Markov Property & Markov Decision Processes • “Independence of path”: all that matters is contained in the current state signal • A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP) • An MDP with finite state and action sets is a finite MDP
Introduction • Three categories of methods for solving the reinforcement learning problem • Dynamic programming • Requires a complete and accurate model of the environment • A full backup operation on each state (a small sketch follows below) • Monte Carlo methods • A backup for each state based on the entire sequence of observed rewards from that state until the end of the episode • Temporal-difference learning • Approximate the optimal value function and view the approximation as an adequate guide
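As a minimal illustration of the dynamic-programming case, the sketch below performs full backups with a known model; the array layout (P[a] as a transition matrix, R[s, a] as expected reward) is an assumption made for the example, not notation from the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Full backups over every state, assuming the model (P, R) is known.

    P[a] : (n_states, n_states) transition matrix for action a
    R    : (n_states, n_actions) expected immediate rewards
    """
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        # One full backup: consider every action and every successor state.
        Q = np.stack([R[:, a] + gamma * P[a] @ V for a in range(n_actions)], axis=1)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)   # optimal values and greedy actions
        V = V_new
```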
LS Method for Reinforcement Learning • For a stochastic dynamic system $x_{t+1} = f(x_t, a_t, w_t)$, where $x_t$ is the current state, $a_t$ is the control decision generated by the policy, and $w_t$ is a disturbance independently sampled from some fixed distribution • The MDP can be denoted by a quadruple $(S, A, P, R)$: $S$ is the state set, $A$ is the action set, $P(s' \mid s, a)$ is the state transition probability, and $R$ denotes the reward function • The policy is a mapping $\pi : S \rightarrow A$, and the resulting state sequence $\{x_t\}$ is a Markov chain
LS Method for Reinforcement Learning • For each policy $\pi$, the value function is defined by the equation $V^\pi(x) = E\!\left[\sum_{t=0}^{\infty} \gamma^t R(x_t, \pi(x_t)) \,\middle|\, x_0 = x\right]$ • The optimal value function is defined by $V^*(x) = \max_\pi V^\pi(x)$
LS Method for Reinforcement Learning • The optimal action can be generated through $a^*(x) = \arg\max_a \sum_{x'} P(x' \mid x, a)\left[R(x, a) + \gamma V^*(x')\right]$, which requires the transition model • Introducing the Q value function $Q^*(x, a) = R(x, a) + \gamma \sum_{x'} P(x' \mid x, a)\, V^*(x')$, the optimal action can now be generated through $a^*(x) = \arg\max_a Q^*(x, a)$
LS Method for Reinforcement Learning • The exact Q-values for all state-action pairs can be obtained by solving the Bellman equations (full backups): $Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, Q^\pi(s', \pi(s'))$, or, in matrix format, $Q^\pi = R + \gamma P \Pi_\pi Q^\pi$, where $P(s' \mid s, a)$ denotes the transition probability from $s$ to $s'$ under action $a$
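When the model is available these Bellman equations are just a linear system; the following sketch solves them directly for a fixed deterministic policy (the (state, action) indexing scheme is an illustrative assumption).

```python
import numpy as np

def exact_q_values(P, R, policy, gamma=0.95):
    """Solve Q = R + gamma * P * Pi_policy * Q exactly for a fixed policy.

    P[a]      : (n_states, n_states) transition matrix for action a
    R         : (n_states, n_actions) expected immediate rewards
    policy[s] : deterministic action chosen in state s
    """
    n_states, n_actions = R.shape
    n = n_states * n_actions
    P_Pi = np.zeros((n, n))           # transition from (s, a) to (s', policy(s'))
    for a in range(n_actions):
        for s in range(n_states):
            row = s * n_actions + a
            for s_next in range(n_states):
                col = s_next * n_actions + policy[s_next]
                P_Pi[row, col] += P[a][s, s_next]
    q = np.linalg.solve(np.eye(n) - gamma * P_Pi, R.reshape(n))
    return q.reshape(n_states, n_actions)
```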
LS Method for Reinforcement Learning • Traditional Q-learning: a popular variant of temporal-difference learning used to approximate Q value functions in the absence of a model of the MDP, using sample data $(s_t, a_t, r_t, s_{t+1})$ • The temporal difference is defined as $d_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)$ • For one-step Q-learning, the update equation is $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha_t\, d_t$
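A small sketch of that one-step tabular update (the step size and discount values are illustrative defaults):

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One-step tabular Q-learning: move Q(s, a) toward the TD target."""
    td = r + gamma * np.max(Q[s_next]) - Q[s, a]   # temporal difference d_t
    Q[s, a] += alpha * td
    return Q
```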
LS Method for Reinforcement Learning • The final decision based upon Q-learning: $\pi(s) = \arg\max_a Q(s, a)$ • The reasons for the development of approximation methods: the size of the state-action space and the overwhelming requirement for computation • The categories of approximation methods for machine learning: Model Approximation, Policy Approximation, Value Function Approximation
LS Method for Reinforcement Learning • Model-Free Least-Squares Q-learning • Linear function approximator: $\hat{Q}(s, a; w) = \sum_{i=1}^{k} \phi_i(s, a)\, w_i = \phi(s, a)^\top w$, where the $\phi_i$ are basis functions and $w$ is a vector of scalar weights
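One possible choice of basis, sketched below, uses Gaussian RBF state features replicated once per action; the block layout and width are assumptions for illustration, not the exact basis used in the study.

```python
import numpy as np

def make_rbf_basis(centers, width, n_actions):
    """Gaussian RBF features over the state, one block per action."""
    centers = np.asarray(centers)
    k = len(centers)
    def phi(s, a):
        feats = np.exp(-np.sum((centers - s) ** 2, axis=1) / (2 * width ** 2))
        out = np.zeros(k * n_actions)
        out[a * k:(a + 1) * k] = feats   # block belonging to the chosen action
        return out
    return phi

# The approximate value is then the inner product of features and weights:
# q_value = phi(s, a) @ w
```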
LS Method for Reinforcement Learning • For a fixed policy $\pi$, the weight vector satisfies the linear system $A\, w^\pi = b$, where $A$ is a $k \times k$ matrix • If the model of the MDP is available, $A = \Phi^\top (\Phi - \gamma P \Pi_\pi \Phi)$ and $b = \Phi^\top R$
LS Method for Reinforcement Learning • The policy weights are obtained as $w^\pi = A^{-1} b$ • If the model of the MDP is not available (model-free), $A$ and $b$ are estimated from given samples $(s, a, r, s')$: $\hat{A} \leftarrow \hat{A} + \phi(s, a)\left(\phi(s, a) - \gamma\, \phi(s', \pi(s'))\right)^\top$ and $\hat{b} \leftarrow \hat{b} + \phi(s, a)\, r$
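A minimal sample-based sketch of this estimate (the LSTDQ step of LSPI), assuming transitions arrive as (s, a, r, s') tuples and phi is a basis like the one above:

```python
import numpy as np

def lstdq(samples, phi, policy, k, gamma=0.95):
    """Estimate the weights of the approximate Q-function for a fixed policy."""
    A = np.zeros((k, k))
    b = np.zeros(k)
    for s, a, r, s_next in samples:
        f = phi(s, a)
        f_next = phi(s_next, policy(s_next))
        A += np.outer(f, f - gamma * f_next)   # A-hat update
        b += f * r                             # b-hat update
    return np.linalg.solve(A, b)
```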
LS Method for Reinforcement Learning • The optimal policy can be found as $\pi(s) = \arg\max_a \phi(s, a)^\top w$ • The greedy policy is represented by the parameter vector $w$ and can be determined on demand for any given state
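Combined with the lstdq sketch above, the greedy policy needs no explicit table, and policy iteration alternates evaluation and improvement until the weights stabilize; this is a sketch of the idea, not the authors' exact implementation.

```python
import numpy as np

def greedy_action(s, w, phi, actions):
    """The greedy policy is represented only by the weight vector w."""
    return max(actions, key=lambda a: phi(s, a) @ w)

def lspi(samples, phi, k, actions, gamma=0.95, n_iter=20, tol=1e-4):
    """Alternate LSTDQ evaluation and greedy improvement until w stops changing."""
    w = np.zeros(k)
    for _ in range(n_iter):
        policy = lambda s, w=w: greedy_action(s, w, phi, actions)
        w_new = lstdq(samples, phi, policy, k, gamma)   # lstdq from the sketch above
        done = np.linalg.norm(w_new - w) < tol
        w = w_new
        if done:
            break
    return w
```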
LS Method for Reinforcement Learning • Simulation • The system is hard to model but easy to simulate • Simulation implicitly indicates the features of the system in terms of the state-visiting frequency • Orthogonal least-squares algorithm for training an RBF network • A systematic learning approach for solving the center-selection problem • The newly added center always maximizes the amount of energy of the desired network output (see the sketch below)
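A simplified sketch of that greedy center-selection idea: at each step the candidate whose orthogonalized regressor explains the largest share of the remaining output energy is added. The data layout, width, and stopping rule are illustrative assumptions, not the full algorithm.

```python
import numpy as np

def ols_select_centers(X, y, candidates, width, n_centers):
    """Forward selection of RBF centers in the spirit of orthogonal least squares."""
    def rbf_col(c):
        return np.exp(-np.sum((X - c) ** 2, axis=1) / (2 * width ** 2))

    selected, basis = [], []
    residual = np.asarray(y, dtype=float).copy()
    for _ in range(n_centers):
        best_c, best_gain, best_q = None, -np.inf, None
        for c in candidates:
            col = rbf_col(np.asarray(c))
            for q in basis:                      # orthogonalize against chosen regressors
                col = col - (q @ col) * q
            norm = np.linalg.norm(col)
            if norm < 1e-10:                     # already spanned (or duplicate center)
                continue
            q = col / norm
            gain = (q @ residual) ** 2           # output energy explained by this center
            if gain > best_gain:
                best_c, best_gain, best_q = c, gain, q
        if best_c is None:
            break
        selected.append(best_c)
        basis.append(best_q)
        residual = residual - (best_q @ residual) * best_q
    return selected
```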
LS Method for Reinforcement Learning • Hybrid Least-Squares Method [Block diagram: the agent's actions, states, and rewards from the environment feed a simulation with orthogonal least-squares regression, which produces the feature configuration used by the Least-Squares Policy Iteration (LSPI) algorithm to yield the optimal policy]
Simulation: Cart-Pole System
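For reference, a minimal sketch of the standard cart-pole dynamics used in such simulations; the physical parameters are common textbook defaults assumed here, not values taken from the slides.

```python
import numpy as np

def cart_pole_step(state, force, dt=0.02, m_cart=1.0, m_pole=0.1,
                   half_len=0.5, g=9.8):
    """One Euler step of the classic cart-pole dynamics."""
    x, x_dot, theta, theta_dot = state
    sin_t, cos_t = np.sin(theta), np.cos(theta)
    total_mass = m_cart + m_pole
    temp = (force + m_pole * half_len * theta_dot ** 2 * sin_t) / total_mass
    theta_acc = (g * sin_t - cos_t * temp) / (
        half_len * (4.0 / 3.0 - m_pole * cos_t ** 2 / total_mass))
    x_acc = temp - m_pole * half_len * theta_acc * cos_t / total_mass
    return np.array([x + dt * x_dot,
                     x_dot + dt * x_acc,
                     theta + dt * theta_dot,
                     theta_dot + dt * theta_acc])
```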
Conclusion (Stage One) • From the reinforcement learning perspective, the intractability of exact solutions to sequential decision problems requires value function approximation methods • At present, linear function approximators are the best alternative as an approximation architecture, mainly due to their transparent structure • Model-free least-squares policy iteration (LSPI) is a promising algorithm that uses a linear approximation architecture to achieve policy optimization in the spirit of Q-learning, and may converge in surprisingly few iterations • Inspired by the orthogonal least-squares regression method for selecting the centers of an RBF neural network, a new hybrid learning method for LSPI can produce a more robust and human-independent solution