Control and Robotics Laboratory Electrical Engineering Faculty Technion Hierarchical Solutions in Reinforcement Learning using Graph Algorithms Project Presentation By Ben Ezair & Uri Wiener Instructor: Mr. Ishai Menache Winter 2004/5
Agenda • Motivation • Background • Description of the algorithms • The domains & experimental results • Conclusions • Future work
Motivation • Many problems can be modeled as MDPs (Markov Decision Processes). • Reinforcement learning algorithms were designed to solve MDPs when the environment model is unknown. • Q-learning is a popular algorithm within the reinforcement learning paradigm, guaranteed to converge asymptotically. • Yet, due to enormous state-spaces, Q-learning performs poorly in many real-life tasks. • We present ways to enhance the standard Q-learning algorithm using a hierarchical, graph-based approach.
Reinforcement Learning • [Figure: agent-environment loop; the agent sends an action to the environment, which returns a state and a reinforcement] • The reinforcement learning framework: the agent explores the environment; it perceives its current state and takes actions. • The environment, in return, provides a reward (which can be positive or negative).
Q-learning • The Q-learning algorithm works by estimating the values Q(s,a). • These values predict the payoff that may be obtained by taking action a from state s. Q-values are estimated on the basis of experience as follows: 1. From the current state s, select an action a. This yields an immediate payoff r and leads to a next state s'. 2. Update Q(s,a) based on this experience as follows: $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$ 3. Go to 1. Where: α is the learning rate, α ∈ (0,1]; s' is the next state; γ is the discount factor, γ ∈ [0,1].
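The three-step loop above can be sketched in a few lines of tabular code. This is a minimal illustration, not the project's implementation: the function name `q_update`, the dictionary-based table, the two-action set and the default values of `alpha` and `gamma` are all assumptions made for the example.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
    Updates Q in place and returns the new value of Q(s,a)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

# Hypothetical usage: an empty table (all zeros) and two made-up actions.
Q = defaultdict(float)
actions = ["left", "right"]
new_value = q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
```

With an all-zero table, the first update moves Q(0, "right") a fraction `alpha` of the way towards the observed payoff.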
Description of policies • Explore policy: the least explored action in the current state is chosen. • Exploit policy: the action with the highest Q-value is chosen. • ε-greedy policy: with probability ε (0 < ε < 1), use the explore policy; otherwise use the exploit policy. • Throughout our experiments an ε value of 0.3 was used.
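A possible sketch of this action-selection rule follows. The visit-count table `counts`, the function name `choose_action` and the tie-breaking behaviour (first action in list order) are assumptions for illustration; only the explore/exploit split and ε = 0.3 come from the slide.

```python
import random
from collections import defaultdict

def choose_action(Q, counts, s, actions, epsilon=0.3, rng=random):
    """Pick an action in state s and record that it was tried."""
    if rng.random() < epsilon:
        # Explore policy: the action tried least often in this state.
        a = min(actions, key=lambda act: counts[(s, act)])
    else:
        # Exploit policy: the action with the highest Q-value.
        a = max(actions, key=lambda act: Q[(s, act)])
    counts[(s, a)] += 1
    return a

# Hypothetical tables for a single state with three made-up actions.
Q = defaultdict(float)
counts = defaultdict(int)
action = choose_action(Q, counts, s=0, actions=["a", "b", "c"])
```

Note that exploration here is directed (least-tried action) rather than uniformly random, which is what distinguishes it from the more common ε-greedy variant.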
The use of options • "Options", also known as "macro-actions", are sets of actions defined for multiple states in the state-space. They are designed to bring the agent to a certain state (or set of states). For example:
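One way to picture an option is as a partial policy plus a set of target states. The sketch below assumes exactly that representation; the `Option` class, `run_option`, and the five-state corridor with a "doorway" at state 4 are hypothetical and serve only to show an option driving the agent to a target.

```python
from dataclasses import dataclass

@dataclass
class Option:
    policy: dict        # state -> primitive action, defined only where the option applies
    targets: frozenset  # states at which the option terminates

    def can_start(self, s):
        return s in self.policy and s not in self.targets

def run_option(option, s, step, max_steps=100):
    """Follow the option from s until a target is reached or the policy is undefined."""
    trajectory = [s]
    for _ in range(max_steps):
        if s in option.targets or s not in option.policy:
            break
        s = step(s, option.policy[s])
        trajectory.append(s)
    return trajectory

def step(s, a):
    """Hypothetical deterministic corridor with states 0..4."""
    return max(0, min(4, s + (1 if a == "right" else -1)))

go_to_door = Option(policy={s: "right" for s in range(4)}, targets=frozenset({4}))
path = run_option(go_to_door, 1, step)
```

From the learner's point of view, the whole trajectory counts as a single (temporally extended) action choice.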
The K-cluster algorithm • Clustering algorithm • The algorithm aims at maximizing: • Where: • g() is a function that defines how well the two clusters are separated, • f() is a function that defines the quality of a cluster.
The K-cluster algorithm - continued • The function g may also account for Q-value differences between the two clusters, making it more likely that clusters with similar Q-values will be merged. For example:
The K-cluster algorithm - continued • We chose an approximation method in which we attempt to maximize the clustering score by removing its smallest element in each step. • This approximation dramatically reduces the complexity of the clustering process, allowing us to deal with larger state-spaces.
Cut algorithm • Performed by running a max-flow/min-cut algorithm using a graph derived from the state space. • The algorithm examines the quality of the bottlenecks found according to a quality factor defined as: • If this quality factor exceeds a predetermined value, then options are set to reach the bottlenecks. Otherwise, no options are set and the cut algorithm should be run again later.
Cut algorithm - continued • Once the first cut is made successfully, we recursively call the cut algorithm separately for all states on either side of the bottlenecks. Example of a conversion of a maze into a graph: (both possible bottlenecks are highlighted)
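The bottleneck search described above can be illustrated with a small max-flow/min-cut routine. This is a sketch under stated assumptions, not the project's code: it uses the Edmonds-Karp variant of max-flow, treats the state graph as undirected, and gives every edge unit capacity, whereas the slides do not say how capacities are derived. The two-room graph joined by a single doorway edge (2, 3) is a made-up example.

```python
from collections import defaultdict, deque

def min_cut(edges, source, sink):
    """Edmonds-Karp max-flow on an undirected graph.
    Returns (flow value, sorted list of edges crossing the minimum cut)."""
    cap = defaultdict(int)
    adj = defaultdict(set)
    for u, v, c in edges:          # undirected: capacity in both directions
        cap[(u, v)] += c
        cap[(v, u)] += c
        adj[u].add(v)
        adj[v].add(u)

    def bfs_parents():
        """Breadth-first search over edges with remaining capacity."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and cap[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs_parents()
        if sink not in parent:
            break                  # no augmenting path left
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(cap[e] for e in path)
        for u, v in path:
            cap[(u, v)] -= push    # use up forward capacity
            cap[(v, u)] += push    # allow the flow to be undone later
        flow += push

    reachable = set(parent)        # source side of the cut after the last search
    cut = sorted((u, v) for u, v, _ in edges if (u in reachable) != (v in reachable))
    return flow, cut

# Two "rooms" {0, 1, 2} and {3, 4, 5} joined by one doorway edge (2, 3).
edges = [(0, 1, 1), (0, 2, 1), (1, 2, 1), (2, 3, 1), (3, 4, 1), (3, 5, 1), (4, 5, 1)]
flow, cut = min_cut(edges, source=0, sink=5)
```

In this toy graph the only cut edge is the doorway, so its endpoints are the natural targets for options; the recursive step would then repeat the search inside each room.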
Software implementation Block diagram of the software implementation we used:
Maze environments • Six-pass maze: step reward: 0; bump-wall reward: 0; noise: 10% chance of a random action • Algorithm-dependent parameters: • Kcluster: steps before calling algorithm: 3256; clusters: 5 for Ni = 10, 6 for Ni = 0; Ni: 10, 0 • Qcut: steps before calling algorithm: 2000; quality factor: 1000
Maze environments - continued • Six-pass maze experimental results (averaged over 150 runs) • [Figures: clusters and bottlenecks for K-cluster with Ni=10 and 5 clusters, K-cluster with Ni=0 and 6 clusters, and Q-cut bottlenecks]
Maze environments - continued • Six-pass maze experimental results (averaged over 150 runs) • [Plots: 1st state Q-value; steps to goal]
Maze environments - continued • Big maze: step reward: 0; bump-wall reward: 0; noise: 10% chance of a random action • Algorithm-dependent parameters: • Kcluster: steps before calling algorithm: 42475; clusters: 5; Ni: 10 • Qcut: steps before calling algorithm: 20000; quality factor: 50000
Maze environments - continued • Big maze experimental results (averaged over 150 runs) • [Figures: clusters and bottlenecks for K-cluster with Ni=10 and 5 clusters, and Q-cut bottlenecks]
Maze environments - continued • Big maze experimental results (averaged over 150 runs) • [Plots: 1st state Q-value; steps to goal]
Taxi environment • Standard taxi problem as introduced by Dietterich (2000). [Figure: taxi grid with the locations marked R, G, Y and B] • Step reward: 0; bump-wall reward: 0; noise: 10% chance of a random action • Algorithm-dependent parameters: • Kcluster: steps before calling algorithm: 11000; clusters: 20; Ni: 10 • Qcut: steps before calling algorithm: 10000; quality factor: 200
Taxi environment - continued • Taxi experimental results (averaged over 150 runs) • [Plots: 1st state Q-value; steps to goal]
Taxi environment - continued • Taxi experimental results (averaged over 150 runs) • [Plot: Kcluster's solution quality as a function of the algorithm's starting time]
Car-hill environment • The state-space is discretized uniformly into a 50x50 grid. • Algorithm-dependent parameters: • Qcut: steps before calling algorithm: 20000; quality factor: 1 (much too low to give good results) • Kcluster: steps before calling algorithm: 100000; clusters: 12; Ni: 10
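A uniform 50x50 discretization of a two-dimensional state could look like the sketch below. The position and velocity ranges are assumptions borrowed from the commonly used mountain-car formulation; the slides do not state the ranges, and the function names are hypothetical.

```python
def discretize(value, low, high, bins=50):
    """Map a continuous value in [low, high] to a bin index in 0..bins-1."""
    ratio = (value - low) / (high - low)
    index = int(ratio * bins)
    return min(max(index, 0), bins - 1)   # clamp values on or outside the edges

# Assumed ranges (not given in the slides).
POS_RANGE = (-1.2, 0.6)
VEL_RANGE = (-0.07, 0.07)

def state_index(position, velocity, bins=50):
    """Combine the two bin indices into a single discrete state number."""
    p = discretize(position, POS_RANGE[0], POS_RANGE[1], bins=bins)
    v = discretize(velocity, VEL_RANGE[0], VEL_RANGE[1], bins=bins)
    return p * bins + v

corner_low = state_index(-1.2, -0.07)
corner_high = state_index(0.6, 0.07)
```

This yields 2500 discrete states, and the number of possible transitions between them is what the remark on the following slide refers to.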
Car-hill environment - continued • Car-hill experimental results (averaged over 150 runs) • [Figure: clustering result] Note that higher speeds are towards the bottom of the figure, and positions closer to the goal are towards the right of the figure. • Remark: because of the size of the problem's transition space, running the Q-cut algorithm with a reasonable quality factor and initial conditions is not feasible.
Car-hill environment - continued • Car-hill experimental results (averaged over 150 runs) • [Plots: 1st state Q-value; 1st state Q-value standard deviation]
Description of the ODE (Open Dynamics Engine) • "The Open Dynamics Engine (ODE) is a free, industrial quality library for simulating articulated rigid body dynamics. For example, it is good for simulating ground vehicles, legged creatures, and moving objects in VR environments. It is fast, flexible and robust, and it has built-in collision detection. ODE is being developed by Russell Smith with help from several contributors." (taken from the Open Dynamics Engine user guide) • More information on ODE: http://www.ode.org
Robot environments • We experimented with 3 robot environments: • 1. 2-link robot environment • 2. 3-link dynamic robot environment • 3. 3-link static robot environment • In all 3 environments, the robot must learn to stand up. Environment screenshots:
Robot environments - continued • Standing is achieved when the agent brings the robot to a position in which the joints' angles and angular speeds, as well as the angle between the bottom link and the ground, are less than 0.05*PI (0.05*PI/sec for the speeds). • At that point, a discrete PD controller takes over and makes sure the robot keeps standing straight. • Conversion between angles/angular speeds and discrete values uses a resolution of 0.1*PI radians (or radians per second). • The agent controls the robot by giving the angular speed it wants each joint to have. An independent discrete proportional controller on each joint then tries to achieve this speed. • An episode ends when the robot successfully stands up.
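The standing test and the discretization step described above can be sketched as follows. The thresholds 0.05*PI and 0.1*PI come from the slide; rounding to the nearest multiple, measuring angles relative to the upright pose, and the function names are assumptions.

```python
import math

STAND_TOLERANCE = 0.05 * math.pi   # threshold for angles (rad) and speeds (rad/sec)
RESOLUTION = 0.1 * math.pi         # discretization step for angles and speeds

def to_discrete(x, resolution=RESOLUTION):
    """Convert an angle or angular speed to an integer bin (nearest multiple; assumed)."""
    return round(x / resolution)

def is_standing(angles, speeds, tol=STAND_TOLERANCE):
    """True when every angle and every angular speed is below the tolerance.
    Angles are assumed to be measured relative to the upright pose."""
    return all(abs(x) < tol for x in list(angles) + list(speeds))

# Hypothetical 3-link reading: two joint angles plus the ground angle, two joint speeds.
standing = is_standing([0.01, -0.02, 0.0], [0.1, 0.0])
state = tuple(to_discrete(x) for x in [0.5 * math.pi, -0.3 * math.pi])
```

Because the standing tolerance is half the discretization step, a standing pose always falls inside the bin centred on zero under this rounding scheme.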
Robot environments - continued • Rewards • The agent is rewarded when the angles and angular speeds fall below 0.1*PI (0.1*PI/sec for the speeds). • If that is not the case, then the agent is rewarded when it gets one of the links to a certain height. • The agent is negatively rewarded by a larger amount when it loses these mid-goals.
2-link robot environment • The 2-link robot has 3 links; the bottom link is so massive that it is essentially a stationary object. • The problem starts with the bottom link already in an upright position and the two joints at arbitrary angles. • The bottom link lacks the power needed to lift the two top links, so the agent has to use momentum generated by the upper link to stand up. • The four state variables used for this environment are the angles and angular speeds of the two joints. Video clip:
2-link robot environment - continued • 2-link robot experimental results (averaged over 50 runs) • Kcluster parameters: steps before calling algorithm: 214000; clusters: 6; Ni: 10 • [Plots: 1st state Q-value; steps per episode]
3-link dynamic robot environment • The five state variables are the angles and angular speeds of the two joints, as well as the angle between the bottom link and the ground. • The robot starts this problem lying down and must use leverage to get itself up. • The upper joint is very weak, so the agent has to use momentum generated by the bottom joint to bring the two upper links into position. Video clip:
3-link dynamic robot environment • 3-link dynamic robot experimental results (averaged over 10 runs) • Kcluster parameters: steps before calling algorithm: 50K, 14K; clusters: 6; Ni: 10 • [Plots: 1st state Q-value; steps per episode]
Wall-ball environment • This environment resembles air hockey. • A bat positioned at the bottom of a rectangular area can move from side to side. It is supposed to hit a ball, preventing it from falling through the bottom, and to try to make it go through an opening at the top of the rectangular area. • The bat and walls have infinite mass compared to the ball, so all impacts are completely elastic. The whole environment is also frictionless and without gravity. • The agent is rewarded when the ball goes through the gap at the top, and is negatively rewarded when the ball falls through the bottom or (with a larger negative reward) when the bat goes "out of bounds". Environment screenshots: Video clip:
Conclusions • 1. In the first part of the project we mostly dealt with domains that were almost tailored for the algorithms. • As expected, Qcut and Kcluster outperformed QL (standard Q-learning) in the examined domains. • 2. In the second part, we demonstrated the advantage of the hierarchical approach over standard Q-learning, even for domains which do not possess a clear hierarchical structure. • In domains that have small state-spaces, Qcut and Kcluster exhibited similar performance. • However, in domains with larger state-spaces and many state transitions, Q-cut is not applicable, as its polynomial complexity of O(N^3) grows to unmanageable proportions.
Future work • Use the framework we set up to simulate additional complex (dynamic) environments. • Currently Qcut or Kcluster is invoked only once. Even if later changes are detected, there is no attempt to re-cut or re-cluster the state-space. It could be beneficial to perform successive cuts/clusterings. • Qcut could still be used in large state-space domains if an approximation of the max-flow/min-cut algorithm is used instead of the original algorithm. • Improve the quality factor which is used for the Kcluster algorithm.