LEAP Algorithm: Reinforcement Learning with Adaptive Partitioning Tsufit Cohen, Eyal Radiano Supervised by: Andrey Bernstein
Agenda • Intro • Q-learning • LEAP Algorithm • Simulation • LEAP vs Q-learning • Conclusions
Intro • Reinforcement Learning • Learn the optimal policy by trial and error • Reward for "good" steps • Performance improves with experience • (diagram label: agent)
Q-learning • Definitions (following p. 6 of the paper): the action-value function Q(s,a) estimates the expected cumulative reward of taking action a in state s and acting optimally thereafter • Update rule: Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)] • Key specification: table representation • Exploration policy: epsilon-greedy
LEAP Learning Entity (LE) Adaptive Partitioning • Key specifications : • Macro States • Multi Partitioning (each partition is called LE) • Pruning and Joining
Algorithm • Action Selection • Incoherence Criterion • JLE Generation • Update • Pruning Mechanism
Changes and Add-ons to the Algorithm • Change the order of pruning and updating • Epsilon Greedy policy starts from 0.9 • Boundary condition – Q=0 for End of game.
Implementation • Key operations: • Finding the active LE list for a given state • Finding a macro state within an LE • Adding/removing a JLE and/or a macro state • Data structures: • Basic LE: CList<macrostate> Macro_list, Int* ID_arr_, Int order • JLE (inherits from Basic LE): adds CList<JLE>* Sons_lists_arr
General Data Structure Implementation • (diagram) Basic LE array: Basic LE 1, Basic LE 2, Basic LE 3 • Basic LE 1, magnified: macro list, ID array, order, and a sons-list array holding one pointer per order: the order-1 JLE list (empty), the order-2 JLE list, and the order-3 JLE list
3D Grid World Implementation Example • (diagram) Basic LE array: Basic LE X, Basic LE Y, Basic LE Z, each with a sons-list array indexed 0, 1, 2 • Order-2 JLEs: XY, YZ, XZ; order-3 JLE: XYZ
Simulation 1 – 2D Grid World • (diagram: the grid partitioned by x and by y; start point and prize marked) • Environment properties: • Size: 20x20 • Step cost: -1 • Award: +2 • Available moves: up, down, left, right • Wall bumping: no movement • Award taking: starts a new episode • Basic LEs: X, Y
Simulation 1 Results – Policy • (figure: the learned policy on the grid; start and prize marked)
Simulation 2 – Grid World with an Obstacle • (diagram: start and prize marked) • Environment properties: • Size: 5x5 • Step cost: -1 • Award: +2 • Obstacle: -3
Simulation 2 Results • (figure: the learned policy; start marked) • Note: the policy changes between runs due to the epsilon-greedy exploration
Conclusions • Memory reduction relative to a full Q-table • Increased implementation complexity • Some deviation from the optimal policy