LEAP Algorithm: Reinforcement Learning with Adaptive Partitioning Tsufit Cohen, Eyal Radiano Supervised by: Andrey Bernstein
Agenda • Intro • Q-learning • LEAP Algorithm • Simulation • LEAP vs Q-learning • Conclusions
Intro • Reinforcement Learning • Learn the optimal policy by trial and error • Reward for "good" steps • Performance improves with experience • (diagram label: agent)
Q-learning • Definitions (following p. 6 of the paper): the action-value function Q(s,a) estimates the expected cumulative reward of taking action a in state s and acting optimally thereafter • Update rule: Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)] • Key specification: table representation • Exploration policy: epsilon-greedy
LEAP Learning Entity (LE) Adaptive Partitioning • Key specifications : • Macro States • Multi Partitioning (each partition is called LE) • Pruning and Joining
Algorithm • Action Selection • Incoherence Criterion • JLE Generation • Update • Pruning Mechanism
Changes and Add-ons to the Algorithm • Change the order of pruning and updating • Epsilon Greedy policy starts from 0.9 • Boundary condition – Q=0 for End of game.
Implementation • Key operations: • Finding the active LE list for a given state • Finding a macro state within an LE • Adding/removing a JLE and/or a macro state • Data structures: • Basic LE: CList<macrostate> Macro_list, Int* ID_arr_, Int order • JLE (inherits from Basic LE): adds CList<JLE>* Sons_lists_arr
General Data Structure Implementation • (diagram) Basic LE array: Basic LE 1, Basic LE 2, Basic LE 3 • Basic LE 1, magnified: macro list, ID array, order, and a sons-list array holding one pointer per order: the order-1 JLE list (empty), the order-2 JLE list, and the order-3 JLE list
3D Grid World Implementation Example • (diagram) Basic LE array: Basic LE X, Basic LE Y, Basic LE Z, each with a sons-list array indexed 0, 1, 2 • Order-2 JLEs: XY, YZ, XZ; order-3 JLE: XYZ
Simulation 1 – 2D Grid World • (diagram: the grid partitioned by x and by y; start point and prize marked) • Environment properties: • Size: 20x20 • Step cost: -1 • Award: +2 • Available moves: up, down, left, right • Wall bumping: no movement • Award taking: starts a new episode • Basic LEs: X, Y
Simulation 1 Results – Policy • (figure: the learned policy on the grid; start and prize marked)
Simulation 2 – Grid World with an Obstacle • (diagram: start and prize marked) • Environment properties: • Size: 5x5 • Step cost: -1 • Award: +2 • Obstacle: -3
Simulation 2 Results • (figure: the learned policy; start marked) • Note: the policy changes between runs due to the epsilon-greedy exploration
Conclusions • Memory reduction relative to a full Q-table • Increased implementation complexity • Some deviation from the optimal policy