Using MDP Characteristics to Guide Exploration in Reinforcement Learning Paper: Bohdana Ratitch & Doina Precup Presenter: Michael Simon Some pictures/formulas gratefully borrowed from slides by Ratitch
MDP Terminology • Transition probabilities - P^a_{s,s'} • Expected reward - R^a_{s,s'} • Return
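For reference, the standard definitions behind this notation (the discount factor γ and the return symbol G_t are not on the slide and are assumed here):

```latex
P^{a}_{s,s'} = \Pr\{ s_{t+1} = s' \mid s_t = s,\ a_t = a \}, \qquad
R^{a}_{s,s'} = \mathbb{E}[\, r_{t+1} \mid s_t = s,\ a_t = a,\ s_{t+1} = s' \,], \qquad
G_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1}
```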
Reinforcement Learning • Learning only from environmental rewards • Goal: achieve the best payoff possible • Must balance exploitation with exploration • Exploration can take large amounts of time • The structure of the problem/model can assist exploration, in theory • But which characteristics help in our MDP case?
Goals/Approach • Find MDP Characteristics... • ... that affect performance... • ... and test on them. • Use MDP Characteristics... • ... to tune parameters. • ... to select algorithms. • ... to create strategy.
Back to RL • Undirected exploration • Guarantees sufficient exploration • Simple, but can take exponential time • Directed exploration • Extra computation/storage, but possibly polynomial time • Often uses aspects of the model to its advantage
RL Methods - Undirected • ε-greedy exploration • With probability 1-ε, exploit: take the current best greedy action • With probability ε, explore: select an action uniformly at random • Boltzmann distribution (softmax over action values)
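A minimal sketch of these two undirected rules in their standard forms (not necessarily the exact parameterization used in the paper; `q`, `epsilon`, and `temperature` are illustrative names):

```python
import numpy as np

def epsilon_greedy(q, epsilon, rng=np.random.default_rng()):
    """With probability 1-epsilon take the greedy action, else a uniformly random one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q)))      # explore
    return int(np.argmax(q))                  # exploit

def boltzmann(q, temperature, rng=np.random.default_rng()):
    """Sample an action with probability proportional to exp(Q(s,a)/temperature)."""
    prefs = np.asarray(q, dtype=float) / temperature
    prefs -= prefs.max()                      # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q), p=probs))
```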
RL Methods - Directed • Maximize the value estimate plus an exploration bonus • Different options for the bonus term: • Counter-based (favor least frequently taken actions) • Recency-based (favor least recently taken actions) • Error-based (favor actions whose value estimates are changing most) • Interval Estimation (favor actions with the highest variance in sampled returns)
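A sketch of two common bonus choices, counter-based and recency-based, in their standard forms; the `scale` constant and the exact bonus shapes are assumptions, not the paper's definitions:

```python
import numpy as np

def directed_action(q, counts, last_taken, t, kind="counter", scale=1.0):
    """Pick the action maximizing Q(s,a) plus a directed exploration bonus.

    counts[a]     - how many times action a was taken in this state
    last_taken[a] - time step at which action a was last taken
    """
    q = np.asarray(q, dtype=float)
    if kind == "counter":                # favor least frequently taken actions
        bonus = scale / (1.0 + np.asarray(counts, dtype=float))
    elif kind == "recency":              # favor least recently taken actions
        bonus = scale * np.sqrt(t - np.asarray(last_taken, dtype=float))
    else:
        bonus = np.zeros_like(q)
    return int(np.argmax(q + bonus))
```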
Properties of MDPs • State Transition Entropy • Controllability • Variance of Immediate Rewards • Risk Factor • Transition Distance • Transition Variability
State Transition Entropy (STE) • Stochasticity of state transitions • High STE = good coverage during exploration • But also higher variance in the samples • High STE = more samples needed
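STE is naturally read as the entropy of the next-state distribution; any normalization the paper applies is omitted here:

```latex
\mathrm{STE}(s,a) \;=\; -\sum_{s'} P^{a}_{s,s'} \,\log P^{a}_{s,s'}
```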
Controllability - Calculation • How much the environment's response differs depending on the chosen action • Can also be thought of as normalized information gain
Controllability - Usage • High controllability • The agent's actions exert real control over outcomes • Different actions lead to different parts of the state space • More variance = more sampling needed • Prefer actions leading to controllable states • Actions ranked by Forward Controllability (FC), as sketched below
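One plausible form of Forward Controllability, assuming it is the transition-weighted expected controllability C(s') of the successor states (the slide does not give the formula):

```latex
\mathrm{FC}(s,a) \;=\; \sum_{s'} P^{a}_{s,s'} \, C(s')
```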
Proposed Method - Undirected • Explore with a state-action-dependent probability built from a base ε and STE/FC terms weighted by K1, K2 (see the sketch below) • For experiments • K1, K2 ∈ {0, 1}; remaining weight fixed to 1; ε ∈ {0.1, 0.4, 0.9}
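A minimal sketch of one plausible reading of this rule (the exact formula was lost from the slide): scale a base ε by STE and FC terms weighted by K1 and K2. The scaling form and the clipping to [0, 1] are assumptions.

```python
import numpy as np

def explore_probability(base_eps, ste, fc, k1=1.0, k2=1.0):
    """Illustrative, not the paper's exact rule: raise the exploration
    probability in state-action pairs with high STE and high FC,
    clipped so it remains a valid probability."""
    return float(np.clip(base_eps * (1.0 + k1 * ste + k2 * fc), 0.0, 1.0))
```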
Proposed Method - Directed • Pick the action maximizing the value estimate plus an exploration bonus combining STE, FC, and a recency term, weighted by K0-K3 (see the sketch below) • For experiments • K0 ∈ {1, 10, 50}, K1, K2 ∈ {0, 1}, K3 = 1 • The K3-weighted term is recency-based
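A sketch of one plausible reading of the directed rule, with K0 as the overall bonus scale and K1-K3 weighting the STE, FC, and recency terms; the exact combination is an assumption, not confirmed by the slide.

```python
import numpy as np

def directed_pick(q, ste, fc, recency, k0=10.0, k1=1.0, k2=1.0, k3=1.0):
    """Illustrative reading of the directed rule: maximize the action value
    plus a K0-scaled bonus mixing STE (K1), FC (K2), and a recency term (K3)."""
    q = np.asarray(q, dtype=float)
    bonus = k0 * (k1 * np.asarray(ste, dtype=float)
                  + k2 * np.asarray(fc, dtype=float)
                  + k3 * np.asarray(recency, dtype=float))
    return int(np.argmax(q + bonus))
```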
Experiments • Random MDPs • 225 states • 3 actions • Branching factor of 1-20 • Transition probabilities and rewards drawn uniformly from [0,1] • 0.01 chance of termination • Divided into 4 groups • Low STE vs. high STE • High variation (test) vs. low variation (control)
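A sketch of a generator matching the stated setup; how the transition weights are normalized and how the branching factor is applied are assumptions:

```python
import numpy as np

def random_mdp(n_states=225, n_actions=3, max_branch=20, p_term=0.01,
               rng=np.random.default_rng(0)):
    """Random MDP roughly matching the experimental setup: each (s, a) has
    1-20 successor states with uniform-[0,1] weights (normalized here),
    uniform-[0,1] rewards, and a 0.01 chance of termination."""
    P = np.zeros((n_states, n_actions, n_states))
    R = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            branch = rng.integers(1, max_branch + 1)
            succ = rng.choice(n_states, size=branch, replace=False)
            w = rng.uniform(0.0, 1.0, size=branch)
            P[s, a, succ] = w / w.sum()
    return P, R, p_term
```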
Experiments Continued • Performance Measures • Return estimates • Run the greedy policy from 50 different states, 30 trials per state, average the returns, and normalize • Penalty measure • Rmax = upper limit on the return of the optimal policy • Rt = normalized greedy return after trial t • T = number of trials
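Given these quantities, a natural form for the penalty measure (the exact formula is not on the slide) is the average per-trial shortfall from Rmax:

```latex
\text{Penalty} \;=\; \frac{1}{T} \sum_{t=1}^{T} \left( R_{\max} - R_t \right)
```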
Discussion • Significant results obtained when using STE and FC • Results correspond with the presence of STE • Values can be calculated prior to learning • But this requires model knowledge • Rug sweeping and other judgement calls • SARSA