Advanced MDP Topics

Advanced MDP Topics Ron Parr Duke University

Value Function Approximation • Why? • Duality between value functions and policies • Softens the problems • State spaces are too big • Many problems have continuous variables • “Factored” (symbolic) representations don’t always save us • How • Can tie in to vast body of • Machine learning methods • Pattern matching (neural networks) • Approximation methods

Implementing VFA • Can’t represent V as a big vector • Use (parametric) function approximator • Neural network • Linear regression (least squares) • Nearest neighbor (with interpolation) • (Typically) sample a subset of the the states • Use function approximation to “generalize”

Basic Value Function Approximation Idea: Consider restricted class of value functions resample? Subset of states VI FA V0 V*? Alternate value iteration with supervised learning

VFA Outline 1. Initialize V0(s,w0), n=1 2. Select some s0…si 3. For each sj 4. Compute Vn(s,wn) by training w on 5. n := n+1 6. Unless Vn+1-Vn<e goto 2 If supervised learning error is “small”, then Vfinal “close” to V*.

Stability Problem Problem: Most VFA methods are unstable s2 s1 No rewards, g = 0.9: V* = 0 Example: Bertsekas & Tsitsiklis 1996

Least Squares Approximation Restrict V to linear functions: Find q s.t. V(s1) = q, V(s2) = 2q V(x) s1 s2 S Counterintuitive Result: If we do a least squares fit of q qt+1 = 1.08 qt

Unbounded Growth of V V(x) n 1 2 S

What Went Wrong? • VI reduces error in maximum norm • Least squares (= projection) non-expansive in L2 • May increase maximum norm distance • Grows max norm error at faster rate than VI shrinks it • And we didn’t even use sampling! • Bad news for neural networks… • Success depends on • sampling distribution • pairing approximator and problem

Success Stories - Linear TD • [Tsitsiklis & Van Roy 96, Bratdke & Barto 96] • Start with a set of basis functions • Restrict V to linear space spanned by bases • Sample states from current policy Space of true value functions P = Projection VI Restricted Linear Space N.B. linear is still expressive due to basis functions

Linear TD Formal Properties • Use to evaluate policies only • Converges w.p. 1 • Error measured w.r.t. stationary distribution • Frequently visited states have low error • Infrequent states can have high error

Linear TD Methods • Applications • Inventory control: Van Roy et al. • Packet routing: Marbach et al. • Used by Morgan Stanley to value options • Natural idea: use for policy iteration • No guarantees • Can produce bad policies for trivial problems [Koller & Parr 99] • Modified for better PI: LSPI [Lagoudakis & Parr 01] • Can be done symbolically [Koller & Parr 00] • Issues • Selection of basis functions • Mixing rate of process - affects k, speed

Success Story: Averagers [Gordon 95, and others…] • Pick set, Y=y1…yi of representative states • Perform VI on Y • For x not in Y, • Averagers are non expansions in max norm • Converge to within 1/(1-g) factor of “best”

Interpretation of Averagers y1 b1 b2 x y2 b3 y3

Interpretation of Averagers II Averagers Interpolate: y1 y2 Grid vertices = Y x y4 y3

General VFA Issues • What’s the best we can hope for? • We’d like to get approximate close to • How does this relate to • In practice: • We are quite happy if we can prove stability • Obtaining good results often involves an iterative process of tweaking the approximator, measuring empirical performance, and repeating…

Why I’m Still Excited About VFA • Symbolic methods often fail • Stochasticity increases branching factor • Many trivial problems have no exploitable structure • “Bad” value functions can have good performance • We can bound “badness” of value functions • By simulation • Symbolically in some cases [Koller & Parr 00; Guestrin, Koller & Parr 01; Dean & Kim 01] • Basis function selection can be systematized

Hierarchical Abstraction • Reduce problem in to simpler subproblems • Chain primitive actions into macro-actions • Lots of results that mirror classical results • Improvements dependent on user-provided decompositions • Macro-actions great if you start with good macros • See Dean & Lin, Parr & Russell, Precup, Sutton & Singh, Schmidhuber & Weiring, Hauskrecht et al., etc.

Advanced MDP Topics

Advanced MDP Topics

Presentation Transcript

Advanced Topics

MDP

MDP

Advanced topics

Advanced Topics

Advanced Topics

Advanced Topics and Recent Topics

Advanced Topics:

Advanced Topics

Advanced Topics

REDCap Advanced Topics

Advanced Topics

Advanced Topics

Advanced Topics

ADVANCED TOPICS

Advanced Topics

Advanced Topics

DCM: Advanced Topics

Advanced Topics

MDP

Advanced Topics

Advanced topics