Temporal Difference Learning Mark Romero – 11/03/2011
Introduction
• Temporal Difference (TD) Learning combines ideas from Monte Carlo methods and Dynamic Programming
• Like Monte Carlo methods, it still samples the environment by following some policy
• Like Dynamic Programming, it determines the current estimate from previous estimates
• Predictions are adjusted over time to match other, more accurate predictions
• TD Learning is popular for its simplicity and its suitability for on-line applications
MC vs TD
• Constant-α MC: $V(s_t) \leftarrow V(s_t) + \alpha\,[R(t) - V(s_t)]$
R(t): actual return (cumulative reward) following time t
α: constant step-size parameter
Because the actual return is used, we must wait until the end of the episode to determine the update to V.
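As a rough illustration of why constant-α MC has to wait for the episode to finish, here is a minimal Python sketch. The function name `mc_update`, the dictionary representation of V, the every-visit variant, and the default `gamma=1.0` are assumptions made for this example and do not come from the slides.

```python
def mc_update(V, states, rewards, alpha=0.1, gamma=1.0):
    """Constant-alpha, every-visit MC update from one *finished* episode.

    states[t] is s_t and rewards[t] is the reward received after leaving s_t.
    V is a dict mapping state -> value estimate (hypothetical representation).
    """
    G = 0.0  # return accumulated backwards from the end of the episode
    for s, r in zip(reversed(states), reversed(rewards)):
        G = r + gamma * G              # actual return following state s
        V[s] += alpha * (G - V[s])     # V(s) <- V(s) + alpha * [R - V(s)]
    return V


# Usage: a two-step episode A -> B -> terminal with a final reward of 1.
V = {"A": 0.0, "B": 0.0}
mc_update(V, states=["A", "B"], rewards=[0.0, 1.0], alpha=0.1)
```

The return `G` for the first state cannot be computed until the last reward is known, which is the waiting cost described above.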
MC vs TD
• TD(0): $V(s_t) \leftarrow V(s_t) + \alpha\,[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
$r_{t+1}$: observed reward
γ: discount rate
The TD method waits only until the next time step. At time t+1 a target can be formed and an update made using the observed reward, $r_{t+1}$, and the estimate $V(s_{t+1})$.
In effect, TD(0) targets $r_{t+1} + \gamma V(s_{t+1})$ instead of the R(t) used by the MC method.
This is called bootstrapping because the update is based in part on an existing estimate.
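A single TD(0) update can be sketched in a few lines of Python. The function name `td0_update`, the dictionary representation of V, and the `terminal` flag are illustrative assumptions, not part of the original slides.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """Apply one TD(0) update to V[s] after observing (s, r, s_next).

    The target bootstraps from the current estimate V[s_next]; a terminal
    next state is assumed to have value 0.
    """
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V[s]


# Usage: one transition A -> B with reward 0.
V = {"A": 0.5, "B": 0.8}
td0_update(V, "A", r=0.0, s_next="B", alpha=0.1, gamma=0.9)
# target = 0 + 0.9 * 0.8 = 0.72, so V["A"] moves from 0.5 toward 0.72
```

Unlike the MC update, this needs only the next reward and the next state's current estimate, so it can run after every step.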
Pseudo Code
Initialize V(s) arbitrarily, and π to the policy to be evaluated
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a <- action given by π for s
        Take action a; observe reward r and next state s’
        V(s) <- V(s) + α[r + γV(s’) - V(s)]
        s <- s’
    until s is terminal
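The pseudo code above might be turned into runnable Python roughly as follows. This is only a sketch: the function `td0_policy_evaluation`, the `step(s, a) -> (reward, next_state)` interface, the use of `None` to mark a terminal state, and the toy three-state chain are all assumptions introduced for illustration.

```python
from collections import defaultdict


def td0_policy_evaluation(step, policy, start, episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) evaluation of a fixed policy.

    step(s, a) returns (reward, next_state); next_state is None when the
    episode terminates (a convention assumed for this sketch).
    """
    V = defaultdict(float)  # "arbitrary" initialization: every state starts at 0
    for _ in range(episodes):
        s = start
        while s is not None:                        # until s is terminal
            a = policy(s)                           # action given by pi for s
            r, s_next = step(s, a)                  # take action, observe r and s'
            v_next = 0.0 if s_next is None else V[s_next]
            V[s] += alpha * (r + gamma * v_next - V[s])
            s = s_next
    return V


def chain_step(s, a):
    """Hypothetical deterministic chain 0 -> 1 -> 2 -> terminal.

    The action is ignored; the only reward (1.0) is on the final transition.
    """
    if s == 2:
        return 1.0, None
    return 0.0, s + 1


V = td0_policy_evaluation(chain_step, policy=lambda s: "right",
                          start=0, episodes=100, alpha=0.5)
```

With γ = 1 every state in this chain has a true value of 1, so the estimates in `V` should all approach 1 as the number of episodes grows.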
Advantages over MC
• Lends itself naturally to on-line applications
• MC must wait until the end of the episode to adjust its value estimates; TD needs only one time step
• This turns out to be a critical consideration
• Some applications have very long episodes or no episodes at all
• TD learns from every transition
• Some MC methods must discount or throw out episodes in which an experimental action was taken
• TD usually converges faster than constant-α MC in practice, although no formal proof of this has been developed
Soundness
• Is TD sound?
• Yes, for any fixed policy π the TD algorithm has been proven to converge to $V^\pi$: in the mean with a sufficiently small constant step-size parameter, and with probability 1 if the step-size parameter decreases according to the usual stochastic approximation conditions.
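The slide does not spell out the "usual stochastic approximation conditions". Assuming it means the standard conditions on a decreasing step-size sequence $\alpha_k$ (the index k is notation introduced here, not taken from the slides), they can be written as:

```latex
\sum_{k=1}^{\infty} \alpha_k = \infty
\qquad \text{and} \qquad
\sum_{k=1}^{\infty} \alpha_k^{2} < \infty
```

For example, $\alpha_k = 1/k$ satisfies both conditions. A constant step size violates the second one, which is consistent with the constant case being treated separately above.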