Tópicos Especiais em Aprendizagem (Special Topics in Learning). Prof. Reinaldo Bianchi, Centro Universitário da FEI, 2007
Goal of this Lecture • Reinforcement Learning: • Monte Carlo Methods. • Temporal Difference Learning. • Eligibility Traces. • Today's lecture: chapters 5, 6 and 7 of Sutton & Barto.
What is Reinforcement Learning? • Learning through interaction. • Goal-oriented learning. • Learning about, from, and while interacting with an external environment. • Learning what to do: • How to map situations to actions. • So as to maximize a numerical reward signal.
The Agent in RL • Situated in time. • Continual learning and planning. • Its goal is to affect the environment. (Figure: the agent-environment loop, with the agent sending actions to the environment and receiving states and rewards back.)
Elements of RL • Policy: what to do. • Reward: what is good. • Value: what is good because it predicts reward. • Model: what causes what.
The Agent-Environment Interface • At each step the agent observes a state, takes an action, and receives a reward, producing the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, \dots$
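This interaction loop can be written down directly. A minimal sketch, assuming a hypothetical `env` with `reset()`/`step(action)` methods and an `agent` with `act(state)` and `learn(...)`; none of these names come from the slides.

```python
# Generic agent-environment interaction loop (hypothetical interfaces).
def run_episode(env, agent):
    state = env.reset()                               # s_t
    total_reward = 0.0
    done = False
    while not done:
        action = agent.act(state)                     # a_t, chosen by the agent's policy
        next_state, reward, done = env.step(action)   # r_{t+1}, s_{t+1}
        agent.learn(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
    return total_reward
```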
The Agent Learns a Policy • Reinforcement learning methods specify how the agent changes its policy as a result of experience. • Roughly, the agent’s goal is to get as much reward as it can over the long run.
Goals and Rewards • Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible. • A goal should specify what we want to achieve, not how we want to achieve it. • A goal must be outside the agent’s direct control—thus outside the agent. • The agent must be able to measure success: • explicitly; • frequently during its lifespan.
Returns • Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. • The return is the sum of rewards, $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$, where T is a final time step at which a terminal state is reached, ending an episode.
Important!!! • These are very different things: • Reward (rt): • What the agent receives when it takes an action. • Return (Rt): • The accumulated (possibly discounted) sum of future rewards; the relation between them is, for instance, $R_t = r_{t+1} + r_{t+2} + \cdots + r_T$. • Expected Return (E{Rt}): • What we want to maximize.
Returns for Continuing Tasks • Continuing tasks: interaction does not break naturally into episodes. • Discounted return: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$, where $\gamma$, $0 \le \gamma \le 1$, is the discount rate.
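A quick sketch of computing the discounted return from a finite list of sampled rewards (the variable names and the example numbers are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    R = 0.0
    # Accumulate from the last reward backwards: R <- r + gamma * R
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([1.0, 0.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9**3 * 2 = 2.458
```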
The Markov Property • Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property: $\Pr\{s_{t+1}=s', r_{t+1}=r \mid s_t, a_t\} = \Pr\{s_{t+1}=s', r_{t+1}=r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \dots, r_1, s_0, a_0\}$ for all $s'$, $r$, and histories.
Defining a Markov Decision Process • To define a finite MDP, you need to give: • the state set S and action sets A(s); • one-step "dynamics" defined by transition probabilities: $P_{ss'}^{a} = \Pr\{s_{t+1}=s' \mid s_t=s, a_t=a\}$; • reward expectations: $R_{ss'}^{a} = E\{r_{t+1} \mid s_t=s, a_t=a, s_{t+1}=s'\}$. A toy MDP in code is sketched below.
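A minimal sketch of how such a finite MDP can be represented as plain data; this two-state MDP and its numbers are made up for illustration only:

```python
# A made-up 2-state, 2-action MDP: P[s][a] is a list of (next_state, prob, expected_reward).
states = ["s0", "s1"]
actions = {"s0": ["stay", "go"], "s1": ["stay", "go"]}

P = {
    "s0": {"stay": [("s0", 1.0, 0.0)],
           "go":   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)]},
    "s1": {"stay": [("s1", 1.0, 2.0)],
           "go":   [("s0", 1.0, 0.0)]},
}
# P["s0"]["go"] says: from s0, action "go" reaches s1 with prob 0.8 (expected reward 1.0)
# and stays in s0 with prob 0.2 (expected reward 0.0), i.e. P^a_{ss'} and R^a_{ss'} together.
```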
Value Functions • The value of a state is the expected return starting from that state; it depends on the agent's policy. • State-value function for policy π: $V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$.
Value Functions • The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π. • Action-value function for policy π: $Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s, a_t = a\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$.
Bellman Equation for a Policy • The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = r_{t+1} + \gamma R_{t+1}$. • So: $V^{\pi}(s) = E_{\pi}\{r_{t+1} + \gamma V^{\pi}(s_{t+1}) \mid s_t = s\}$. • Or, without the expectation operator: $V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V^{\pi}(s') \right]$.
Policy Iteration • Alternate policy evaluation and policy improvement ("greedification"): $\pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \cdots \to \pi^* \to V^*$.
Value Iteration • Recall the full policy evaluation backup: $V_{k+1}(s) = \sum_{a} \pi(s,a) \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$. • Here is the full value iteration backup: $V_{k+1}(s) = \max_{a} \sum_{s'} P_{ss'}^{a} \left[ R_{ss'}^{a} + \gamma V_k(s') \right]$. A code sketch of this backup follows.
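A minimal sketch of the value iteration backup on the same kind of toy MDP representation used above; the MDP, discount factor, and stopping threshold are illustrative assumptions, not from the slides:

```python
# Value iteration on a made-up 2-state MDP.
# P[s][a] lists (next_state, probability, expected_reward) triples.
P = {
    "s0": {"stay": [("s0", 1.0, 0.0)],
           "go":   [("s1", 0.8, 1.0), ("s0", 0.2, 0.0)]},
    "s1": {"stay": [("s1", 1.0, 2.0)],
           "go":   [("s0", 1.0, 0.0)]},
}
gamma, theta = 0.9, 1e-6
V = {s: 0.0 for s in P}

while True:
    delta = 0.0
    for s in P:
        # Full value iteration backup: max over actions of the expected one-step return.
        v_new = max(sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                    for outcomes in P[s].values())
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta < theta:
        break

# Replacing max(...) with a pi-weighted sum over actions gives the policy evaluation backup.
print(V)
```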
End of the Review • Important: • The basic concepts should be well understood. • Problem: • DP requires the state-transition model P. • How can we solve the problem when the model is not known?
Monte Carlo Methods Chapter 5 of Sutton & Barto.
Monte Carlo Methods • Monte Carlo methods learn from complete sample returns. • Defined for episodic tasks. • Monte Carlo methods learn directly from experience: • On-line: no model is needed to reach the optimal solution. • Simulated: no complete model is needed.
Wikipedia: Monte Carlo Definition • Monte Carlo methods are a widely used class of computational algorithms for simulating the behavior of various physical and mathematical systems. • They are distinguished from other simulation methods (such as molecular dynamics) by being stochastic, usually using random numbers, as opposed to deterministic algorithms. • Because of the repetition of the algorithms and the large number of calculations involved, Monte Carlo methods need large amounts of computing power.
Wikipedia: Monte Carlo Definition • A Monte Carlo algorithm is a numerical Monte Carlo method used to find solutions to mathematical problems (which may have many variables) that cannot easily be solved, for example, by integral calculus, or other numerical methods. • For many types of problems, its efficiency relative to other numerical methods increases as the dimension of the problem increases.
Monte Carlo principle http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt • Consider the game of solitaire: what's the chance of winning with a properly shuffled deck? • Hard to compute analytically, because winning or losing depends on a complex procedure of reorganizing cards. • Insight: why not just play a few hands and see empirically how many do in fact win? • More generally, we can approximate a probability density function using only samples from that density. (Figure: four simulated hands, one win and three losses, suggesting a chance of winning of 1 in 4.)
Monte Carlo principle • Given a very large set X and a distribution p(x) over it, • we draw a set of N i.i.d. samples $\{x^{(i)}\}_{i=1}^{N} \sim p(x)$, • and we can then approximate the distribution using these samples: $p_N(x) = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}(x^{(i)} = x) \approx p(x)$.
Monte Carlo principle • We can also use these samples to compute expectations: $E[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x^{(i)})$. • And even use them to find a maximum: $\hat{x} = \arg\max_{x^{(i)}} p(x^{(i)})$.
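A minimal sketch of the sample-average estimate of an expectation; the distribution and function are arbitrary choices for illustration:

```python
import random

# Estimate E[f(x)] for x ~ p by averaging f over samples drawn from p.
def mc_expectation(sample, f, n=100_000):
    return sum(f(sample()) for _ in range(n)) / n

# Example: E[x^2] for x ~ Uniform(0, 1) is 1/3.
estimate = mc_expectation(lambda: random.random(), lambda x: x * x)
print(estimate)  # close to 0.3333 for large n
```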
Monte Carlo Example: Approximation of π (the number)... • If a circle of radius r = 1 is inscribed inside a square with side length L = 2, then we obtain: area of circle / area of square = πr² / L² = π/4. http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd
MC Example: Approximation of π (the number)... • Inside the square, we can place N points at random, with uniformly distributed (x, y) coordinates. • Then we can count how many points, NCircle, have fallen inside the circle. http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd
MC Example: Approximation of π (the number)... • If N is large enough, the ratio of the counts approaches the ratio of the areas: NCircle / N ≈ π/4, so π ≈ 4 · NCircle / N. http://twt.mpei.ac.ru/MAS/Worksheets/approxpi.mcd
MC Example: Approximation of π (the number)... • For N = 1000: • NCircle = 768 • Pi = 3.072 • Error = 0.07
MC Example: Approximation of π (the number)... • For N = 10000: • NCircle = 7802 • Pi = 3.1208 • Error = 0.021
MC Example: Approximation of π (the number)... • For N = 100000: • NCircle = 78559 • Pi = 3.1426 • Error = 0.008
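A minimal sketch of this estimator in code; the sample counts and seed are arbitrary, and it will not reproduce the exact numbers from the worksheet linked on the slides:

```python
import random

def estimate_pi(n=100_000, seed=0):
    """Monte Carlo estimate of pi: fraction of uniform points in [-1, 1]^2
    that land inside the unit circle, times 4."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n):
        x, y = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n

print(estimate_pi(1_000), estimate_pi(100_000))  # accuracy improves roughly as 1/sqrt(N)
```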
Monte Carlo Policy Evaluation • Goal: learn $V^{\pi}(s)$. • Given: some number of episodes under π which contain s. • Idea: average the returns observed after visits to s.
Monte Carlo Policy Evaluation • Every-Visit MC: average returns for every time s is visited in an episode • First-visit MC: average returns only for first time s is visited in an episode • Both converge asymptotically.
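A minimal sketch of first-visit MC prediction; the episode format (a list of (state, reward) pairs, where the reward is the one received after leaving that state) and all names are assumptions, not from the slides:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo prediction of V^pi from sample episodes."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        first_visit = {}                     # state -> index of its first occurrence
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards accumulating the return G following each time step.
        returns_at = [0.0] * len(episode)
        G = 0.0
        for t in reversed(range(len(episode))):
            G = episode[t][1] + gamma * G
            returns_at[t] = G
        for s, t in first_visit.items():
            returns_sum[s] += returns_at[t]  # only the return after the FIRST visit to s
            returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}

# Example: two episodes in a tiny chain task.
episodes = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("A", 0.0), ("B", 1.0)]]
print(first_visit_mc(episodes))  # {'A': 1.0, 'B': 1.0} with gamma = 1
```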
Blackjack example • Object: have your card sum be greater than the dealer's without exceeding 21. • States (200 of them): • current sum (12-21) • dealer's showing card (ace-10) • do I have a usable ace? • Reward: +1 for winning, 0 for a draw, -1 for losing • Actions: stick (stop receiving cards), hit (receive another card) • Policy: stick if my sum is 20 or 21, else hit
Backup diagram for Monte Carlo • Entire episode included • Only one choice at each state (unlike DP) • MC does not bootstrap • Time required to estimate one state does not depend on the total number of states
Monte Carlo Estimation of Action Values (Q) • Monte Carlo is most useful when a model is not available. • We want to learn Q*. • $Q^{\pi}(s,a)$: the average return starting from state s and action a, thereafter following π. • Also converges asymptotically if every state-action pair is visited. • Exploring starts: every state-action pair has a non-zero probability of being the starting pair.
Monte Carlo Control • MC policy iteration: Policy evaluation using MC methods followed by policy improvement • Policy improvement step: greedify with respect to value (or action-value) function
Convergence of MC Control • The policy improvement theorem tells us: $Q^{\pi_k}(s, \pi_{k+1}(s)) = \max_a Q^{\pi_k}(s,a) \ge Q^{\pi_k}(s, \pi_k(s)) = V^{\pi_k}(s)$. • This assumes exploring starts and an infinite number of episodes for MC policy evaluation. • To remove the latter assumption: • update only to a given level of performance, or • alternate between evaluation and improvement after each episode.
Monte Carlo with Exploring Starts (MC ES) • Its fixed point is the optimal policy π*. • A proof of convergence is still an open question. A sketch of the algorithm is given below.
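A minimal sketch of MC control with exploring starts, assuming a hypothetical simulator `generate_episode(policy, start_state, start_action)` that starts from the given pair, follows `policy[s]` afterwards, and returns a list of (state, action, reward) triples; the interface and names are assumptions, not from the slides (and it averages every visit, for brevity):

```python
import random
from collections import defaultdict

def mc_exploring_starts(states, actions, generate_episode, n_episodes=10_000, gamma=1.0):
    """Monte Carlo control with exploring starts (every-visit averaging for brevity)."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: actions[0] for s in states}     # arbitrary initial deterministic policy

    for _ in range(n_episodes):
        # Exploring start: a random state-action pair begins the episode.
        s0, a0 = random.choice(states), random.choice(actions)
        episode = generate_episode(policy, s0, a0)   # [(s, a, r), ...]

        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G                         # return following (s, a)
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]   # incremental average
            # Improvement step: greedify the policy at the visited state.
            policy[s] = max(actions, key=lambda b: Q[(s, b)])
    return policy, Q
```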
Blackjack example continued • Exploring starts • Initial policy as described before
On-policy Monte Carlo Control • On-policy: learn about the policy currently being executed. • How do we get rid of exploring starts? • We need soft policies: π(s,a) > 0 for all s and a. • e.g. an ε-greedy policy: non-greedy actions get probability ε/|A(s)| and the greedy action gets 1 − ε + ε/|A(s)|. • Similar to GPI: move the policy towards the greedy policy while keeping it ε-soft. • Converges to the best ε-soft policy. A code sketch follows.
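A minimal sketch of on-policy MC control with an ε-greedy policy, assuming a hypothetical simulator `generate_episode(policy, start_state)` that calls `policy(s)` to pick each action and returns a list of (state, action, reward) triples; all names and parameters are illustrative, not from the slides:

```python
import random
from collections import defaultdict

def epsilon_greedy_action(Q, s, actions, epsilon=0.1):
    """Pick the greedy action w.p. 1 - epsilon, otherwise a uniformly random action."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def on_policy_mc_control(start_state, actions, generate_episode,
                         n_episodes=10_000, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    counts = defaultdict(int)
    # The behaviour policy is the epsilon-greedy policy derived from Q itself.
    policy = lambda s: epsilon_greedy_action(Q, s, actions, epsilon)

    for _ in range(n_episodes):
        episode = generate_episode(policy, start_state)   # [(s, a, r), ...]
        G = 0.0
        for s, a, r in reversed(episode):
            G = r + gamma * G
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]  # running average of returns
    return Q
```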