Reinforcement Learning in the Control of Attention Roderic A Grupen Luiz M G Gonçalves Laboratory for Analysis and Architecture of Systems (State University of Campinas-near future) www.laas.fr/~lmgarcia Laboratory for Perceptual Robotics State University of Massachusetts (USA) www-robotics.cs.umass.edu
Objective • To develop a robotic system to perform tasks involving attention and pattern categorization, integrating multi-modal (haptic and visual) information in a behaviorally cooperative active system.
Motivation • Towards a useful robotic system able to: • foveate (verge) the eyes onto a region of interest (ROI); • keep attention on the ROI if needed; • choose another ROI (shift the focus of attention). • The result is a behaviorally cooperative active system, which provides on-line feedback to environmental stimuli in the form of actions
Method • Use of (real-time) visual information from a stereo head and a simulator • Selective attention (bottom-up salience maps) • Multi-feature extraction (perceptual state) • Associative memory (pattern address identification) • Efficient topological mapping • Learn policies to program the system
Task Specification (Objectives) • Visual monitoring or environment inspection • Construct an attentional map • Keep this map consistent with the current perception (updating) • Categorize all patterns
Markov Process • A stochastic process in which the past does not influence the future once its present is completely specified • Ex.: checkers, chess
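For reference, the Markov property behind this definition can be written as:

```latex
P(s_{t+1} \mid s_t, s_{t-1}, \ldots, s_0) = P(s_{t+1} \mid s_t)
```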
Dynamic Programming • Naive approach: traverse every possible state, testing every possibility (execute all actions indefinitely) • Better solution (DP): • Reduce a problem posed in dimension D to two or more problems of lower dimension • Ex.: stereo disparity: • 1 problem in 3D (x,y,d) is reduced to 2 problems in 2D, (x,d) and (y,d)
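One rough way to read the decomposition above (an illustration, not from the original slides, assuming the search ranges are N_x, N_y, and N_d):

```latex
\underbrace{O(N_x \, N_y \, N_d)}_{\text{one 3D problem in } (x,y,d)}
\;\longrightarrow\;
\underbrace{O(N_x \, N_d) + O(N_y \, N_d)}_{\text{two 2D problems in } (x,d) \text{ and } (y,d)}
```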
Pavlov • When the animal does the right thing, it gets food • When the animal does the wrong thing, it gets punished • In theory, it has been shown that only one of the two (reward or punishment) is needed: do the wrong thing, get no food • Thus: • robot does the right thing => reward
Reinforcement Learning (Related Work) • Watkins: Learning from Delayed Rewards (1989). • Sutton/Barto: Reinforcement Learning: An Introduction (1998). • Araujo: Learning a Control Composition in a Complex Environment (1996). • Huber: A Feedback Control Structure for On-line Learning Tasks (1997). • Coelho: A Control Basis for Learning Multifingered Grasps (1997).
Modelling a problem with delayed reinforcement as an MDP: • a set of states S, • a set of actions A, • a reward function R: S×A → ℝ, and • a state transition function T: S×A → Π(S), which maps state transitions to probabilities. • Q-learning update equation (symbols on the next slide):
Q-learning equation • a = action executed • r = reward received • s' = state resulting from applying a • A = set of all actions a' that may be executed in s' • α = learning rate (typically 0.1) • γ = discount factor (typically 0.5)
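The equation image itself did not survive the slide export; for reference, the standard one-step Q-learning update that this legend describes is:

```latex
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a' \in A} Q(s',a') - Q(s,a) \right]
```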
Observations • A transition in the state space can be completely characterized by the tuple (s,a,r,s') • Provided that Q(s,a) is updated infinitely often for every pair (s,a), Q(s,a) converges with probability 1 to the best possible return for that pair.
Exploration and exploitation • Exploration: randomly choose an action • Exploitation: after some time the system starts to converge, so choose actions known to be contributing to convergence • Balance exploration against exploitation • Temperature (as in Simulated Annealing) • Choose randomly as a function of the temperature (high at first, then lowered) • In practice, even at the end, still 10% random
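A minimal sketch of the temperature-based action selection described above, assuming a tabular representation; the function and parameter names, and the placement of the 10% random floor, are illustrative rather than taken from the original system.

```python
import numpy as np

def select_action(q_values, temperature, min_epsilon=0.1, rng=None):
    """Boltzmann (softmax) exploration with a residual random-action floor."""
    rng = rng or np.random.default_rng()
    if rng.random() < min_epsilon:                          # even late in learning, ~10% random
        return int(rng.integers(len(q_values)))
    prefs = np.asarray(q_values) / max(temperature, 1e-6)   # sharper preferences as T is annealed
    prefs -= prefs.max()                                     # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(q_values), p=probs))
```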
Q-learning algorithm • 1) Determine the current state s by decoding the available sensory information; • 2) Use a stochastic action selector to determine action a; • 3) Perform action a, generating a new state s' and a reinforcement r; • 4) Calculate the temporal-difference error r'; • 5) Update the Q-value of the state/action pair (s,a); • 6) Go to 1.
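A minimal tabular sketch of steps 2-6, with α and γ set to the values quoted on the earlier slide; Q is assumed to be a NumPy array (rows = states, columns = actions), and select_action / do_action are hypothetical hooks (step 1, decoding the sensory state, is assumed to happen outside this function).

```python
ALPHA, GAMMA = 0.1, 0.5                                  # learning rate and discount from the slides

def q_learning_step(Q, s, select_action, do_action):
    a = select_action(Q[s])                              # 2) stochastic action selector (e.g. temperature-based)
    s_next, r = do_action(a)                             # 3) act; observe new state and reinforcement
    td_error = r + GAMMA * Q[s_next].max() - Q[s, a]     # 4) temporal-difference error
    Q[s, a] += ALPHA * td_error                          # 5) update Q(s, a)
    return s_next                                        # 6) continue from the new state
```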
Eligibility trace • Update not just one state-action pair at a time, but a whole sequence of pairs (after executing a series of actions). • Gain in convergence speed
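A naive Q(λ)-style sketch of how a trace table E (same shape as Q) credits a whole sequence of recently visited pairs at once; the λ value and function layout are illustrative assumptions, not taken from the slides.

```python
LAMBDA = 0.9                                      # assumed trace-decay parameter

def q_lambda_step(Q, E, s, a, r, s_next, alpha=0.1, gamma=0.5):
    td_error = r + gamma * Q[s_next].max() - Q[s, a]
    E[s, a] += 1.0                                # mark the pair just visited
    Q += alpha * td_error * E                     # update every traced pair at once
    E *= gamma * LAMBDA                           # older pairs fade away
    return td_error
```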
In practice • A table (Q-table) • Rows are the states (s) • Columns are the actions (a) • Each element Q(s,a) is a Q-value, given by the function that evaluates the utility of taking action a when the state is s
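Illustratively, with hypothetical sizes (not from the original system), the Q-table is just a 2-D array indexed by state and action:

```python
import numpy as np

n_states, n_actions = 12, 4           # hypothetical sizes
Q = np.zeros((n_states, n_actions))   # Q[s, a]: utility of taking action a in state s
```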
Low-level Control • Defining a target • Pre-attentional phase (stimuli + internal bias) • Shifting attention (saccade generation) • Fine saccade (using a target model) • Verging the eyes onto a target (correlation) • Movements are computed from errors to the image centers
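A purely illustrative sketch of the last bullet (movements computed from errors to the image centers), assuming a proportional, pinhole-camera mapping; the image size, focal length, and names are assumptions, not parameters of the original head.

```python
import math

def gaze_errors(target_px, image_size=(256, 256), focal_px=300.0):
    """Return (pan, tilt) angular errors in radians from the target's pixel position."""
    ex = target_px[0] - image_size[0] / 2.0   # horizontal error to the image center
    ey = target_px[1] - image_size[1] / 2.0   # vertical error to the image center
    return math.atan2(ex, focal_px), math.atan2(ey, focal_px)
```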
Low-level Control • Identifying Objects • Selecting a region of interest • Extracting features • Associative memory match • Mapping objects and/or updating memory • Pre-attentional maps • Automatic supervised learning
A straightforward control algorithm • Step 0: Initialize the associative memory and start the concurrent controllers of arms, neck, and eyes. • Step 1: Re-direct attention; if a representation is activated, update the attentional maps and re-do this step (1). • Step 2: Try a visual improvement; if a representation is activated, update the attentional maps and return to step 1. • Step 3: Try an arm improvement; if a representation is activated, update the attentional maps and return to step 1. • Step 4: Activate the “supervised learning” module, update the attentional maps, and return to step 1.
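A minimal sketch of the loop in steps 0-4; every callable argument is a hypothetical stand-in for one module of the original system, and each step is assumed to return True when it activated a representation and updated the attentional maps.

```python
def inspection_loop(init_memory, start_controllers, redirect_attention,
                    try_visual_improvement, try_arm_improvement,
                    supervised_learning, max_cycles=1000):
    init_memory()                               # step 0: associative memory
    start_controllers()                         # step 0: arms, neck, eyes run concurrently
    for _ in range(max_cycles):
        if redirect_attention():                # step 1: re-do this step on activation
            continue
        if try_visual_improvement():            # step 2: back to step 1 on activation
            continue
        if try_arm_improvement():               # step 3: back to step 1 on activation
            continue
        supervised_learning()                   # step 4: learn the unknown pattern
```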
Results • Q-learning convergence
Partial Evaluation of Strategies: Attentional Shifts
Partial Evaluation of Strategies: Visual/Arm Improvements
Partial Evaluation of Strategies: Objects Identified
Partial Evaluation of Strategies: New Objects
Global Evaluation: Mapped Objects
Task Accomplishment: Mapped Objects
Times for each phase or process (seconds)
Phase                Min     Max     Mean
Computing retina     0.145   0.189   0.166
Transfer to host     0.017   0.059   0.020
Total acquiring      0.162   0.255   0.186
Pre-attention        0.139   0.205   0.149
Salience map         0.067   0.134   0.075
Total attention      0.324   0.395   0.334
Total saccade        0.466   0.903   0.485
Features for match   0.135   0.158   0.150
Memory match         0.012   0.028   0.019
Total matching       0.323   0.353   0.333
Conclusions • The system can support other sensors. • Attention and categorization act together: tasks must be formulated accordingly. • Inspection task successfully performed. • Currently supports a 10-15 frames-per-second rate. • The reinforcement learning approach worked well in simulation.
Future work • Consider focus for saccade generation and accommodation (vergence) • Test with partially occluded objects • Derive policies (with Q-learning) for control of top-down attention • Increase the state space and/or the set of actions • Define other hierarchical tasks (several policies, each appropriate for a given task) • Test the learning architecture in a real environment
Thanks • Thanks to CNPQ, CAPES, FAPERJ, NSF and UMASS (USA) • To all of you for your patience • To Mimmo and Dr. Arcangelo Distante for hosting me:-).