Continuous-Action Q-Learning. Jose Del R. Millan et al., Machine Learning 49, 247-265 (2002). Summarized by Seung-Joon Yi. (C) 2003, SNU Biointelligence Lab, http://bi.snu.ac.kr/
ITPM (Incremental Topology Preserving Map) • Consists of units and edges between pairs of units. • Maps the current sensory situation x onto an action a. • Units are created incrementally and incorporate bias. • After a unit is created, its sensory component is tuned by self-organizing rules. • Its action component is updated through reinforcement learning.
ITPM • Units and bias • Initially the ITPM has no units; they are created as the robot uses its built-in reflexes. • Units in the network have overlapping, localized receptive fields. • When the neural controller makes an incorrect generalization, the reflexes take control of the robot and a new unit is added to the ITPM.
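The unit structure described above can be pictured with a small Python sketch. It is not the authors' implementation: it assumes each unit stores a sensory prototype and one Q-value per discrete action, that a new unit receives a hypothetical initial Q-value (`bias`) for the action chosen by the reflex, and that the new unit is linked to its nearest existing unit.

```python
import math


class ITPM:
    """Minimal ITPM sketch: units with sensory prototypes and per-action
    Q-values, plus undirected edges between pairs of units."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.weights = []   # sensory prototype of each unit
        self.q = []         # one list of Q-values (per discrete action) per unit
        self.edges = set()  # undirected edges stored as (i, j) with i < j

    def nearest(self, x):
        """Index of the unit whose prototype is closest to x (None if empty)."""
        if not self.weights:
            return None
        return min(range(len(self.weights)),
                   key=lambda i: math.dist(x, self.weights[i]))

    def add_unit(self, x, reflex_action, bias=1.0):
        """Create a unit centred on x whose Q-values favour the reflex's action.

        Assumptions: 'bias' is a hypothetical initial Q-value for the reflex
        action, and the new unit is linked to the nearest existing unit.
        """
        closest = self.nearest(x)
        q = [0.0] * self.n_actions
        q[reflex_action] = bias
        self.weights.append(list(x))
        self.q.append(q)
        new = len(self.weights) - 1
        if closest is not None:
            # closest < new always holds, so the (i < j) convention is kept.
            self.edges.add((closest, new))
        return new
```

For example, after `add_unit([0.0, 0.0], 1)` and `add_unit([1.0, 0.0], 2)`, the input `[0.9, 0.1]` maps to unit 1 and the map holds the single edge `(0, 1)`.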
ITPM • Self-organizing rules
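The slide's self-organizing rules are not reproduced in this summary. As a stand-in, the sketch below uses a Growing-Neural-Gas-style step, which is an assumption rather than the paper's rule: the nearest unit moves toward the input x, its topological neighbours move by a smaller step, and an edge is created between the two nearest units. The learning rates `eta_winner` and `eta_neighbor` are hypothetical parameters.

```python
import math


def adapt(weights, edges, x, eta_winner=0.1, eta_neighbor=0.01):
    """One GNG-style self-organizing step (illustrative assumption).

    weights: list of prototype vectors (lists of floats), modified in place.
    edges: set of undirected edges (i, j) with i < j, modified in place.
    Returns the index of the winning (nearest) unit.
    """
    order = sorted(range(len(weights)), key=lambda i: math.dist(x, weights[i]))
    winner = order[0]
    # Connect the winner to the second-nearest unit, if there is one.
    if len(order) > 1:
        second = order[1]
        edges.add((min(winner, second), max(winner, second)))
    # Move the winner towards x.
    weights[winner] = [w + eta_winner * (xi - w)
                       for w, xi in zip(weights[winner], x)]
    # Move the winner's topological neighbours towards x by a smaller step.
    for i, j in edges:
        if winner in (i, j):
            k = j if i == winner else i
            weights[k] = [w + eta_neighbor * (xi - w)
                          for w, xi in zip(weights[k], x)]
    return winner
```

With prototypes `[[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]`, input `[0.2, 0.0]`, `eta_winner=0.5` and `eta_neighbor=0.1`, unit 0 wins and moves to `[0.1, 0.0]`, unit 1 becomes its neighbour and moves to about `[0.92, 0.0]`, and unit 2 is untouched.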
ITPM • Advantages • Automatically allocates units in the visited parts of the input space. • Dynamically adjusts the resolution needed in different regions. • Experiments show that, on average, every unit is connected to 5 other units at the end of the learning episodes.
ITPM • General learning algorithm
Discrete-action Q-Learning • Action selection rule • ε-greedy policy • Q-value update rule
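A minimal Python sketch of the two rules named on this slide: the ε-greedy policy and the standard one-step Q-learning update Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]. Here `q` maps a state to its list of Q-values; in the ITPM setting the state would presumably be the nearest unit. The values of `alpha` and `gamma` are illustrative, not the paper's.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    td_target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (td_target - q[state][action])
    return q[state][action]
```

For instance, with `q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}`, reward 1.0, `alpha=0.5` and `gamma=0.9`, the update of action 1 in `s0` gives 0.5 * (1.0 + 0.9 * 2.0) = 1.4.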
Continuous-action Q-Learning • Action selection rule • The action is the average of the nearest unit's discrete actions, weighted by their Q-values. • The Q-value of the selected continuous action a is:
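The action selection rule above can be sketched as a Q-weighted average of the nearest unit's discrete actions. This sketch assumes scalar actions (for example, hypothetical steering angles) and clips negative Q-values to zero so that the weights stay non-negative; both choices are assumptions. The Q-value of the resulting action is not computed here.

```python
def continuous_action(actions, q_values):
    """Average the unit's discrete actions, weighted by their Q-values.

    Assumption: negative Q-values are clipped to zero so weights are non-negative.
    """
    weights = [max(q, 0.0) for q in q_values]
    total = sum(weights)
    if total == 0.0:
        # Hypothetical fallback: no positive Q-value, so take the plain mean.
        return sum(actions) / len(actions)
    return sum(w * a for w, a in zip(weights, actions)) / total
```

With actions `[-30.0, 0.0, 30.0]` and Q-values `[0.0, 1.0, 3.0]`, the result is (1 * 0 + 3 * 30) / 4 = 22.5, i.e. the action is pulled toward the discrete action with the highest Q-value.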
Continuous-action Q-Learning • Q-value update rule
Average-Reward RL • Q-value update rule
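The slide's average-reward update is not reproduced here. For orientation, the sketch below implements R-learning (Schwartz, 1993), a standard average-reward form of Q-learning, following the textbook version in Sutton and Barto: the discount factor is dropped and an estimate `rho` of the average reward is subtracted from each reward. The paper's exact rule may differ, and the step sizes `alpha` and `beta` are illustrative.

```python
def r_learning_update(q, rho, state, action, reward, next_state,
                      alpha=0.1, beta=0.01):
    """One R-learning step, a standard average-reward Q-value update.

    q: mapping state -> list of Q-values (modified in place).
    rho: current estimate of the average reward per step.
    Returns the updated rho.
    """
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward - rho + best_next - q[state][action])
    # Update the average-reward estimate only when the taken action is greedy.
    if q[state][action] == max(q[state]):
        rho += beta * (reward - rho + best_next - max(q[state]))
    return rho
```

With `q = {"s0": [0.0, 0.0], "s1": [1.0, 0.0]}`, `rho = 0.0`, reward 1.0, `alpha=0.5` and `beta=0.1`, action 0 in `s0` rises to 1.0, becomes greedy, and `rho` moves to 0.1.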
Experiments • Wall-following task • Reward
Experiments • Performance comparison between discrete- and continuous-action discounted-reward RL
Experiments • Performance comparison between discrete- and continuous-action average-reward RL
Experiments • Performance comparison between discounted-reward and average-reward RL, discrete-action case
Conclusion • Presented a simple Q-learning method that works in continuous domains. • The ITPM represents the continuous input space. • Compared discounted-reward RL against average-reward RL.