Study on Genetic Network Programming (GNP) with Learning and Evolution
Hirasawa Laboratory, Artificial Intelligence Section, Information Architecture Field, Graduate School of Information, Production and Systems, Waseda University
I Research Background
• Systems are becoming large and complex
• Robot control
• Elevator group control systems
• Stock trading systems
It is very difficult to design efficient control rules by hand while accounting for the many kinds of real-world phenomena involved. Intelligent systems (evolutionary and learning algorithms) can build such rules automatically.
II Objective of the Research
• Propose an algorithm that combines evolution and learning
• In the natural world:
• Evolution: many individuals (living things) adapt to the world (environment) over a long succession of generations, which gives them inherent functions and characteristics
• Learning: the knowledge that living things acquire during their lifetime through trial and error
Evolution
The characteristics of living things are determined by their genes. Evolution gives them inherent characteristics and functions. Evolution is realized by the following components: selection, crossover and mutation.
Selection: individuals that fit the environment survive; the others die out.
Crossover: genes are exchanged between two individuals, producing new individuals.
Mutation: some genes of the selected individuals are changed to other values, producing new individuals.
Important factors in reinforcement learning • State transition (definition of states and actions) • Trial and error learning • Future prediction
Framework of Reinforcement Learning
• Action rules are learned through the interaction between an agent and an environment.
• The agent receives a state signal (sensor input) and a reward (an evaluation of its action) from the environment, and sends an action back to the environment.
The aim of RL is to maximize the total reward obtained from the environment.
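As a rough sketch of this interaction loop, the toy code below runs an agent against a one-dimensional environment; the ToyEnvironment and RandomAgent classes, the goal position and the reward of 100 are illustrative assumptions, not the slides' Khepera setup.

```python
import random

class ToyEnvironment:
    """Hypothetical 1-D world: the goal is at position 5; reaching it ends the episode."""
    def __init__(self):
        self.position = 0
    def step(self, action):
        self.position += 1 if action == "right" else -1
        reward = 100 if self.position == 5 else 0   # reward evaluates the action
        done = self.position == 5
        return self.position, reward, done          # state signal, reward, end flag

class RandomAgent:
    """Placeholder agent that acts at random (no learning yet)."""
    def act(self, state):
        return random.choice(["right", "left"])

env, agent = ToyEnvironment(), RandomAgent()
state, total_reward = 0, 0
for _ in range(1000):                               # interaction loop
    action = agent.act(state)                       # agent -> environment: action
    state, reward, done = env.step(action)          # environment -> agent: state, reward
    total_reward += reward                          # RL aims to maximize this total
    if done:
        break
print("total reward:", total_reward)
```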
State transition
• An action at taken in state st at time t moves the agent to state st+1 and yields reward rt+1; repeating this gives the sequence st, at, rt+1, st+1, at+1, ... until the goal state st+n is reached.
Example: maze problem. From the start, the agent moves right (at), upward (at+1), left (at+2), and so on, and does nothing at the end (at+n); a reward of 100 is given on reaching the goal.
Trial-and-error learning
The agent decides on an action and takes it. The reward (a scalar value) indicates whether the action was good or not. The acquired knowledge is: on success (positive reward), take this action again; on failure (negative reward), do not take this action again. This trial-and-error loop is the basic concept of reinforcement learning.
Future prediction
• Reinforcement learning estimates future rewards and takes actions accordingly: from the current state st, the actions at, at+1, at+2, ... lead to future states st+1, st+2, st+3, ... with rewards rt, rt+1, rt+2, ...
Future prediction
• Reinforcement learning considers not only the current reward but also future rewards.
Case 1: rewards rt = 1, rt+1 = 1, rt+2 = 1 along the path st → st+1 → st+2 → st+3
Case 2: rewards rt = 0, rt+1 = 0, rt+2 = 100 along the same kind of path
Although Case 1 gives rewards immediately, Case 2 yields a much larger total reward, so an agent that predicts the future should prefer Case 2.
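To make the comparison concrete, the short sketch below computes the discounted return of the two reward sequences; the discount factor 0.9 is an assumption for illustration, as the slide does not state one.

```python
# Discounted return G = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
def discounted_return(rewards, gamma=0.9):
    return sum(r * gamma ** i for i, r in enumerate(rewards))

case1 = [1, 1, 1]       # immediate but small rewards
case2 = [0, 0, 100]     # delayed but large reward

print(discounted_return(case1))   # 2.71
print(discounted_return(case2))   # 81.0 -> the agent should prefer Case 2
```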
Genetic Network Programming (GNP)
GNP is an evolutionary computation method. What is evolutionary computation?
• Solutions (programs) are represented by genes (solution = gene)
• The programs are evolved (changed) by selection, crossover and mutation
Structure of GNP
• GNP represents its programs as directed graph structures.
• The graph structures can be encoded as gene structures.
• Each graph is composed of processing nodes and judgment nodes.
(Figure: a graph structure and its corresponding gene structure)
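As a rough illustration of how such a graph could be encoded as genes, here is a minimal sketch; the field names (kind, function, parameter, connections) and the example nodes are illustrative assumptions, not the paper's actual encoding.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                   # "judgment" or "processing"
    function: str               # e.g. "judge_sensor_1" or "set_right_wheel_speed"
    parameter: float            # threshold for a judgment, speed for a processing node
    connections: list = field(default_factory=list)  # indices of next nodes (one per branch)

# An individual (program) is a list of node genes plus the index of the start node.
individual = [
    Node("judgment",   "judge_sensor_1",        500.0, connections=[1, 2]),  # >=500 -> 1, <500 -> 2
    Node("processing", "set_right_wheel_speed",  10.0, connections=[0]),
    Node("processing", "set_left_wheel_speed",   10.0, connections=[0]),
]
start_node = 0
```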
Khepera robot
• The Khepera robot is used for the performance evaluation of GNP.
• Sensors: each sensor value is close to 0 when far from obstacles and close to 1023 when close to obstacles.
• Wheels: the speed of the right wheel VR and the speed of the left wheel VL each range from -10 (backward) to 10 (forward).
Node functions
• Processing node: determines an agent action. Example (Khepera robot behavior): set the speed of the right wheel to 10.
• Judgment node: selects a branch based on its judgment result. Example: judge the value of sensor 1 and branch on "500 or more" vs. "less than 500".
An example of node transition
The program moves from node to node: for example, judging sensor 1 (branches: the value is 700 or more / less than 700), setting the speed of the right wheel to 5, and judging sensor 5 (branches: 80 or more / less than 80).
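A self-contained sketch of how such a node transition could be executed is given below; the node table, sensor names, thresholds and connections are illustrative choices that mirror the example, not the structure used in the paper.

```python
# Each entry: a judgment node branches on a sensor threshold, a processing node returns an action.
nodes = {
    0: {"kind": "judgment",   "sensor": "sensor_1", "threshold": 700, "next": {True: 1, False: 2}},
    1: {"kind": "processing", "action": "set_right_wheel_speed", "value": 5, "next": 2},
    2: {"kind": "judgment",   "sensor": "sensor_5", "threshold": 80,  "next": {True: 0, False: 1}},
}

def step(nodes, start, sensors, max_transitions=10):
    """Follow judgment branches until a processing node fires, then return its action."""
    idx = start
    for _ in range(max_transitions):
        node = nodes[idx]
        if node["kind"] == "judgment":
            idx = node["next"][sensors[node["sensor"]] >= node["threshold"]]
        else:
            return node["action"], node["value"], node["next"]
    return None   # no processing node reached within the limit

print(step(nodes, 0, {"sensor_1": 750, "sensor_5": 60}))
# -> ('set_right_wheel_speed', 5, 2)
```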
Flowchart of GNP
1. Start: generate an initial population (initial programs)
2. Task execution / reinforcement learning
3. Evolution: selection / crossover / mutation
Steps 2 and 3 form one generation; repeat until the last generation, then stop.
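A compact, runnable sketch of this flowchart follows. A list of numbers stands in for a GNP program, and the task-execution, selection, crossover and mutation steps are toy placeholders; only the overall control flow (evaluate with learning, then evolve, once per generation) follows the slide.

```python
import random

def random_individual(n=10):
    return [random.random() for _ in range(n)]

def run_task_with_learning(ind):
    # Placeholder for task execution + reinforcement learning; returns a fitness value.
    return sum(ind)

def evolve(parents, size):
    # Placeholder selection / crossover / mutation on the toy encoding.
    children = []
    while len(children) < size:
        a, b = random.sample(parents, 2)
        cut = random.randrange(1, len(a))
        child = a[:cut] + b[cut:]                                   # crossover
        child = [random.random() if random.random() < 0.05 else g   # mutation
                 for g in child]
        children.append(child)
    return children

population = [random_individual() for _ in range(20)]        # initial programs
for generation in range(50):                                 # one loop = one generation
    fitness = [run_task_with_learning(ind) for ind in population]
    ranked = [ind for _, ind in sorted(zip(fitness, population), reverse=True)]
    population = evolve(ranked[:10], 20)                     # better half becomes parents
print("best fitness:", max(run_task_with_learning(ind) for ind in population))
```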
Evolution of GNP: selection
Good individuals (programs) are selected from the GNP population based on their fitness. Fitness indicates how well each individual achieves the given task. The selected individuals are used for crossover and mutation.
Evolution of GNP: crossover
Some nodes and their connections are exchanged between two individuals (Individual 1 and Individual 2).
Evolution of GNP: mutation
Connections are rewired or a node function is changed, e.g. a processing node "speed of left wheel: 10" becomes "speed of right wheel: 5".
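The sketch below illustrates how crossover and mutation from the two slides above could operate on a node list (nodes are plain dicts so the example is self-contained); which nodes are exchanged or altered, and the rates used, are illustrative assumptions.

```python
import copy
import random

def crossover(parent1, parent2, rate=0.1):
    """Exchange randomly chosen nodes (and their connections) between two individuals."""
    child1, child2 = copy.deepcopy(parent1), copy.deepcopy(parent2)
    for i in range(len(child1)):
        if random.random() < rate:
            child1[i], child2[i] = child2[i], child1[i]
    return child1, child2

def mutate(individual, n_nodes, rate=0.1):
    """Randomly rewire connections or change a node parameter (e.g. a wheel speed)."""
    child = copy.deepcopy(individual)
    for node in child:
        if random.random() < rate:                                   # change connections
            node["connections"] = [random.randrange(n_nodes) for _ in node["connections"]]
        if random.random() < rate:                                   # change node parameter
            node["parameter"] = random.randint(-10, 10)
    return child

parent = [{"function": "set_left_wheel_speed",  "parameter": 10, "connections": [1]},
          {"function": "set_right_wheel_speed", "parameter": 5,  "connections": [0]}]
print(mutate(parent, n_nodes=2))
```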
The role of learning
Node parameters are changed by reinforcement learning during task execution.
Example: the program judges sensor 0 (branches: 1000 or more / less than 1000) and sets the speed of the right wheel to 10, which causes a collision. Learning then changes the judgment threshold from 1000 to 500 so that obstacles are judged more sensitively, and the speed from 10 to 5 so that the robot does not collide with the obstacle.
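The slides do not spell out the update rule. As one illustrative possibility (not necessarily the paper's method), the sketch below lets a node keep a few candidate parameter values with Q-values, picks one epsilon-greedily during execution, and reinforces it according to the reward received; all names and values are assumptions.

```python
import random

class LearnedParameter:
    """Illustrative per-node parameter learned by reinforcement: one Q-value per candidate."""
    def __init__(self, candidates):
        self.candidates = candidates          # e.g. thresholds [500, 1000] or speeds [5, 10]
        self.q = [0.0] * len(candidates)
        self.last = 0

    def choose(self, epsilon=0.1):
        # Epsilon-greedy selection of the parameter actually used at this node.
        self.last = (random.randrange(len(self.candidates)) if random.random() < epsilon
                     else max(range(len(self.candidates)), key=lambda i: self.q[i]))
        return self.candidates[self.last]

    def update(self, reward, alpha=0.1):
        # Move the Q-value of the chosen candidate toward the received reward.
        self.q[self.last] += alpha * (reward - self.q[self.last])

# Example: a processing node's wheel speed; a collision gives a negative reward, so the
# lower speed gradually earns the higher Q-value and is chosen more often.
speed = LearnedParameter([5, 10])
for _ in range(200):
    chosen = speed.choose()
    reward = -1 if chosen == 10 else 1        # toy reward signal for illustration
    speed.update(reward)
print(speed.candidates[max(range(2), key=lambda i: speed.q[i])])   # likely 5
```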
The aim of combining evolution and learning
• Evolution uses many individuals, and the better ones are selected after task execution.
• Learning works within one individual, so better action rules can be determined during task execution.
Combining them is expected to create efficient programs and to search for solutions faster.
VI Simulation
• Task: wall-following behavior
• Condition 1: no sensor value may be more than 1000
• Condition 2: at least one sensor value is more than 100
• Move straight
• Move fast
Reward is given only while conditions 1 and 2 are satisfied; otherwise no reward is given.
(Figure: simulation environment)
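A small sketch of checking the two sensor conditions and rewarding straight, fast movement follows. The slide does not show the exact reward formula, so the one below (average wheel speed scaled by straightness, given only when both conditions hold) is an assumption for illustration.

```python
def wall_following_reward(sensors, v_right, v_left):
    """Return a reward only when the two sensor conditions of the slide hold."""
    cond1 = all(v <= 1000 for v in sensors)      # condition 1: not touching the wall
    cond2 = any(v > 100 for v in sensors)        # condition 2: staying near the wall
    if not (cond1 and cond2):
        return 0.0
    speed = (v_right + v_left) / 20.0                    # fast: high average wheel speed
    straightness = 1.0 - abs(v_right - v_left) / 20.0    # straight: similar wheel speeds
    return speed * straightness                          # illustrative formula, not the paper's

print(wall_following_reward([120, 300, 50, 10, 0, 0, 0, 0], v_right=10, v_left=9))
```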
Node functions
(Table: the judgment and processing node functions used in this simulation)
Simulation result
Conditions:
• Number of individuals: 600
• Number of nodes: 34 (judgment nodes: 24, processing nodes: 10)
(Figures: the track of the robot from the start position, and fitness curves of the best individuals averaged over 30 independent simulations, comparing GNP with learning and evolution against standard GNP, i.e. GNP with evolution only.)
Simulations in previously unseen environments (generalization ability)
The best program obtained in the previous environment is executed in environments it has not experienced. The robot still shows the wall-following behavior.
VII Conclusion
• An algorithm of GNP that uses both evolution and reinforcement learning has been proposed.
• The simulation results show that the proposed method can learn wall-following behavior well.
• Future work
• Apply GNP with evolution and reinforcement learning to real-world applications
• Elevator control systems
• Stock trading models
• Compare with other evolutionary algorithms
VIII Other simulations: Tileworld
The world consists of floor, walls, tiles, holes and agents. An agent can push a tile and drop it into a hole; the aim of the agent is to drop as many tiles into holes as possible. (Figure: example of a tileworld)
Fitness = the number of dropped tiles
Reward rt = 1 (when a tile is dropped into a hole)
Example of node transition
Judgment nodes ask, for example, "What is the direction of the nearest hole?" (forward / backward / left / right / nothing) or "What is in front of the agent?" (floor / wall / tile / hole / agent), and a processing node then takes an action such as "go forward".
Simulation 1
• There are 30 tiles and 30 holes
• The same environment (Environment I) is used every generation
• Time limit: 150 steps
Fitness curve (Simulation 1)
(Figure: fitness vs. generation for GNP with learning and evolution, GNP with evolution, GP (max depth 5), GP-ADFs (main tree: max depth 3, ADF: depth 2) and EP (evolution of finite state machines).)
Simulation 2
• 20 tiles and 20 holes are placed at random positions
• A new tile and a new hole appear just after the agent pushes a tile into a hole
• Time limit: 300 steps
(Figure: Environment II, an example of an initial state)
Fitness curve (Simulation 2)
(Figure: fitness vs. generation for GNP with learning and evolution, GNP with evolution, EP, GP-ADFs (main tree: max depth 3, ADF: depth 2) and GP (max depth 5).)
Ratio of used nodes
(Figure: bar charts of the ratio of used node functions in the initial generation and in the last generation. Node functions: judge forward, judge backward, judge left side, judge right side, direction of tile, direction of hole, direction of hole from tile, second nearest tile, go forward, turn left, turn right, do nothing.)
Summary of the simulations
(Table: data on the best individuals obtained at the last generation, 30 samples, for Simulation I and Simulation II.)
Summary of the simulations
(Table: calculation time comparison for Simulation I and Simulation II.)
The program obtained by GNP
(Figure: the behavior of the obtained program, shown from step 0.)
Maze problem
Objective: reach the goal as early as possible. The maze contains floor, walls, a door, an agent, a key (K) and a goal (G); the key is necessary to open the door in front of the goal. Time limit: 300 steps.
Reward rt = 1 (when reaching the goal)
Fitness = remaining time (when reaching the goal), 0 (when the agent cannot reach the goal)
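The fitness and reward definitions above translate directly into a small sketch; the 300-step time limit is from the slide, and the function names are only illustrative.

```python
TIME_LIMIT = 300   # steps, from the slide

def maze_reward(reached_goal):
    # r_t = 1 only at the moment the agent reaches the goal.
    return 1 if reached_goal else 0

def maze_fitness(reached_goal, steps_used):
    # Fitness = remaining time when the goal is reached, 0 otherwise,
    # so reaching the goal earlier gives a higher fitness.
    return TIME_LIMIT - steps_used if reached_goal else 0

print(maze_fitness(True, 120))    # 180
print(maze_fitness(False, 300))   # 0
```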
Fitness curve (maze problem)
(Figure: fitness vs. generation for GNP with learning and evolution (GNP-LE), GNP with evolution (GNP-E) and GP, with data on the best individuals obtained at the last generation, 30 samples.)