Integrating POMDP and RL for a Two Layer Simulated Robot Architecture Presented by Alp Sardağ
Two Layer Architecture • The lower layer provides fast, short-horizon decisions. • The lower layer is designed to keep the robot out of trouble. • The upper layer ensures that the robot continually works toward its target task or goal.
Advantages • Offers reliability. • Reliability: the robot must be able to deal with failures of sensors and actuators. • Hardware failure = mission failure • For example, robots operating outside direct human control: • Space exploration • Office robots
The System • It has two levels of control: • The lower level controls the actuators that move the robot around and provides a set of behaviors that can be used by the higher level of control. • The upper level, the planning system, plans a sequence of actions to move the robot from its current location to the goal.
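As a rough illustration of this split, here is a minimal sketch of the two-layer loop. All of the names (pomdp_planner, rl_behaviors, select_behavior, update_belief, and so on) are hypothetical placeholders, not code from the paper.

```python
# Minimal sketch of the two-layer loop (all names are hypothetical placeholders).
# The upper layer selects a behavior from its belief state; the lower layer
# executes it with fast, short-horizon RL control and learns from the reward
# handed down by the planner.
def run_two_layer(pomdp_planner, rl_behaviors, robot, steps=1000):
    belief = pomdp_planner.initial_belief()
    for _ in range(steps):
        behavior_id = pomdp_planner.select_behavior(belief)   # upper layer
        behavior = rl_behaviors[behavior_id]
        obs = robot.sense()
        action = behavior.act(obs)                            # lower layer
        robot.execute(action)
        new_obs = robot.sense()
        belief = pomdp_planner.update_belief(belief, behavior_id, new_obs)
        behavior.learn(obs, action, pomdp_planner.reward(belief), new_obs)
```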
The Architecture • The bottom level is accomplished by RL: • RL, as an incremental learning method, is able to learn online. • RL can adapt to changes in the environment. • RL reduces programmer intervention.
The Architecture • The higher level is a POMDP planner: • The POMDP planner operates quickly once a policy has been generated. • The POMDP planner can provide the reinforcement needed by the lower-level behaviors.
The Test • For testing, the Khepera robot simulator is used. • The Khepera has limited sensors. • It has a well-defined environment. • The simulator can run much faster than real time. • The simulator does not require human intervention for low-battery conditions or sensor failures.
Methods for Low-Level Behaviors • Subsumption • Learning from examples. • Behavioral cloning.
Methods for Low-Level Behaviors • Neural systems tend to be robust to noise and perturbation in the environment. • GeSAM is a neural-network-based robot hand control system; it uses an adaptive neural network. • Neural networks often require long training periods and large amounts of data.
Methods for Low-Level Behaviors • RL can learn continuously. • RL provides adaptation to sensor drift and changes in actuators. • Even in many extreme cases of sensor or actuator failure, RL can adapt enough to allow the robot to accomplish its mission.
Planning at the Top • POMDPs deal with uncertainty. • For the Khepera, with its limited sensors, determining the exact state is very difficult. • Also, the effects of the actuators may not be deterministic.
Planning at the Top • Some rewards are associated with the goal state. • Some rewards are associated with performing some action in a certain state. • This allows complex, compound goals to be defined.
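For illustration only (the paper does not give this code), such a compound reward can be written as a function of both states and state-action pairs. The goal state 13 is the one used later in the evaluation; the reward magnitudes and the non-goal entry are made up.

```python
# Hypothetical sketch of a compound POMDP reward: part of the reward comes from
# reaching a goal state, part from taking a particular action in a particular state.
GOAL_STATE = 13                                   # goal state from the evaluation
STATE_REWARDS = {GOAL_STATE: 10.0}                # reward magnitudes are placeholders
STATE_ACTION_REWARDS = {(5, "turn_left"): 1.0}    # made-up example entry

def reward(state, action):
    return (STATE_REWARDS.get(state, 0.0)
            + STATE_ACTION_REWARDS.get((state, action), 0.0))
```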
Drawback • The current POMDP solution method: • Does not scale well with the size of the state space. • Exact solutions are only feasible for very small POMDP planning problems. • Requires that the robot be given a map, which is not always feasible.
What is Gained? • By combining RL and POMDP, the system is robust to changes. • RL will learn how to use the damaged sensors and actuators. • Continuous learning has some drawbacks when using backpropagation neural networks, such as over-training. • The POMDP adapts to sensor and actuator failures by adjusting the transition probabilities.
The Simulator • Pulse encoders are not used in this work. • The simulation results can be successfully transferred to a real robot. • The sensor model includes stochastic modeling of noise and responds similarly to the real sensors. • The simulation environment includes some stochastic modeling of wheel slippage and acceleration. • Hooks are added to the simulator to allow sensor failures to be simulated. • Effector failures are simulated in the code.
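A hedged sketch of what such a sensor-failure hook might look like; the slides do not show the simulator code, so the noise level, the stuck value, and the function name are assumptions (only the 0-1024 reading range comes from the slides).

```python
import random

# Hypothetical sketch of a sensor hook: each reading gets stochastic noise,
# and a failed sensor returns a fixed "stuck" value instead of a real reading.
def read_sensor(true_distance, failed=False, noise_std=20.0, stuck_value=0):
    if failed:
        return stuck_value                      # simulate a dead sensor
    noisy = true_distance + random.gauss(0.0, noise_std)
    return int(min(max(noisy, 0), 1024))        # clamp to the 0-1024 sensor range
```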
RL Behaviors • Three basic behaviors: move forward, turn right, and turn left. • The robot is always moving or performing an action. • RL is responsible for dealing with: • Obstacles, • Adjusting to sensor or actuator malfunctions.
RL Behaviors • The goal of the RL modules is to maximize the reward given to them by the POMDP planner. • The reward is a function of how long it takes to make a desired state transition. • Each behavior has its own RL module. • Only one RL module can be active at a given time. • Q-learning with table lookup is used to approximate the value function. • Fortunately, the problem is so far small enough for table lookup.
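A minimal sketch of a tabular Q-learning module of the kind described here; the learning rate, discount factor, and exploration rate are placeholders, not values from the paper.

```python
import random
from collections import defaultdict

class QLearner:
    """Tabular Q-learning for one behavior module (sketch; hyperparameters assumed)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)   # (state, action) -> estimated value
        self.actions = actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # Epsilon-greedy action selection over the table.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def learn(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```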
POMDP Planning • Since robots can rarely determine their state from sensor observations, completely observable MDPs (COMDPs) do not work well in many real-world robot planning tasks. • It is more appropriate to maintain a probability distribution over states and update it using the transition and observation probabilities.
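A minimal belief-update sketch, assuming transition probabilities T[s][a][s2] and observation probabilities O[a][s2][o]; the notation and function are assumptions, not the paper's code.

```python
def update_belief(belief, action, observation, T, O):
    """Bayes-filter update: b'(s') is proportional to O(o | s', a) * sum_s T(s' | s, a) * b(s)."""
    n_states = len(belief)
    new_belief = [0.0] * n_states
    for s_next in range(n_states):
        predicted = sum(T[s][action][s_next] * belief[s] for s in range(n_states))
        new_belief[s_next] = O[action][s_next][observation] * predicted
    total = sum(new_belief)
    # Normalize; if the observation had zero probability, keep the old belief.
    return [b / total for b in new_belief] if total > 0 else belief
```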
Sensor Grouping • The Khepera has 8 sensors that report distance values between 0 and 1024. • The observations are reduced to 16: • The sensors are grouped in pairs to make 4 pseudo-sensors, • Thresholding is applied to the output of the sensors. • The POMDP planner is then robust to single sensor failures.
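A sketch of how the reduction to a 4-bit (16-value) observation might be coded. The combination rule (taking the maximum of each pair, so one dead sensor does not blank a pseudo-sensor) and the threshold value are assumptions, not taken from the paper.

```python
def encode_observation(readings, threshold=512):
    """Pair the 8 distance readings into 4 pseudo-sensors, threshold each one,
    and pack the 4 bits into one of 16 observation codes (sketch)."""
    assert len(readings) == 8
    obs = 0
    for i in range(4):
        pair_value = max(readings[2 * i], readings[2 * i + 1])  # assumed pairing rule
        bit = 1 if pair_value >= threshold else 0
        obs = (obs << 1) | bit
    return obs   # value in 0..15
```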
Solving a POMDP • The Witness algorithm is used to compute the optimal policy for the POMDP. • Witness does not scale well with the size of the state space.
Environment and State Space • 64 possible states for the robot: • 16 discrete positions. • The robot's heading is discretized into the four compass directions. • Sensor information was reduced to 4 bits by combining the sensors in pairs and thresholding. • The LP solution required several days on a Sun Ultra 2 workstation.
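For concreteness, the 16-position-by-4-heading encoding of the 64 states could look like the following; the index ordering is an assumption.

```python
N_POSITIONS = 16
HEADINGS = ("N", "E", "S", "W")   # heading discretized into four compass directions

def state_index(position, heading):
    """Map (position 0-15, compass heading) to one of the 64 POMDP states (sketch)."""
    return position * len(HEADINGS) + HEADINGS.index(heading)

assert state_index(15, "W") == 63   # 16 positions x 4 headings = 64 states
```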
Interface Between Layers • The POMDP uses the current belief state to select the low-level behavior to activate. • The implementation tracks the state with the highest probability: the most likely current state. • If the most likely current state changes to the state that the POMDP wants, a reward of +1 is generated; otherwise, –1.
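A sketch of that interface logic, assuming the planner exposes the state it wants the active behavior to reach; the function names are hypothetical.

```python
def most_likely_state(belief):
    """Return the index of the state with the highest probability mass."""
    return max(range(len(belief)), key=lambda s: belief[s])

def behavior_reward(old_belief, new_belief, desired_state):
    """+1 if the most likely state changed to the state the POMDP wants,
    -1 otherwise (sketch of the reward passed to the active RL module)."""
    old_mls = most_likely_state(old_belief)
    new_mls = most_likely_state(new_belief)
    return 1.0 if (new_mls != old_mls and new_mls == desired_state) else -1.0
```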
Hypothesis • Since RLPOMDP is adaptive, the authors expect that overall performance should degrade gracefully as sensors and actuators gradually fail.
Evaluation • State 13 is the goal state. • The POMDP state transition and observation probabilities were obtained by placing the robot in each of the 64 states and taking each action ten times. • With the policy in place, the RL modules are trained in the same way. • For each system configuration (RL or hand-coded), the simulation is started from every position and orientation and performance is recorded.
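A sketch of how those frequencies could be tallied into transition and observation probabilities; the simulator call place_and_act is a hypothetical name, and the array shapes follow the 64-state, 3-action, 16-observation model above.

```python
def estimate_model(simulator, n_states=64, n_actions=3, n_obs=16, trials=10):
    """Estimate T[s][a][s2] and O[a][s2][o] by placing the robot in each state
    and taking each action `trials` times (sketch; `place_and_act` is hypothetical)."""
    T = [[[0.0] * n_states for _ in range(n_actions)] for _ in range(n_states)]
    O = [[[0.0] * n_obs for _ in range(n_states)] for _ in range(n_actions)]
    for s in range(n_states):
        for a in range(n_actions):
            for _ in range(trials):
                s_next, obs = simulator.place_and_act(s, a)
                T[s][a][s_next] += 1.0 / trials        # transition frequency
                O[a][s_next][obs] += 1.0               # raw observation count
    # Normalize the observation counts per (action, next state) pair.
    for a in range(n_actions):
        for s_next in range(n_states):
            total = sum(O[a][s_next])
            if total > 0:
                O[a][s_next] = [c / total for c in O[a][s_next]]
    return T, O
```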
Metrics • Failures during a trial evaluate reliability. • Average steps to goal assesses efficiency.
Gradual Sensor Failure • Battery power is used up, and dust accumulates on the sensors.
Intermittent Actuator Failure • The right motor control signal fails.
Conclusion • RLPOMDP exhibits robust behavior in the presence of sensor and actuator degradation. • Future work: scaling up the problem. • To overcome the scaling problem of RL's table lookup, neural nets could be used (learn/forget cycle). • To increase the size of the state space the POMDP can handle, non-optimal solution algorithms will be investigated. • New behaviors will be added.