Octopus Arm: Mid-Term Presentation
Dmitry Volkinshtein & Peter Szabo
Supervised by: Yaki Engel
Contents • Project’s Goal • Octopus Arm and its Model • The Learning Process • The On-Line GPTD Algorithm • Project Development Stages • Program Structure • Our work so far • What’s left to be done
1. Project’s Goal Teach an octopus arm model to reach a given point in space.
2. Octopus Arm and its Model (1/4) An octopus is a carnivorous, eight-armed sea creature with a large soft head and two rows of suckers on the underside of each arm; it usually lives on the ocean floor.
2. Octopus Arm and its Model (2/4) An octopus arm is a muscular hydrostat: an organ capable of exerting force using muscles alone, without requiring a rigid skeleton.
2. Octopus Arm and its Model (3/4) • The model simulates the physical behavior of the arm in its natural environment. • The model computes the position of the arm, taking into account: • Muscle forces. • Internal forces that keep the arm’s volume constant (the arm is filled with liquid). • Gravity and buoyancy (vertical forces). • The drag of the water.
2. Octopus Arm and its Model (4/4) • The real octopus arm is continuous. The model approximates the arm by dividing it into segments and calculating the forces on each segment separately. • The model we were given is the outcome of a previous project in this lab. It is two-dimensional and written in C.
3. The Learning Process (1/4) We use Reinforcement Learning (RL) methods to teach our model: • Reinforcement learning is the problem faced by an agent that must learn behavior through trial-and-error interactions with a dynamic environment. • RL lets us program the agent by reward and punishment, without specifying how the task is to be achieved.
3. The Learning Process (2/4) In our case: • The agent chooses which muscles to activate at each time step. • The model provides the result of that activation (the next state of the arm). • The reward the agent gets depends on the arm’s state. A minimal sketch of this interaction loop is shown below.
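The following C++ sketch shows the interaction loop described above. The class and method names (ArmModel, Agent, State, Action, runEpisode) are illustrative placeholders, not the project's actual interfaces:

```cpp
// Hypothetical sketch of the agent-environment loop.  Class and method
// names (ArmModel, Agent, State, Action) are illustrative only, not the
// project's actual interfaces.
#include <cstddef>

struct State  { /* positions and velocities of the arm segments */ };
struct Action { /* muscle activation levels for one time step   */ };

struct ArmModel {
    State  reset()                            { return State{}; }  // initial posture
    State  step(const State&, const Action&)  { return State{}; }  // simulate one step
    double reward(const State&)               { return 0.0; }      // e.g. distance to goal
};

struct Agent {
    Action chooseAction(const State&)         { return Action{}; } // follow the policy
    void   learn(const State&, const Action&, double, const State&) {} // value update
};

// One learning episode: the agent activates muscles, the model returns the
// resulting arm state, and the reward depends on that state.
void runEpisode(ArmModel& model, Agent& agent, std::size_t steps) {
    State s = model.reset();
    for (std::size_t t = 0; t < steps; ++t) {
        Action a    = agent.chooseAction(s);
        State  next = model.step(s, a);
        double r    = model.reward(next);
        agent.learn(s, a, r, next);
        s = next;
    }
}
```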
3. The Learning Process (3/4) • In RL the agent chooses its action in each state according to a “policy”. In order to improve the policy, we need to estimate the “value” of each state under that policy (see the definition below). • For that we use an Optimistic Policy Iteration (OPI) algorithm, meaning the policy changes in each iteration without waiting for the value estimate to converge.
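For reference, the value of a state under a policy is the standard discounted-return definition, with γ the discount factor:

```latex
% Value of state x under policy \pi: expected sum of discounted rewards
V^{\pi}(x) \;=\; \mathbb{E}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t}\, r(x_t) \;\middle|\; x_0 = x \,\right],
\qquad 0 \le \gamma < 1 .
```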
3. The Learning Process (4/4) • For the OPI, we will try two exploration methods: • Probabilistic Greedy • Softmax • Since the model’s state space is continuous, we use the On-Line GPTD algorithm for the value estimation. (Sketches of both selection rules are given below.)
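Generic sketches of the two exploration rules, assuming a discrete set of candidate activations with estimated values; the exact forms and parameter values used in the project are not specified on the slides:

```cpp
// Generic sketches of the two exploration rules over a discrete set of
// candidate actions with estimated values; the exact forms and parameter
// values used in the project are assumptions.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

std::mt19937 rng{std::random_device{}()};

// Probabilistic ("epsilon") greedy: with probability eps pick a random
// action, otherwise pick the action with the highest estimated value.
std::size_t probabilisticGreedy(const std::vector<double>& values, double eps) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    if (coin(rng) < eps) {
        std::uniform_int_distribution<std::size_t> any(0, values.size() - 1);
        return any(rng);
    }
    return static_cast<std::size_t>(
        std::max_element(values.begin(), values.end()) - values.begin());
}

// Softmax (Boltzmann) selection: actions are drawn with probability
// proportional to exp(value / temperature); a high temperature explores more.
std::size_t softmaxSelect(const std::vector<double>& values, double temperature) {
    const double maxV = *std::max_element(values.begin(), values.end());
    std::vector<double> weights(values.size());
    for (std::size_t i = 0; i < values.size(); ++i)
        weights[i] = std::exp((values[i] - maxV) / temperature); // shift for stability
    std::discrete_distribution<std::size_t> dist(weights.begin(), weights.end());
    return dist(rng);
}
```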
4. On-Line GPTD Algorithm (1/4) • TD(λ) – a family of algorithms in which temporal differences are used to estimate the value function on-line. • GPTD – Gaussian Processes for TD learning: assume that the sequence of rewards is a Gaussian random process (with noise), and that the rewards we observe are samples of that process. We can then estimate the value function using Gaussian estimation and a kernel function (see the model below).
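One standard way to write the GPTD model (following Engel et al.); the slides do not spell out the exact formulation used in the project, so this is included only as a reminder of the form of the estimator:

```latex
% GPTD: the value function V is a Gaussian process, and each observed
% reward is a noisy temporal-difference measurement of it.
r(x_t) \;=\; V(x_t) - \gamma\, V(x_{t+1}) + N_t,
\qquad V \sim \mathcal{GP}\bigl(0,\, k(\cdot,\cdot)\bigr),
\quad N_t \sim \mathcal{N}(0, \sigma^2).

% Conditioning on the observed rewards yields a Gaussian posterior over V,
% whose mean and variance at a state x are expressed through the kernel:
\hat{V}_t(x) = \mathbf{k}_t(x)^{\top} \boldsymbol{\alpha}_t,
\qquad
p_t(x) = k(x, x) - \mathbf{k}_t(x)^{\top} C_t\, \mathbf{k}_t(x).
```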
4. On-Line GPTD Algorithm (2/4) • GPTD disadvantages: • Space consumption of O(t²). • Time consumption of O(t³). • The proposed solution: On-Line Sparsification applied to the GPTD algorithm.
4. On-Line GPTD Algorithm (3/4) On-Line Sparsification: instead of storing the feature-space images of all the input vectors seen so far, we keep a “dictionary” of input vectors whose images span, up to an accuracy threshold, the images of all the others. A sketch of the dictionary admission test is given below.
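A sketch of the dictionary admission test (the approximate-linear-dependence test of Engel et al.). The Gaussian kernel and the direct linear solve are illustrative assumptions; the full on-line algorithm maintains the inverse dictionary kernel matrix recursively instead of re-solving it each step, which is what produces the recursive update rules on the next slide:

```cpp
// Sketch of the dictionary admission (approximate linear dependence) test
// used by on-line sparsification.  Kernel choice and the direct solve are
// illustrative assumptions, not the project's implementation.
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Example kernel (Gaussian / RBF); the actual kernel is a parameter to be studied.
double kernel(const Vec& a, const Vec& b) {
    double d2 = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d2 += (a[i] - b[i]) * (a[i] - b[i]);
    return std::exp(-d2);
}

// Solve K * x = b by Gaussian elimination; K is the small (m x m) dictionary
// kernel matrix, so this stays cheap.
Vec solve(std::vector<Vec> K, Vec b) {
    const std::size_t m = b.size();
    for (std::size_t i = 0; i < m; ++i) {
        const double piv = K[i][i];
        for (std::size_t j = i; j < m; ++j) K[i][j] /= piv;
        b[i] /= piv;
        for (std::size_t r = 0; r < m; ++r) {
            if (r == i) continue;
            const double f = K[r][i];
            for (std::size_t j = i; j < m; ++j) K[r][j] -= f * K[i][j];
            b[r] -= f * b[i];
        }
    }
    return b;
}

// Add x to the dictionary only if its kernel-space image cannot be spanned
// by the current dictionary to within the accuracy threshold nu.
bool maybeAddToDictionary(std::vector<Vec>& dict, const Vec& x, double nu) {
    const std::size_t m = dict.size();
    if (m == 0) { dict.push_back(x); return true; }

    std::vector<Vec> K(m, Vec(m));
    Vec kx(m);
    for (std::size_t i = 0; i < m; ++i) {
        kx[i] = kernel(dict[i], x);
        for (std::size_t j = 0; j < m; ++j) K[i][j] = kernel(dict[i], dict[j]);
    }
    const Vec a = solve(K, kx);      // best approximation coefficients
    double delta = kernel(x, x);     // residual of that approximation
    for (std::size_t i = 0; i < m; ++i) delta -= kx[i] * a[i];

    if (delta > nu) { dict.push_back(x); return true; }
    return false;
}
```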
4. On-Line GPTD Algorithm (4/4) • Applying the on-line sparsification to the GPTD algorithm yields: • Recursive update rules. • No matrix inversion needed. • Matrix dimensions depend on m_t (the dictionary size at time t), which generally does not grow linearly with t. • Using those, we can calculate the value estimate and its variance in O(m_t) and O(m_t²) time, respectively.
5. Project Development Stages • Learning how to use the octopus arm model. • Understanding the theoretical basis (RL & On-Line GPTD). • Adjusting the model program to our needs. • Implementing the On-Line GPTD algorithm as a general-purpose module. • Implementing an agent that uses the model and the On-Line GPTD algorithm to perform the RL task. • Testing the learning program with different parameters to find optimal and interesting results: • Model parameters (activations, times, lengths, number of segments, etc.). • On-Line GPTD parameters (kernel functions, Gaussian noise variance, discount factor, accuracy threshold). • Agent parameters (state exploration methods, goals, reward functions). • Conclusions.
6. Work done so far • The model code was studied and adjusted to our needs. • After studying the theoretical basis, a generic On-Line GPTD module was implemented. • An agent supporting different exploration methods was implemented. • All modules were successfully integrated in the C++ environment.
7. Program Structure [Block diagram of the program modules: Environment, Arm Model, Agent, Explorer, and On-Line GPTD.] A sketch of how these modules might fit together follows.
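The slide only names the modules, not how they are connected, so the wiring in this skeleton is an assumption rather than the project's actual structure:

```cpp
// Hypothetical skeleton only: the diagram names the modules (Environment,
// Arm Model, Agent, Explorer, On-Line GPTD) but their relationships as
// shown here are assumptions, not the project's code.
class ArmModel    { /* wraps the C simulation of the arm */ };
class Environment { ArmModel model; /* produces next states and rewards */ };
class OnlineGPTD  { /* kernel-based value estimator with sparsification */ };
class Explorer    { /* probabilistic-greedy or softmax action selection */ };
class Agent {
    OnlineGPTD value;      // value estimation
    Explorer   explorer;   // exploration policy
    // interacts with the Environment at every time step
};
```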
8. Work left to be done • Testing the learning program with different parameters to find optimal and interesting results, as specified earlier. • Conclusions.