ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 17: TRTRL, Implementation Considerations, Apprenticeship Learning
November 3, 2010
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
Outline
• Recap on RNNs
• Implementation and usage issues with RTRL
• Computational complexity and resources required
• Vanishing gradient problem
• Apprenticeship learning
Recap on RNNs
• RNNs are potentially much more powerful than FFNNs
  • Can capture temporal dependencies
  • Embed a complex state representation (i.e. memory)
  • Models of discrete-time dynamic systems
• They are (very) complex to train
  • TDNN – performance limited by the window size
  • RTRL – calculates a dynamic gradient on-line
RTRL reviewed
• RTRL is a gradient-descent-based method
• It relies on sensitivities expressing the impact of any weight wij on the activation of neuron k
• The algorithm then computes the weight changes from these sensitivities
• Let's look at the resources involved …
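To make the recursion concrete, here is a minimal NumPy sketch of one RTRL step for a fully connected recurrent network (a sketch only: the tanh activation, the absence of external inputs, and the learning rate are simplifying assumptions, not taken from the lecture). It implements the standard Williams–Zipser sensitivity recursion and weight update.

```python
import numpy as np

def rtrl_step(W, y, p, target, lr=0.01):
    """One RTRL step. W: (N, N) weights, y: (N,) activations,
    p: (N, N, N) sensitivities p[k, i, j] = dy_k / dw_ij,
    target: (N,) desired outputs (np.nan where no target is defined)."""
    N = len(y)
    net = W @ y                        # net input to each neuron
    y_new = np.tanh(net)
    fprime = 1.0 - y_new ** 2          # tanh'(net)

    # Sensitivity recursion:
    # p_k_ij(t+1) = f'(net_k) * [ sum_l w_kl * p_l_ij(t) + delta_ik * y_j(t) ]
    p_new = np.einsum('kl,lij->kij', W, p)
    for i in range(N):
        p_new[i, i, :] += y            # the delta_ik * y_j(t) term
    p_new *= fprime[:, None, None]

    # Error is defined only on neurons that have a target
    e = np.where(np.isnan(target), 0.0, target - y_new)

    # Weight update: dw_ij = lr * sum_k e_k * p_k_ij
    W_new = W + lr * np.einsum('k,kij->ij', e, p_new)
    return W_new, y_new, p_new
```

The (N, N, N) sensitivity tensor and the einsum over all of its entries are exactly the O(N^3) storage and O(N^4) computation discussed on the next slides.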
Implementing RTRL – computations involved
• The key component in RTRL is the sensitivity matrix
• It must be calculated for each neuron: O(N^3) sensitivity values, updated at a cost of O(N^4) operations per time step
• RTRL, however, is NOT local …
• Can the calculations be efficiently distributed?
Implementing RTRL – storage requirements
• Let's assume a fully-connected network of N neurons
• Memory resources:
  • Weight matrix, wij – N^2
  • Activations, yk – N
  • Sensitivity matrix – N^3
• Total memory requirement: O(N^3)
• Let's go over an example:
  • Assume we have 1000 neurons in the system
  • Each value requires 20 bits to represent
  • ~20 Gb of storage!!
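A quick back-of-the-envelope check of the example above (plain arithmetic using only the numbers on the slide):

```python
N = 1000                          # neurons in the example
bits_per_value = 20               # bits per stored value, per the slide
values = N**3 + N**2 + N          # sensitivities + weights + activations
print(values * bits_per_value / 1e9, "Gbit")   # ~20 Gbit, dominated by the N^3 sensitivities
```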
Possible solutions – static subgrouping
• Zipser et al. (1989) suggested a static grouping of neurons
  • Relaxes the "fully-connected" requirement
  • Has backing in neuroscience: the average "branching factor" in the brain is ~1000
• Reduces complexity by simply leaving out elements of the sensitivity matrix, based on a subgrouping of the neurons
  • Neurons are subgrouped arbitrarily
  • Sensitivities between groups are ignored
  • All connections still exist in the forward path
• If g is the number of subgroups, then …
  • Storage is O(N^3/g^2)
  • Computational speedup is g^3
  • Communication: each node communicates with N/g nodes
Possible solutions – static subgrouping (cont.)
• Zipser's empirical tests indicate that these networks can solve many of the problems full RTRL solves
• One caveat of subgrouped RTRL training is that each subnet must have at least one unit for which a target exists (since gradient information is not exchanged between groups)
• Others have proposed dynamic subgrouping
  • Subgrouping based on maximal gradient information
  • Not realistic for hardware realization
• Open research question: how to calculate the gradient without the O(N^3) storage requirement?
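The storage and speedup figures quoted for subgrouping can be reproduced with a rough counting sketch (my own arithmetic, under the assumption that each of the g subgroups is trained as an independent RTRL network of N/g units):

```python
def rtrl_resources(N, g=1):
    """Approximate resource counts: full RTRL for g=1, static subgrouping for g>1.
    Each subgroup keeps its own (N/g)^3 sensitivities and pays O((N/g)^4) per step."""
    n = N // g
    storage = g * n ** 3
    ops = g * n ** 4
    return storage, ops

full_storage, full_ops = rtrl_resources(1000)
sub_storage, sub_ops = rtrl_resources(1000, g=10)
print(full_storage / sub_storage, full_ops / sub_ops)   # ~g^2 less storage, ~g^3 fewer operations
```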
Truncated Real Time Recurrent Learning (TRTRL)
• Motivation: obtain a scalable version of the RTRL algorithm while minimizing performance degradation
• How? Limit the sensitivities of each neuron to its ingress (incoming) and egress (outgoing) links
Performing Sensitivity Calculations in TRTRL
• For all nodes that are not in the output set, the egress sensitivity values for node i are calculated by imposing k = j in the original RTRL sensitivity equation
• Similarly, the ingress sensitivity values for node j are obtained from the same equation
• For output neurons, a nonzero sensitivity element must exist in order to update the weights
Resource Requirements of TRTRL
• The network structure remains the same in TRTRL; only the calculation of sensitivities is reduced
• Significant reduction in resource requirements …
  • Computational load for each neuron drops from O(N^3) to O(2KN), where K denotes the number of output neurons
  • Total computational complexity is now O(2KN^2)
  • Storage requirements drop from O(N^3) to O(N^2)
• Example revisited: for N = 100 and 10 outputs, ~100k multiplications and only ~20 kB of storage!
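A small scaling comparison between full RTRL and TRTRL (the code only tracks the asymptotic trends stated above; constant factors are my own simplification):

```python
def rtrl_cost(N):
    """Full RTRL: O(N^3) stored sensitivities, O(N^4) operations per time step."""
    return N ** 3, N ** 4

def trtrl_cost(N, K):
    """TRTRL, per the slide: O(N^2) storage, O(K * N^2) operations per time step."""
    return N ** 2, K * N ** 2

for N in (100, 1000):
    (s_full, c_full), (s_tr, c_tr) = rtrl_cost(N), trtrl_cost(N, K=10)
    print(f"N={N}: storage x{s_full // s_tr} smaller, computation x{c_full // c_tr} smaller")
```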
Further TRTRL Improvements – Clustering of Neurons
• TRTRL introduced localization and memory improvements
• Clustered TRTRL adds scalability by reducing the number of long connection lines between processing elements
(Figure: clustered network topology with its input and output connections)
Test case #1: Frequency Doubler
• Input: sin(x); target output: sin(2x)
• Both networks had 12 neurons
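One possible way to generate the training signal for this test case (the sampling step and sequence length are assumptions; the slide does not specify them):

```python
import numpy as np

# Frequency doubler: the network sees sin(x) one sample at a time and must
# output sin(2x) at the same time step.
x = np.arange(0.0, 20.0 * np.pi, 0.1)   # assumed sampling grid
inputs = np.sin(x)
targets = np.sin(2.0 * x)
```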
Vanishing Gradient Problem
• Recap on goals: find temporal dependencies in the data with an RNN
• The idea behind RTRL: when an error value is found, apply it to inputs seen an indefinite number of time steps ago
• In 1994, Bengio et al. showed that both BPTT and RTRL suffer from the problem of vanishing gradient information
  • When using gradient-based training rules, the "error signal" that is applied to previous inputs tends to vanish
  • Because of this, long-term dependencies in the data are often overlooked
  • Short-term memory is OK; long-term memory (>10 time steps) is lost
Vanishing Gradient Problem (cont.)
• A learning error yields gradients on the outputs yt, and therefore on the state variables st
• Since the weights (parameters) are shared across time, the gradient at time t is a product of Jacobians reaching back through all earlier time steps, and repeated multiplication makes that product shrink (or blow up)
(Figure: RNN unrolled in time, with input xt, state st, and output yt)
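A tiny numerical illustration of the effect (network size, weight scale, and tanh dynamics are arbitrary choices for the demo): the norm of the accumulated Jacobian product, which scales the error signal reaching t steps into the past, decays geometrically.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20
W = rng.normal(scale=0.5 / np.sqrt(N), size=(N, N))   # contractive recurrent weights (assumed scale)
s = rng.normal(size=N)                                # initial state s_0

grad = np.eye(N)                                      # accumulates d s_t / d s_0
for t in range(1, 51):
    s = np.tanh(W @ s)                                # state update s_t = tanh(W s_{t-1})
    J = (1.0 - s ** 2)[:, None] * W                   # Jacobian d s_t / d s_{t-1}
    grad = J @ grad
    if t % 10 == 0:
        print(t, np.linalg.norm(grad))                # shrinks roughly geometrically with t
```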
What is Apprenticeship Learning?
• Often we want to train an agent based on a reference controller
  • Riding a bicycle
  • Flying a plane
• Starting from scratch may take a very long time
  • Particularly for large state/action spaces
  • May cost a lot (e.g. crashing a helicopter)
• Process:
  • Train the agent on the reference controller
  • Evaluate the trained agent
  • Improve the trained agent
• Note: the reference controller can be anything (e.g. a heuristic controller for the Car Race problem)
Formalizing Apprenticeship Learning
• Assume we have a reference policy p from which we want our agent to learn
• We would first like to learn its (approximate) value function, Vp
• Once we have Vp, we can try to improve on p using the policy improvement theorem, i.e. p'(s) = argmax_a Σ_s' P(s'|s,a) [R(s,a,s') + γ Vp(s')]
• By acting greedily with respect to the original policy's value function we obtain a better policy!
• In practice, many issues must be considered, such as state-space coverage and exploration/exploitation
  • Train with zero exploration, then explore gradually …
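A minimal tabular sketch of this evaluate-then-improve loop, assuming a known transition model P and reward table R (the tabular setting, the variable names, and the discount factor are illustrative assumptions, not the lecture's notation):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.95, tol=1e-8):
    """Iterative policy evaluation of the reference policy pi.
    P[a] is an (S, S) next-state transition matrix, R an (S, A) expected-reward
    table, pi an (S,) array giving the reference controller's action in each state."""
    S = R.shape[0]
    V = np.zeros(S)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ V for s in range(S)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def greedy_improvement(P, R, V, gamma=0.95):
    """Policy improvement: act greedily with respect to the learned value function."""
    S, A = R.shape
    Q = np.array([[R[s, a] + gamma * P[a, s] @ V for a in range(A)] for s in range(S)])
    return Q.argmax(axis=1)   # pi'(s) = argmax_a Q(s, a)
```

In practice the value function of the reference controller would usually be estimated from demonstration trajectories (e.g. with TD learning) rather than computed from a known model, which is where the state-space coverage and exploration issues noted above come in.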