On-Line Learning with Recycled Examples: A Cavity Analysis
Peixun Luo and K. Y. Michael Wong
Hong Kong University of Science and Technology
Formulation
[Figure: single-layer network with weights J_j and activation y]
Inputs: ξ_j, j = 1, ..., N
Weights: J_j, j = 1, ..., N
Activation: y = J · ξ
Output: S = f(y)
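As a minimal sketch of this forward pass (the choice f = sign, the dimension, and all variable names below are illustrative assumptions, not taken from the slides):

```python
import numpy as np

# Forward pass of the single-layer network: activation y = J . xi, output S = f(y).
# The transfer function (sign) and the sizes below are illustrative assumptions.
N = 100
rng = np.random.default_rng(0)

J = rng.normal(size=N) / np.sqrt(N)   # weights J_j
xi = rng.normal(size=N)               # one input pattern xi_j

y = J @ xi                            # activation y = J . xi
S = np.sign(y)                        # output S = f(y)
```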
The Learning of a Task
Given p = αN examples with
inputs: ξ_j^μ, j = 1, ..., N, μ = 1, ..., p,
outputs: y^μ, generated by a teacher network.
Learning is done by defining a risk function and minimizing it by gradient descent.
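A possible way to generate such a training set; the linear teacher rule y^μ = B·ξ^μ below is an assumption for illustration, since the slides do not specify the teacher's transfer function:

```python
import numpy as np

# Generate p = alpha*N examples labelled by a teacher network.
# The linear teacher output y^mu = B . xi^mu is an illustrative assumption.
N, alpha = 100, 2.0
p = int(alpha * N)
rng = np.random.default_rng(1)

B = rng.normal(size=N) / np.sqrt(N)   # teacher weights (assumed)
xi = rng.normal(size=(p, N))          # inputs xi_j^mu, one row per example
y = xi @ B                            # teacher outputs y^mu
```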
Learning Dynamics
Define a cost function in terms of the examples:
E = Σ_μ E^μ + regularization terms.
On-line learning: at time t, draw an example σ(t), and
ΔJ_j ~ gradient with respect to example σ(t) + weight decay.
Batch learning: at time t,
ΔJ_j ~ average gradient over all examples + weight decay.
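To make the two dynamics concrete, here is a minimal sketch of one on-line step and one batch step, assuming a quadratic per-example cost E^μ = (y^μ − J·ξ^μ)²/2 with learning rate η and weight-decay strength λ; these choices are illustrative, not the exact cost used in the talk:

```python
import numpy as np

# One update of each type, under an assumed quadratic per-example cost
# E^mu = (y^mu - J.xi^mu)^2 / 2, learning rate eta, weight-decay strength lam.
def online_step(J, xi_mu, y_mu, eta=0.05, lam=0.01):
    # gradient of E^mu for the single drawn example sigma(t) = mu, plus weight decay
    grad = -(y_mu - J @ xi_mu) * xi_mu
    return J - eta * (grad + lam * J)

def batch_step(J, xi, y, eta=0.05, lam=0.01):
    # average gradient over all p examples (rows of xi), plus weight decay
    errors = y - xi @ J                          # shape (p,)
    grad = -(errors[:, None] * xi).mean(axis=0)  # shape (N,)
    return J - eta * (grad + lam * J)
```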
The Cavity Method
The cavity method has been applied to many complex systems, and to the steady-state properties of learning. It uses a self-consistency argument to consider what happens when a set of p examples is expanded to p + 1 examples. The central quantity is the cavity activation: the activation of example 0 in a network that learns examples 1 to p (but never learns example 0). Since that network has no information about example 0, the cavity activation obeys a random distribution (e.g. a Gaussian). Now suppose the network incorporates example 0 at time s; its activation is then no longer random.
Linear Response
[Figure: activations h(t) and X(t) versus time; one diffuses randomly, the other receives a stimulus at time s]
The cavity activation diffuses randomly. The generic activation, receiving a stimulus at time s, is no longer random. The background examples also adjust due to the newcomer. Assuming that the background adjustments are small, we can use linear response theory to superpose the effects due to all previous times s.
Useful Equations
For batch learning: the generic activation of an example at time t = the cavity activation of that example at time t + an integral over s of (Green's function from time s to t) × (gradient term at time s).
For on-line learning: the generic activation of an example at time t = the cavity activation of that example at time t + a sum over the learning instants s of (Green's function from time s to t) × (gradient term at time s).
The learning instants s are Poisson distributed.
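Written schematically in LaTeX under assumed notation (x_μ for the generic activation, h_μ for the cavity activation, G for the Green's function, g_μ for the gradient term; these symbols are not taken from the slides), the two relations read:

\begin{align}
  \text{batch:}\quad   x_\mu(t) &= h_\mu(t) + \int_0^t ds\, G(t,s)\, g_\mu(s), \\
  \text{on-line:}\quad x_\mu(t) &= h_\mu(t) + \sum_{s_i < t} G(t,s_i)\, g_\mu(s_i),
\end{align}

where the learning instants s_i at which the example is drawn are Poisson distributed.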
Simulation Results
[Figure: generic activation showing giant boosts at the Poisson-distributed learning instants; line = cavity activation from theory, dots = simulation with the example removed]
Theory and simulations agree!
Further Development
[Figure: evolution of the training error and the generalization error]
Theory and simulations agree!
Critical Learning Rate (1)
[Figure: the critical learning rate at which learning diverges, compared with other approximations]
Theory and simulations agree!
Critical Learning Rate (2)
[Figure: the critical learning rate at which learning diverges, compared with other approximations]
Theory and simulations agree!
Average Learning
[Figure: the generalization error drops when the dynamics is averaged over monitoring periods]
Theory and simulations agree!
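As a rough illustration of the averaging idea, the sketch below time-averages the student weights recorded over a monitoring period before the error is measured; reading "averaging the dynamics" as weight averaging, and the window length, are assumptions rather than the authors' exact protocol.

```python
import numpy as np

# Hypothetical illustration of "average learning": evaluate the network using
# weights averaged over a monitoring period instead of the instantaneous J(t).
# Treating the averaging as weight averaging is an assumption for illustration.
def averaged_weights(J_history, window):
    """Average the last `window` weight snapshots recorded during learning."""
    J_history = np.asarray(J_history)
    return J_history[-window:].mean(axis=0)
```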
Conclusion We have analysed the dynamics of on-line learning with recycled examples using the cavity approach. Theory is able to reproduce the Poisson-distributed giant boosts of the activations during learning. Theory and simulations agree well on: the evolution of the training and generalization errors, the critical learning rate at which learning diverges, the performance of average learning. Future: to develop a Monte Carlo sampling procedure for multilayer networks.