Training with Hints
References • Abu-Mostafa, Y. S., “Learning from Hints in Neural Networks,” Journal of Complexity 6, 1990 • Al-Mashouq, K., and Reed, I., “Including Hints in Training Neural Nets,” Neural Computation 3, 1991 • Abu-Mostafa, Y. S., “A Method for Learning from Hints,” Advances in Neural Information Processing Systems, 1996
Stuff you already know • Neural net tries to learn a function f: X → Y • Output of the neural net is g: X → Y • Some approximation to f • Error measure ε(g, f) • Typically ε = E[(g(x) - f(x))²] • Learning takes place via a set of examples {(x1, f(x1)), …, (xN, f(xN))}
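A minimal sketch of this standard setup, assuming PyTorch; the target f(x) = sin(x), the network size, and the learning rate are illustrative choices, not taken from the references:

```python
# Baseline: learn g ~ f from examples {(x_n, f(x_n))} by minimizing
# the mean squared error eps = E[(g(x) - f(x))^2].
import torch

torch.manual_seed(0)
f = lambda x: torch.sin(x)                       # hypothetical target f: X -> Y
g = torch.nn.Sequential(                         # the net's output g: X -> Y
    torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

x = torch.linspace(-3, 3, 50).unsqueeze(1)       # example inputs x_1 .. x_N
y = f(x)                                         # targets f(x_n)

opt = torch.optim.SGD(g.parameters(), lr=0.05)
for _ in range(200):
    err = ((g(x) - y) ** 2).mean()               # eps = E[(g(x) - f(x))^2]
    opt.zero_grad(); err.backward(); opt.step()  # learn from the example set H_0
```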
Hints • Set of data examples is a special case of a “Hint”: we’ll call it H0 • Other “hint sets” will be denoted by H1,…, Hm • Determine Hm’s from a priori knowledge of underlying function f • e.g., invariance, monotonicity, even-or-odd, etc.
Hints • Idea behind hints: create “hint” data consisting of pairs {fm(x), gm(x)} such that we can minimize ε(fm(x), gm(x)), where ε(·) is the error function of the neural network • We can then backpropagate that error to update the weights
Example: Invariance • Say we have an invariance in the function, such that for two distinct inputs x1 and x2 we have the relationship f(x1) = f(x2) • We then minimize the error (y1 - y2)², where y denotes the output of the neural net • This yields the hint error εm = (g(x1) - g(x2))², which can be backpropagated just like the error on the training examples
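A sketch of backpropagating this invariance error, assuming PyTorch; the `transform` function, the batch size, and the example symmetry x -> -x are illustrative assumptions, not taken from the references:

```python
# Invariance hint: we assume f(x) = f(transform(x)), so any gap between the
# two outputs of the net is pure error, and no labels f(x) are needed.
import torch

g = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
opt = torch.optim.SGD(g.parameters(), lr=0.05)

def invariance_hint_error(g, x, transform):
    y1, y2 = g(x), g(transform(x))     # y1 = g(x1), y2 = g(x2) with f(x1) = f(x2)
    return ((y1 - y2) ** 2).mean()     # (y1 - y2)^2, averaged over the hint batch

x_hint = torch.empty(32, 1).uniform_(-3, 3)              # inputs drawn for the hint
err_m = invariance_hint_error(g, x_hint, lambda x: -x)   # e.g. symmetry about zero
opt.zero_grad(); err_m.backward(); opt.step()            # backpropagate the hint error
```

Averaging over the batch of hint examples anticipates the “Average Error” slide below: the update is driven by the mean hint error rather than by a single example.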
Other examples • f is an even function: εm = (g(x) - g(-x))² • f is monotonic: given that f(x1) < f(x2) for inputs x1 and x2, εm = (g(x1) - g(x2))² if g(x1) > g(x2), and 0 otherwise • f is known to lie within [ax, bx] for a given x: εm = (g(x) - ax)² if g(x) < ax, εm = (g(x) - bx)² if g(x) > bx, and 0 otherwise
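Each of these error measures can be written directly as a differentiable function of the network output, so each can be backpropagated the same way as the invariance error. A sketch, assuming PyTorch (function names are illustrative):

```python
# Hint error measures for the even-function, monotonicity, and known-range hints.
import torch

def even_hint_error(g, x):
    # f even => f(x) = f(-x); penalize the net for disagreeing with itself
    return ((g(x) - g(-x)) ** 2).mean()

def monotonicity_hint_error(g, x1, x2):
    # assumes f(x1) < f(x2); nonzero only when the net orders them the wrong way
    return (torch.clamp(g(x1) - g(x2), min=0.0) ** 2).mean()

def range_hint_error(g, x, a_x, b_x):
    # assumes f(x) lies in [a_x, b_x]; zero error whenever g(x) is inside the interval
    below = torch.clamp(a_x - g(x), min=0.0)
    above = torch.clamp(g(x) - b_x, min=0.0)
    return (below ** 2 + above ** 2).mean()
```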
Average Error • Rather than back-propagating εm for a single example, we should select a large number N of examples of each hint type Hm and update the weights based on the average error Em = (1/N) Σn εm(xn)
Learning Schedule • We wish to minimize the penalty function E = Σm αm Em, where the αm's represent scaling factors weighting the importance of each hint • The αm's are often not known, or not effectively knowable • Instead, use a learning schedule, focusing on a single Hm at a time, chosen by some algorithm
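A sketch of minimizing the weighted penalty directly, assuming PyTorch; the α values, the cos(x) target (chosen because it is even, so the hint is consistent with it), and the network are illustrative, not from the references:

```python
# Weighted penalty E = alpha_0 * E_0 + alpha_1 * E_1 over the data error (H_0)
# and an even-function hint (H_1). In practice the alphas are hard to choose,
# which is what motivates the learning schedules on the next slide.
import torch

torch.manual_seed(0)
g = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
opt = torch.optim.SGD(g.parameters(), lr=0.05)

x = torch.linspace(-3, 3, 64).unsqueeze(1)
y = torch.cos(x)                                  # toy even target, so the hint holds
alpha = [1.0, 0.3]                                # illustrative weights only

for _ in range(200):
    E0 = ((g(x) - y) ** 2).mean()                 # data error, hint H_0
    E1 = ((g(x) - g(-x)) ** 2).mean()             # even-function hint H_1
    E = alpha[0] * E0 + alpha[1] * E1             # weighted penalty
    opt.zero_grad(); E.backward(); opt.step()
```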
Learning Schedule: Examples • Simple Rotation: rotate through H0, …, Hm in a fixed, uniform manner • Effective when the Em's are similar • Weighted Rotation: rotate between hints based on the importance or difficulty of learning each hint • Has problems similar to using the αm's directly • Maximum Error / Maximum Weighted Error: at each step, the algorithm updates based on the hint with the largest Em, or the largest weighted error b*Em • Adaptive Minimization: for each Em, estimate the total error E as a function of all the other Em's • Choose the hint for which the corresponding estimate is smallest
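A sketch of two of these schedules under the same illustrative assumptions as above (PyTorch, a toy even target); the maximum-error rule simply picks, at each step, the hint that is currently violated the most:

```python
# Learning schedules: each step trains on a single hint chosen by the schedule,
# instead of minimizing a fixed weighted sum of all hint errors.
import torch

torch.manual_seed(0)
g = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
opt = torch.optim.SGD(g.parameters(), lr=0.05)
x = torch.linspace(-3, 3, 64).unsqueeze(1)
y = torch.cos(x)                                   # toy even target

hint_errors = [
    lambda: ((g(x) - y) ** 2).mean(),              # E_0: data error (hint H_0)
    lambda: ((g(x) - g(-x)) ** 2).mean(),          # E_1: even-function hint
]

for step in range(200):
    # Simple rotation would be: m = step % len(hint_errors)
    # Maximum-error schedule: pick the hint with the largest current E_m
    with torch.no_grad():
        m = max(range(len(hint_errors)), key=lambda i: hint_errors[i]().item())
    err = hint_errors[m]()                         # recompute with grad enabled
    opt.zero_grad(); err.backward(); opt.step()
```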