Last lecture summary • biologically motivated • synapses • A neuron accumulates (Σ) positive/negative stimuli from other neurons. • Then Σ is processed further, f(Σ), to produce an output, i.e. the neuron sends an output signal to the neurons connected to it.
Neural networks for applied science and engineering, Samarasinghe
x – inputs, w – weights, f(Σ) – activation (transfer) function, y – output • threshold neuron (McCulloch-Pitts) • only binary inputs and output • the weights are pre-set, no learning • set the threshold so that the classification is correct
Threshold w0 is incorporated as the weight of one additional input with input value x0 = 1.0. • Such an input is called the bias.
Perceptron • binary classifier, maps its input x (real-valued vector) to f(x), a binary value (0 or 1) • $f(x) = \begin{cases} 1 & \text{if } w \cdot x > 0 \text{ (including bias)} \\ 0 & \text{otherwise} \end{cases}$ • the perceptron can adjust its weights (i.e. it can learn) – perceptron learning algorithm
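A minimal sketch of the perceptron output rule, assuming the bias convention from the previous slide (an input x0 = 1.0 is prepended, so w[0] plays the role of the threshold weight w0). The function name and the hand-picked AND weights are illustrative and not taken from the lecture.

```python
def perceptron_output(w, x):
    """Return 1 if w . x > 0 (bias included), else 0.

    w -- weights, w[0] is the bias weight w0
    x -- real-valued inputs WITHOUT the bias input
    """
    x_with_bias = [1.0] + list(x)          # x0 = 1.0 carries the threshold
    s = sum(wi * xi for wi, xi in zip(w, x_with_bias))
    return 1 if s > 0 else 0


# Hand-set weights (an assumption for illustration) that realise logical AND:
# the weighted sum is positive only when both inputs are 1.
w_and = [-1.5, 1.0, 1.0]
outputs = [perceptron_output(w_and, x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

With pre-set weights like these the unit behaves as the threshold neuron from the earlier slide; learning only changes how the weights are obtained.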
Multiple output perceptron • for multicategory (i.e. more than 2 classes) classification • one output neuron for each class [Figure: input layer, output layer]
Learning • Learning means there exists an algorithm for setting the neuron's weights (the threshold w0 is also set). • delta rule – gradient descent • β – learning rate
iterative algorithm, one pass through the whole training set (epoch) is not enough • online learning • adjust weights after each input pattern presentation • weight oscillation may occur • batch learning • obtain the error gradient for each input pattern, average them at the end of the epoch
Supervised learning using delta rule • Transmit an input pattern through connections whose weights are initially set to random values. • The weighted inputs are summed, the output is produced, and is compared with the given target output to determine error for this pattern. • Inputs and target outputs are presented repeatedly, and the weights are adjusted using the delta rule at each iteration or after an epoch until the minimum possible square error is achieved. • This usually involves the iterative presentation of the entire training dataset many times.
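The procedure can be sketched for a single linear neuron. The slide's formula is not reproduced in this transcript, so the update below assumes the common form of the delta rule, Δw_i = β (t − y) x_i. The function names, the toy data set and the zero initial weights (the slide uses random ones; zeros keep the sketch reproducible) are illustrative.

```python
def predict(w, x):
    """Linear neuron: weighted sum of the inputs, with x0 = 1.0 as bias input."""
    return sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))


def train_online(data, beta=0.1, epochs=500):
    """Delta rule, online mode: adjust the weights after every pattern."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        for x, t in data:
            err = t - predict(w, x)
            xb = [1.0] + list(x)
            w = [wi + beta * err * xi for wi, xi in zip(w, xb)]
    return w


def train_batch(data, beta=0.1, epochs=500):
    """Delta rule, batch mode: average the per-pattern updates over an epoch."""
    w = [0.0] * (len(data[0][0]) + 1)
    for _ in range(epochs):
        total = [0.0] * len(w)
        for x, t in data:
            err = t - predict(w, x)
            xb = [1.0] + list(x)
            total = [ti + err * xi for ti, xi in zip(total, xb)]
        w = [wi + beta * ti / len(data) for wi, ti in zip(w, total)]
    return w


# Toy data generated from t = 2*x + 1 (noise-free, for illustration only).
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w_online = train_online(data)
w_batch = train_batch(data)
```

Both variants need many epochs here, which matches the remark that a single pass through the training set is not enough.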
Perceptron failure • Please help me draw the following functions on the blackboard: • AND, OR, XOR (eXclusive OR, true when exactly one of the operands is true, otherwise false) [Figure: three panels, AND, OR and XOR, on 0/1 axes]
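A small experiment, not part of the slides, that makes the failure concrete. It assumes the standard perceptron error-correction rule w ← w + β (t − y) x with β = 1 and zero initial weights; `train_perceptron` and `count_errors` are illustrative names.

```python
def output(w, x):
    """Threshold unit: 1 if w . x > 0, else 0 (x already contains x0 = 1)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0


def train_perceptron(patterns, beta=1, epochs=50):
    """Perceptron learning: correct the weights after each misclassified pattern."""
    w = [0, 0, 0]
    for _ in range(epochs):
        for x, t in patterns:
            err = t - output(w, x)            # -1, 0 or +1
            w = [wi + beta * err * xi for wi, xi in zip(w, x)]
    return w


def count_errors(w, patterns):
    """Number of patterns the trained perceptron still misclassifies."""
    return sum(output(w, x) != t for x, t in patterns)


inputs = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]   # leading 1 is the bias input
targets = {"AND": [0, 0, 0, 1], "OR": [0, 1, 1, 1], "XOR": [0, 1, 1, 0]}

errors = {}
for name, ts in targets.items():
    patterns = list(zip(inputs, ts))
    errors[name] = count_errors(train_perceptron(patterns), patterns)
    print(name, "misclassified patterns:", errors[name])
```

AND and OR end with no misclassified patterns, while XOR always keeps at least one, however long the training runs, because no single line separates its two classes.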
Play with http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html
The perceptron's decision boundary is linear (w∙x = 0), so only linearly separable problems can be solved. • 1969 – the famous book “Perceptrons” by Marvin Minsky and Seymour Papert showed that it is impossible for this class of networks to learn the XOR function. • They conjectured (incorrectly!) that a similar result would hold for a perceptron with three or more layers. • The often-cited Minsky/Papert text caused a significant decline in interest in, and funding of, neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s.
Play with http://www.eee.metu.edu.tr/~halici/courses/543java/NNOC/Perceptron.html
Nonlinear activation functions • So far we have met threshold and linear activation functions. • Both give a linear decision boundary, so the problems they can solve must also be linear (linearly separable). • Nonlinearity is introduced by using nonlinear activation functions.
logistic (sigmoid, unipolar) tanh (bipolar)
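The plots and formulas for these two functions did not survive in this transcript. The sketch below uses their standard definitions, logistic(u) = 1 / (1 + e^(−u)) and tanh(u), together with the derivatives that gradient-based training needs. The function names are illustrative.

```python
import math


def logistic(u):
    """Logistic (sigmoid) function: unipolar, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def logistic_derivative(u):
    """Derivative of the logistic function, expressed through its own output."""
    y = logistic(u)
    return y * (1.0 - y)


def tanh_derivative(u):
    """Derivative of tanh, the bipolar function with output in (-1, 1)."""
    return 1.0 - math.tanh(u) ** 2
```

Both derivatives can be computed from the neuron's output alone, which is convenient when gradients are propagated backward through the network.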
Multilayer perceptron • MLP, the most famous type of neural network [Figure: input layer, hidden layer, output layer]
three-layer vs. two-layer [Figure: input layer, hidden layer, output layer]
Backpropagation training algorithm • How to train an MLP? • With a gradient descent type of algorithm called backpropagation. • Training works in two passes: • forward pass • present a training sample to the neural network • compare the network's output to the desired output from that sample • calculate the error in each output neuron
backward pass • compute the amount ∆w by which the weights should be updated • first calculate the gradient for the hidden-to-output weights • then calculate the gradient for the input-to-hidden weights • knowledge of $\mathrm{grad}_{\text{hidden-output}}$ is necessary to calculate $\mathrm{grad}_{\text{input-hidden}}$ • update the weights in the network • It is a gradient descent method • learning rate β is used • can get trapped in local minima
Input signal propagates forward; error propagates backward.
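The two passes can be sketched for the smallest interesting case. The code assumes one hidden layer of logistic neurons, a single linear output neuron and the per-pattern error E = 0.5 (y − t)^2; none of these choices is fixed by the slides, and all function names are illustrative.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(x, W_hid, w_out):
    """Forward pass. Each weight vector starts with its bias weight."""
    xb = [1.0] + list(x)
    h = [logistic(sum(w * v for w, v in zip(row, xb))) for row in W_hid]
    hb = [1.0] + h
    y = sum(w * v for w, v in zip(w_out, hb))        # linear output neuron
    return xb, hb, y


def backward(x, t, W_hid, w_out):
    """Backward pass: gradients of E = 0.5 * (y - t)**2 for one pattern."""
    xb, hb, y = forward(x, W_hid, w_out)
    delta_out = y - t                                # error at the output neuron
    grad_out = [delta_out * v for v in hb]           # hidden-to-output gradient
    grad_hid = []
    for j in range(len(W_hid)):
        h = hb[j + 1]
        # the output error is sent back through w_out, then through f'(.)
        delta_h = delta_out * w_out[j + 1] * h * (1.0 - h)
        grad_hid.append([delta_h * v for v in xb])   # input-to-hidden gradient
    return grad_hid, grad_out


def train_step(x, t, W_hid, w_out, beta=0.5):
    """One online update: w <- w - beta * gradient."""
    grad_hid, grad_out = backward(x, t, W_hid, w_out)
    W_hid = [[w - beta * g for w, g in zip(row, grow)]
             for row, grow in zip(W_hid, grad_hid)]
    w_out = [w - beta * g for w, g in zip(w_out, grad_out)]
    return W_hid, w_out
```

Note how `delta_h` reuses `delta_out`: this is the sense in which the hidden-to-output gradient has to be known before the input-to-hidden gradient can be computed.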
online learning vs. batch learning • In online learning the weights are changed after each presentation of a training pattern. • Weights may oscillate. • Suitable for online learning. • In batch learning, the total gradient for the whole epoch is the sum of the gradients of the n patterns. • Batch learning improves stability by averaging. • Another averaging approach that provides stability is momentum.
This method basically tags the average of the past weight changes onto the new weight increment at every weight change, thereby smoothing out the net weight change. • Momentum μ is between 0 and 1. • It indicates the relative importance of the past weight change $\Delta w_{m-1}$ on the new weight increment $\Delta w_m$. • Thus, the current gradient and the past weight change together decide how large the new weight increment will be.
For example, if μ is equal to 0, momentum does not apply at all, and the past history has no place. • If μ is equal to 1, the current change is totally based on the past change. • Values of μ between 0 and 1 result in a combined response to weight change.
The equation is recursive, so the influence of the past weight change incorporates that of all previous weight changes as well. • Momentum can be used with both batch and online learning. • In batch learning, it can provide further stability to the gradient descent. • Momentum can be especially useful in online learning to minimize oscillations in error after the presentation of each pattern.
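The momentum equation itself is not reproduced in this transcript. The sketch below assumes the form Δw_m = μ Δw_{m−1} − (1 − μ) β g_m, where g_m is the current gradient. This form is chosen because it matches the description: μ = 0 leaves a plain gradient step and μ = 1 repeats the past change. Another common variant drops the (1 − μ) factor. The function name and the toy error function are illustrative.

```python
def momentum_step(w, grad, prev_dw, beta=0.1, mu=0.9):
    """Return (new_w, dw) for one weight, using
    dw_m = mu * dw_{m-1} - (1 - mu) * beta * grad_m."""
    dw = mu * prev_dw - (1.0 - mu) * beta * grad
    return w + dw, dw


# Minimise E(w) = w**2 (gradient 2*w) from w = 1.0 as a toy illustration.
w, dw = 1.0, 0.0
for _ in range(200):
    w, dw = momentum_step(w, 2.0 * w, dw)
```

Because each `dw` is built from the previous one, every earlier weight change still contributes, with a weight that shrinks by a factor μ per step.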
Delta-Bar-Delta • In backpropagation the same learning rate β applies to all of the weights. • More flexibility could be achieved if each weight is adjusted independently. • This method is called delta-bar-delta (TurboProp). • Each weight has its own learning rate, adjusted as follows: • If the direction in which the error decreases at the current point is the same as the direction in which the error has been decreasing recently, the learning rate is increased. • If the opposite is true, the learning rate is decreased.
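A sketch of one such per-weight scheme. The slides give only the qualitative rule, so the details are assumptions: the "recent direction" is tracked as an exponential average of past gradients, the learning rate is increased by adding a constant and decreased by multiplying with a factor below 1. The names `kappa`, `phi` and `theta` and their default values are hypothetical.

```python
def delta_bar_delta_step(w, beta, bar, grad,
                         kappa=0.01, phi=0.5, theta=0.7):
    """One update of all weights, each with its own learning rate.

    w, beta, bar, grad -- lists of weights, per-weight learning rates,
    averaged past gradients and current gradients (same length).
    Returns the new (w, beta, bar).
    """
    new_w, new_beta, new_bar = [], [], []
    for wi, bi, di_bar, gi in zip(w, beta, bar, grad):
        if di_bar * gi > 0:        # same direction as recently: speed up
            bi = bi + kappa
        elif di_bar * gi < 0:      # direction flipped: slow down
            bi = bi * phi
        new_beta.append(bi)
        new_w.append(wi - bi * gi)                          # per-weight gradient step
        new_bar.append((1 - theta) * gi + theta * di_bar)   # running average
    return new_w, new_beta, new_bar
```

The additive increase and multiplicative decrease make the rate grow cautiously and shrink quickly, which limits oscillation when the gradient changes sign.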
Second order methods • Surface curvature can be used to guide the error down the error surface more efficiently.
grad is a vector pointing in the direction of the greatest rate of increase of the function. How fast does this rate of increase change in a small neighbourhood? This is given by the derivative of the gradient, the derivative of a derivative, i.e. the second derivative. The second derivatives with respect to all pairs of weights form the Hessian matrix.
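Written out, with E denoting the error and w_1, …, w_m the weights (this notation is assumed here, it is not shown on the slide), the gradient and the Hessian are:

```latex
\nabla E = \left( \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_m} \right)^{\top},
\qquad
H_{ij} = \frac{\partial^2 E}{\partial w_i \, \partial w_j}, \quad i, j = 1, \dots, m .
```

The Hessian therefore has m × m entries, which is why its storage and inversion become costly as the number of weights grows.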
Common methods using the Hessian • QuickProp • Gauss-Newton • Levenberg-Marquardt (LM) • These methods are an order of magnitude faster (i.e. they reach minima in far fewer epochs) than first-order methods (i.e. gradient based). • However, the efficiency is gained at a considerable computational cost. • Computing and inverting the Hessian for large networks with a large number of training patterns is expensive (large storage requirements) and slow.
Bias-variance • Just a small reminder • bias (lack of fit, underfitting) – the model does not fit the data well enough, it is not flexible enough (too few parameters) • variance (overfitting) – the model is too flexible (too many parameters), it fits noise • bias-variance tradeoff – improving the generalization ability of the model (i.e. finding the correct amount of flexibility)
Parameters in MLP: weights • If you use one more hidden neuron, the number of weights increases by how much? • # input neurons + # output neurons • If an MLP is used for a regression task, be careful! • To use an MLP in a statistically correct way, the number of degrees of freedom (i.e. weights) can't exceed the number of data points. • Compare to the polynomial regression example from the 2nd lecture
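The count can be checked with a few lines of code. The sketch assumes a fully connected MLP with a single hidden layer; `count_weights` is an illustrative name. The slide's answer ignores bias weights, so a `bias` switch is included: with biases, each extra hidden neuron costs one more weight.

```python
def count_weights(n_in, n_hidden, n_out, bias=True):
    """Number of weights in a fully connected MLP with one hidden layer."""
    extra = 1 if bias else 0
    return (n_in + extra) * n_hidden + (n_hidden + extra) * n_out


# Adding one hidden neuron to a 3-4-2 network:
added_no_bias = count_weights(3, 5, 2, bias=False) - count_weights(3, 4, 2, bias=False)
added_with_bias = count_weights(3, 5, 2) - count_weights(3, 4, 2)
```

Comparing `count_weights(...)` with the number of training points is a quick way to apply the degrees-of-freedom check before fitting a regression.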
Improving generalization of MLP • Flexibility comes from hidden neurons. • Choose the # of hidden neurons so that neither underfitting nor overfitting occurs. • Three most common approaches: • exhaustive search • early stopping • regularization
Exhaustive search • Increase the number of hidden units and monitor the performance on the validation data set. [Figure: x-axis: number of neurons]
Early stopping • a fixed and large number of neurons is used • the network is trained while its performance is tested on a validation set at regular intervals • the minimum of the validation error gives the correct weights [Figure: x-axis: epochs]
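The stopping decision can be sketched independently of the network. The function below takes the validation errors recorded at the regular checks and returns the index of the check whose weights should be kept. The `patience` parameter (how many non-improving checks to tolerate before stopping) is an assumption that the slide does not mention.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Index of the check with the lowest validation error seen before
    `patience` checks in a row fail to improve on it."""
    best_epoch, best_err, waited = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break                      # validation error keeps rising: stop
    return best_epoch


val_errors = [0.90, 0.70, 0.50, 0.45, 0.47, 0.50, 0.56, 0.40]
stop_at = early_stopping_epoch(val_errors)
```

In a real training loop the weights would be saved whenever `best_err` improves, so that the network from `best_epoch` can be restored after stopping.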
Regularization • Who remembers from the polynomial example what regularization is? • In NNs it is called weight decay. • Idea: keep the growth of weights to a minimum in such a way that non-important weights are pulled toward zero • Only the important weights are allowed to grow, others are forced to decay
This is achieved not by minimizing MSE, but by minimizing $\mathrm{MSE} + \delta \sum_{j=1}^{m} w_j^2$ • second term – regularization term • m – number of weights in the network • δ – regularization parameter • the larger the δ, the more important the regularization
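A minimal sketch of the regularized error, assuming the quantity being minimized is MSE plus δ times the sum of the m squared weights, the usual weight-decay form. The function names and the default value of `delta` are illustrative.

```python
def regularized_error(targets, outputs, weights, delta=0.01):
    """MSE plus delta times the sum of squared weights."""
    mse = sum((t - y) ** 2 for t, y in zip(targets, outputs)) / len(targets)
    penalty = delta * sum(w ** 2 for w in weights)
    return mse + penalty


def decay_gradient(weights, delta=0.01):
    """Gradient of the regularization term: 2 * delta * w_j for each weight."""
    return [2.0 * delta * w for w in weights]
```

The gradient of the penalty is proportional to the weight itself, so every update pulls each weight toward zero unless the MSE gradient pushes back. That is the decay the name refers to.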
Network pruning • Both early stopping and weight decay use all weights in the NN. They do not reduce the complexity of the model. • Network pruning – reduce complexity by keeping only essential weights/neurons. • Several pruning approaches, e.g. • optimal brain damage (OBD) • optimal brain surgeon (OBS) • optimal cell damage (OCD)
OBD • Based on sensitivity analysis • systematically change parameters in a model to determine the effects of such changes • Weights that are not important for input-output mapping are removed. • The importance (saliency) of the weight is measured based on the cost of setting a weight to zero.
The saliency can be computed from the Hessian. • The Hessian is nonlocal, i.e. it uses the derivatives with respect to all pairs of weights. Computationally costly for large networks. • Local approximation of the Hessian – use only the diagonal entries (i.e. for each weight, ignore all second derivatives that involve any other weight) • This amounts to assuming that the weights of the network are independent
Saliency $s_i$ of weight $w_i$ is defined as $s_i = \frac{1}{2} H_{ii} w_i^2$ • $H_{ii}$ (diagonal entry of the Hessian) indicates the acceleration of the error with respect to a small perturbation of the weight $w_i$. • By multiplying $H_{ii}$ by $w_i^2$, an indication of the total effect of $w_i$ on the error is obtained. • The larger the $s_i$, the larger the influence of $w_i$ on the error.
How to perform OBD? • Train a flexible network in the normal way (i.e. use early stopping, weight decay, …) • Compute the saliency of each weight. Remove the weights with small saliencies. • Train the reduced network again with the kept weights. Initialize the training with the values obtained in the previous step. • Repeat from step 1.
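The saliency and removal step can be sketched as follows, assuming the diagonal approximation with saliency H_ii · w_i^2 / 2 and assuming the diagonal Hessian entries are already available (computing them is not shown). The `fraction` of weights removed per round and the function names are hypothetical choices. The slides only say that weights with small saliencies are removed.

```python
def saliencies(weights, hessian_diag):
    """s_i = H_ii * w_i**2 / 2 for every weight."""
    return [0.5 * h * w ** 2 for w, h in zip(weights, hessian_diag)]


def prune(weights, hessian_diag, fraction=0.25):
    """Zero out the given fraction of weights with the smallest saliency.

    Returns the pruned weights and a mask (True = weight kept).
    """
    s = saliencies(weights, hessian_diag)
    n_remove = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: s[i])
    removed = set(order[:n_remove])
    mask = [i not in removed for i in range(len(weights))]
    pruned = [w if keep else 0.0 for w, keep in zip(weights, mask)]
    return pruned, mask


weights = [0.5, -2.0, 0.1, 1.0]
hessian_diag = [1.0, 0.5, 4.0, 0.2]
pruned, mask = prune(weights, hessian_diag, fraction=0.5)
```

During retraining the mask would be used to hold removed weights at zero, while the kept weights start from their current values.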