Last lecture summary

biologically motivated • synapses • Neuron accumulates (Σ) positive/negative stimuli from other neurons. • Then Σ is processed further – f(Σ)– to produce an output, i.e. neuron sends an output signal to neurons connected to it.

Neural networks for applied science and engineering, Samarasinghe

x – inputs • w – weights • f(Σ) – activation (transfer) function • y – output • threshold neuron (McCulloch-Pitts) • only binary inputs and output • the weights are pre-set, no learning • set the threshold so that the classification is correct

Heaviside (threshold) activation function

Threshold w0 is incorporated as a weight of one additional input with input value x0 = 1.0. • Such input is called bias.
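A minimal sketch of the threshold neuron with the bias input x0 = 1.0, assuming the convention that the output is 1 only when the weighted sum is strictly positive. The function names and the hand-set weights for AND and OR are illustrative choices, not taken from the slides.

```python
def heaviside(s):
    """Threshold activation: 1 if the net input is positive, otherwise 0."""
    return 1 if s > 0 else 0


def threshold_neuron(x, w, w0):
    """x: inputs, w: weights, w0: weight of the bias input x0 = 1.0."""
    s = w0 * 1.0 + sum(wi * xi for wi, xi in zip(w, x))
    return heaviside(s)


def AND(a, b):
    # Pre-set weights, no learning: fires only when both inputs are 1.
    return threshold_neuron([a, b], [1.0, 1.0], -1.5)


def OR(a, b):
    # Same weights, lower threshold: fires when at least one input is 1.
    return threshold_neuron([a, b], [1.0, 1.0], -0.5)
```

Only the bias weight differs between the two functions, which is the sense in which the threshold is "set so that the classification is correct".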

Perceptron • binary classifier, maps its input x (real-valued vector) to f(x) – a binary value (0 or 1) • f(x) = 1 if w∙x > 0 (bias included), 0 otherwise • perceptron can adjust its weights (i.e. can learn) – perceptron learning algorithm
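The perceptron learning algorithm is only named here, so the following is a sketch of its usual textbook form: after each misclassified pattern, every weight moves by β · (target − output) · input. The names `predict` and `train_perceptron`, the zero initial weights and the stopping rule are assumptions made for illustration.

```python
def predict(w, x):
    """w[0] is the bias weight (input x0 = 1.0); returns 0 or 1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else 0


def train_perceptron(samples, beta=1.0, max_epochs=100):
    """samples: list of (inputs, target) pairs with binary targets."""
    w = [0.0] * (len(samples[0][0]) + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            err = t - predict(w, x)          # -1, 0 or +1
            if err != 0:
                errors += 1
                w[0] += beta * err           # bias input is always 1.0
                for i, xi in enumerate(x):
                    w[i + 1] += beta * err * xi
        if errors == 0:                      # a full epoch without mistakes
            break
    return w


OR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
AND_DATA = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w_or = train_perceptron(OR_DATA)
w_and = train_perceptron(AND_DATA)
```

Both data sets are linearly separable, so the loop stops after a handful of epochs.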

Multiple output perceptron • for multicategory (i.e. more than 2 classes) classification • one output neuron for each class • [Figure: input layer connected to output layer]

Learning • Learning means there exists an algorithm for setting the neuron’s weights (the threshold w0 is also set). • delta rule – gradient descent • β – learning rate

iterative algorithm, one pass through the whole training set (epoch) is not enough • online learning • adjust weights after each input pattern presentation • weight oscillation may occur • batch learning • obtain the error gradient for each input pattern, average them at the end of the epoch

Supervised learning using delta rule • Transmit an input pattern through connections whose weights are initially set to random values. • The weighted inputs are summed, the output is produced, and is compared with the given target output to determine error for this pattern. • Inputs and target outputs are presented repeatedly, and the weights are adjusted using the delta rule at each iteration or after an epoch until the minimum possible square error is achieved. • This usually involves the iterative presentation of the entire training dataset many times.
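The delta rule and the two update schedules can be sketched for a single linear neuron y = w0 + w1·x. This assumes the squared-error form of the rule, Δw = β · (target − output) · input; the function name `train_delta` and the toy data are hypothetical.

```python
def train_delta(samples, beta=0.1, epochs=1000, batch=False):
    """Fit y = w0 + w1 * x with the delta rule (online or batch)."""
    w0, w1 = 0.0, 0.0
    for _ in range(epochs):
        sum0 = sum1 = 0.0
        for x, t in samples:
            err = t - (w0 + w1 * x)
            if batch:
                # Batch: only accumulate, update once per epoch.
                sum0 += err
                sum1 += err * x
            else:
                # Online: update right after each pattern.
                w0 += beta * err
                w1 += beta * err * x
        if batch:
            w0 += beta * sum0 / len(samples)
            w1 += beta * sum1 / len(samples)
    return w0, w1


# Noise-free toy data generated from y = 2x + 1.
samples = [(x, 2.0 * x + 1.0) for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
online_w = train_delta(samples, batch=False)
batch_w = train_delta(samples, batch=True)
```

Both schedules need many epochs to approach w0 = 1 and w1 = 2, which illustrates why one pass through the training set is not enough.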

New stuff: finishing perceptron

Perceptron failure • Please help me draw the following functions on the blackboard: • AND, OR, XOR (eXclusive OR, true when exactly one of the operands is true, otherwise false) • [Figure: AND, OR and XOR plotted on 0/1 axes; the XOR panel is marked "???"]

Play with http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html

The perceptron's decision boundary is linear (a hyperplane), so only linearly separable problems can be solved. • 1969 – the famous book “Perceptrons” by Marvin Minsky and Seymour Papert showed that it is impossible for this class of networks to learn the XOR function. • They conjectured (incorrectly!) that a similar result would hold for a perceptron with three or more layers. • The often-cited Minsky/Papert text caused a significant decline in interest in, and funding of, neural network research. It took ten more years until neural network research experienced a resurgence in the 1980s.
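A small experiment makes the XOR failure concrete. The sketch below tries every weight triple (w0, w1, w2) on a coarse grid and reports the first one that classifies a truth table correctly. The grid and the function names are arbitrary choices for illustration, and a grid search is not a proof. The proof is short, though: XOR would need w0 ≤ 0, w0 + w1 > 0, w0 + w2 > 0 and w0 + w1 + w2 ≤ 0, and adding the two strict inequalities contradicts the other two.

```python
from itertools import product


def classify(w0, w1, w2, x1, x2):
    """Single threshold unit with bias weight w0."""
    return 1 if w0 + w1 * x1 + w2 * x2 > 0 else 0


def find_weights(truth_table, grid):
    """Return the first (w0, w1, w2) on the grid that fits, else None."""
    for w0, w1, w2 in product(grid, repeat=3):
        if all(classify(w0, w1, w2, x1, x2) == t
               for (x1, x2), t in truth_table.items()):
            return (w0, w1, w2)
    return None


GRID = [k / 2 for k in range(-8, 9)]  # -4.0, -3.5, ..., 4.0
AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
OR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

and_weights = find_weights(AND_TABLE, GRID)
or_weights = find_weights(OR_TABLE, GRID)
xor_weights = find_weights(XOR_TABLE, GRID)
```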

Play with http://www.eee.metu.edu.tr/~halici/courses/543java/NNOC/Perceptron.html

Multilayer perceptron

Nonlinear activation functions • So far we met threshold and linear activation functions. • Both give linear decision boundaries, and consequently the solved problems must also be linear (linearly separable). • The nonlinearity is introduced by using nonlinear activation functions.

logistic (sigmoid, unipolar) tanh (bipolar)
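As a quick reference, the two functions and their derivatives (which backpropagation needs) can be written as below. The function names are illustrative. The last helper shows a standard identity: tanh is a rescaled logistic function.

```python
import math


def logistic(s):
    """Unipolar sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-s))


def logistic_derivative(s):
    y = logistic(s)
    return y * (1.0 - y)


def tanh_derivative(s):
    """Derivative of the bipolar tanh, whose output lies in (-1, 1)."""
    return 1.0 - math.tanh(s) ** 2


def tanh_from_logistic(s):
    # Identity: tanh(s) = 2 * logistic(2s) - 1
    return 2.0 * logistic(2.0 * s) - 1.0
```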

Multilayer perceptron • MLP, the most famous type of neural network • [Figure: input layer, hidden layer, output layer]

three-layer vs. two-layer • [Figure: input layer, hidden layer, output layer]

Backpropagation training algorithm • How to train MLP? • Gradient descent type of algorithm called backpropagation. • MLP works in two passes: • forward pass • present a training sample to the neural network • compare the network's output to the desired output from that sample • calculate the error in each output neuron

backward pass • compute the amount ∆w by which the weights should be updated • first calculate the gradient for the hidden-to-output weights • then calculate the gradient for the input-to-hidden weights • the hidden-to-output gradient is needed to calculate the input-to-hidden gradient • update the weights in the network • It is a gradient descent method • learning rate β is used • can get trapped in local minima

input signal propagates forward • error propagates backward
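The two passes can be sketched for a small MLP with one hidden layer and one output neuron. This is a minimal illustration under stated assumptions, not the lecture's implementation: logistic activations everywhere, error E = ½(t − y)², online updates, and hypothetical names (`init_weights`, `forward`, `gradients`, `train`).

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def init_weights(n_in, n_hidden, seed=0):
    rng = random.Random(seed)
    # W1[j] = [bias, w_j1, ..., w_jn] for hidden neuron j; W2 = [bias, v_1, ...]
    W1 = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W2 = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return W1, W2


def forward(W1, W2, x):
    """Forward pass: returns hidden activations and the network output."""
    h = [sigmoid(wj[0] + sum(w * xi for w, xi in zip(wj[1:], x))) for wj in W1]
    y = sigmoid(W2[0] + sum(v * hj for v, hj in zip(W2[1:], h)))
    return h, y


def loss(W1, W2, data):
    return 0.5 * sum((t - forward(W1, W2, x)[1]) ** 2 for x, t in data)


def gradients(W1, W2, x, t):
    """Backward pass: hidden-to-output gradient first, then input-to-hidden."""
    h, y = forward(W1, W2, x)
    delta_out = (y - t) * y * (1.0 - y)
    gW2 = [delta_out] + [delta_out * hj for hj in h]
    gW1 = []
    for j, hj in enumerate(h):
        # The output delta is reused here: the error propagates backward.
        delta_h = delta_out * W2[j + 1] * hj * (1.0 - hj)
        gW1.append([delta_h] + [delta_h * xi for xi in x])
    return gW1, gW2


def train(W1, W2, data, beta=0.5, epochs=2000):
    for _ in range(epochs):
        for x, t in data:  # online learning: update after every pattern
            gW1, gW2 = gradients(W1, W2, x, t)
            W2 = [w - beta * g for w, g in zip(W2, gW2)]
            W1 = [[w - beta * g for w, g in zip(wj, gj)]
                  for wj, gj in zip(W1, gW1)]
    return W1, W2


XOR_DATA = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
            ([1.0, 0.0], 1.0), ([1.0, 1.0], 0.0)]
```

Calling `train(*init_weights(2, 3), XOR_DATA)` returns updated weights. Whether XOR is actually learned depends on the random start, since gradient descent can get trapped in local minima.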

online learning vs. batch learning • In online learning the weights are changed after each presentation of a training pattern. • Weights may oscillate. • Suitable when training patterns arrive one at a time. • In batch learning, the total gradient for the whole epoch is represented as the sum of the gradients for each of the n patterns. • Batch learning improves the stability by averaging. • Another averaging approach providing stability is using the momentum.

This method basically tags the average of the past weight changes onto the new weight increment at every weight change, thereby smoothing out the net weight change. • Momentum μ is between 0 and 1. • It indicates the relative importance of the past weight change ∆w(m-1) on the new weight increment ∆w(m). • Thus, the current gradient and the past weight change together decide how much the new weight increment will be.

For example, if μ is equal to 0, momentum does not apply at all, and the past history has no place. • If μ is equal to 1, the current change is totally based on the past change. • Values of μ between 0 and 1 result in a combined response to weight change.

The equation is recursive, so the influence of the past weight change incorporates that of all previous weight changes as well. • Momentum can be used with both batch and online learning. • In batch learning, it can provide further stability to the gradient descent. • Momentum can be especially useful in online learning to minimize oscillations in error after the presentation of each pattern.
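The momentum equation itself did not survive on these slides. One form that matches the description (μ = 0 gives plain gradient descent, μ = 1 repeats the past change) is ∆w(m) = μ · ∆w(m-1) − (1 − μ) · β · gradient. The sketch below assumes that form; other texts omit the (1 − μ) factor, and the function name is hypothetical.

```python
def momentum_step(w, grad, prev_dw, beta=0.1, mu=0.9):
    """One weight update with momentum (assumed form, see the note above).

    Returns the new weight and the weight change, which must be passed
    back in as prev_dw on the next call. That is what makes it recursive.
    """
    dw = mu * prev_dw - (1.0 - mu) * beta * grad
    return w + dw, dw


def minimize_quadratic(w, steps, beta=0.1, mu=0.9):
    """Toy usage on E(w) = 0.5 * w**2, whose gradient is simply w."""
    dw = 0.0
    for _ in range(steps):
        w, dw = momentum_step(w, w, dw, beta, mu)
    return w
```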

Delta-Bar-Delta • In backpropagation the same learning rate β applies to all of the weights. • More flexibility could be achieved if each weight is adjusted independently. • This method is called delta-bar-delta (TurboProp). • Each weight has its own learning rate, adjusted as follows: • if the direction in which the error decreases at the current point is the same as the direction in which the error has been decreasing recently, then the learning rate is increased • if the opposite is true, the learning rate is decreased
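The learning-rate adjustment can be sketched for a single weight. This follows the commonly cited delta-bar-delta scheme: the "recent direction" is an exponential average of past gradients, increases are additive and decreases are multiplicative. The parameter names `kappa`, `phi`, `theta` and their default values are assumptions for illustration.

```python
def delta_bar_delta(beta, grad, grad_bar, kappa=0.01, phi=0.5, theta=0.7):
    """Adapt one weight's learning rate.

    beta     : current learning rate of this weight
    grad     : current error gradient with respect to this weight
    grad_bar : exponential average of this weight's recent gradients
    Returns the new learning rate and the updated average.
    """
    if grad * grad_bar > 0:
        beta = beta + kappa            # same direction as recently: speed up
    elif grad * grad_bar < 0:
        beta = beta * (1.0 - phi)      # direction flipped: slow down
    new_bar = (1.0 - theta) * grad + theta * grad_bar
    return beta, new_bar
```

In a full network each weight would carry its own `beta` and `grad_bar`, updated once per epoch or per pattern.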

Second order methods • Surface curvature can be used to guide the error down the error surface more efficiently.

grad is a vector pointing in the direction of the greatest rate of increase of the function. How fast does this rate of increase change in a small neighbourhood? This is given by the derivative of the gradient (the derivative of a derivative), i.e. the second derivative. The second derivatives with respect to all pairs of weights form the Hessian matrix.

Common methods using the Hessian • QuickProp • Gauss-Newton • Levenberg-Marquardt (LM) • These methods are an order of magnitude faster (i.e. they reach minima in far fewer epochs) than first order methods (i.e. gradient based). • However, the efficiency is gained at a considerable computational cost. • Computing and inverting the Hessian for large networks with a large number of training patterns is expensive (large storage requirements) and slow.
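To see why curvature helps, consider a plain Newton step, w ← w − H⁻¹ · grad, on a two-weight quadratic error surface E(w) = ½ wᵀAw − bᵀw, where the Hessian is simply A. This is an illustrative stand-in, not QuickProp, Gauss-Newton or LM themselves. The matrix, the vector and the function names are made up for the example.

```python
def gradient(A, b, w):
    """Gradient of E(w) = 0.5 * w^T A w - b^T w for two weights."""
    return [A[0][0] * w[0] + A[0][1] * w[1] - b[0],
            A[1][0] * w[0] + A[1][1] * w[1] - b[1]]


def newton_step(A, b, w):
    """Second-order step: multiply the gradient by the inverse Hessian."""
    g = gradient(A, b, w)
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det, A[0][0] / det]]
    return [w[0] - (inv[0][0] * g[0] + inv[0][1] * g[1]),
            w[1] - (inv[1][0] * g[0] + inv[1][1] * g[1])]


def gradient_step(A, b, w, beta=0.1):
    """First-order step for comparison."""
    g = gradient(A, b, w)
    return [w[0] - beta * g[0], w[1] - beta * g[1]]


A = [[4.0, 1.0], [1.0, 3.0]]   # Hessian of the toy error surface
b = [1.0, 2.0]
w_newton = newton_step(A, b, [0.0, 0.0])
w_gd = gradient_step(A, b, [0.0, 0.0])
```

On a quadratic surface the Newton step lands on the minimum at once, while the gradient step has only moved part of the way. The price is forming and inverting the Hessian, which is trivial for two weights and costly for a large network.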

Bias-variance • Just a small reminder • bias (lack of fit, underfitting) – model does not fit the data well enough, is not flexible enough (too few parameters) • variance (overfitting) – model is too flexible (too many parameters), fits noise • bias-variance tradeoff – improving the generalization ability of the model (i.e. find the correct amount of flexibility)

Parameters in MLP: weights • If you use one more hidden neuron, the number of weights increases by how much? • # input neurons + # output neurons • If MLP is used for a regression task, be careful! • To use MLP statistically correctly, the number of degrees of freedom (i.e. weights) can’t exceed the number of data points. • Compare to the polynomial regression example from the 2nd lecture
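The count can be checked with a few lines for an MLP with one hidden layer. The helper name is hypothetical. Note that if every neuron also has a bias weight, as introduced earlier, each extra hidden neuron adds one more weight than the slide's answer.

```python
def mlp_weight_count(n_in, n_hidden, n_out, bias=True):
    """Number of weights in a fully connected MLP with one hidden layer."""
    extra = 1 if bias else 0
    return (n_in + extra) * n_hidden + (n_hidden + extra) * n_out


def added_by_one_hidden_neuron(n_in, n_hidden, n_out, bias=True):
    """How many weights one additional hidden neuron brings."""
    return (mlp_weight_count(n_in, n_hidden + 1, n_out, bias)
            - mlp_weight_count(n_in, n_hidden, n_out, bias))
```

For example, with 3 inputs and 2 outputs an extra hidden neuron adds 3 + 2 = 5 weights without biases and 6 with them.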

Improving generalization of MLP • Flexibility comes from hidden neurons. • Choose the # of hidden neurons so that neither underfitting nor overfitting occurs. • Three most common approaches: • exhaustive search • early stopping • regularization

Exhaustive search • Increase the number of hidden units and monitor the performance on the validation data set. • [Figure: x-axis: number of neurons]

Early stopping • a fixed and large number of neurons is used • the network is trained while testing its performance on a validation set at regular intervals • the weights at the minimum of the validation error are the correct ones to keep • [Figure: x-axis: epochs]
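The early stopping loop can be sketched independently of any particular network. The callables `step` (one training epoch, returning new weights) and `val_error` (validation error for given weights) are hypothetical placeholders, and the `patience` rule is one common way to decide that the minimum has been passed.

```python
def train_with_early_stopping(step, val_error, weights,
                              max_epochs=200, patience=10):
    """Keep the weights that gave the lowest validation error."""
    best_err = val_error(weights)
    best_weights, best_epoch = weights, 0
    for epoch in range(1, max_epochs + 1):
        weights = step(weights)              # one more epoch of training
        err = val_error(weights)
        if err < best_err:
            best_err, best_weights, best_epoch = err, weights, epoch
        elif epoch - best_epoch >= patience:
            break                            # no improvement for a while
    return best_weights, best_err, best_epoch


# Toy stand-ins: "training" keeps increasing a single weight, while the
# validation error is lowest at w = 10 and rises again afterwards.
def toy_step(w):
    return w + 1


def toy_val_error(w):
    return (w - 10) ** 2


best_w, best_err, best_epoch = train_with_early_stopping(
    toy_step, toy_val_error, 0)
```

In the toy run training continues past epoch 10, but the weights from epoch 10 are the ones returned.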

Weight decay • Idea: keep the growth of weights to a minimum in such a way that non-important weights are pulled toward zero • Only the important weights are allowed to grow, others are forced to decay • regularization

This is achieved not by minimizing MSE, but by minimizing • second term – regularization term • m – number of weights in the network • δ – regularization parameter • the larger the δ, the more important the regularization
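The minimized expression is missing from this slide. A common form that fits the listed symbols is W = MSE + δ · Σ w_j² with the sum running over all m weights. The sketch below assumes exactly that form, so treat it as an illustration rather than the slide's formula. The function names are hypothetical.

```python
def regularized_error(mse, weights, delta):
    """Assumed criterion: MSE plus delta times the sum of squared weights."""
    return mse + delta * sum(w * w for w in weights)


def regularized_gradient(grad_mse, weights, delta):
    """Gradient of the criterion: the penalty adds 2 * delta * w per weight."""
    return [g + 2.0 * delta * w for g, w in zip(grad_mse, weights)]


def decay_only_step(weights, delta, beta=0.1):
    """Gradient step when the MSE gradient is zero: weights shrink to zero."""
    grad = regularized_gradient([0.0] * len(weights), weights, delta)
    return [w - beta * g for w, g in zip(weights, grad)]
```

A weight that the data do not "defend" through the MSE gradient is multiplied by (1 − 2βδ) at every step, which is the decay that gives the method its name.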

Network pruning • Both early stopping and weight decay use all weights in the NN. They do not reduce the complexity of the model. • Network pruning – reduce complexity by keeping only essential weights/neurons. • Several pruning approaches, e.g. • optimal brain damage (OBD) • optimal brain surgeon (OBS) • optimal cell damage (OCD)

OBD • Based on sensitivity analysis • systematically change parameters in a model to determine the effects of such changes • Weights that are not important for input-output mapping are removed. • The importance (saliency) of the weight is measured based on the cost of setting a weight to zero.

How to perform OBD? • Train a flexible network in the normal way (i.e. use early stopping, weight decay, …) • Compute the saliency of each weight. Remove the weights with small saliencies. • Train the reduced network again with the kept weights. Initialize the training with the values obtained in the previous step. • Repeat from the first step.
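The saliency and pruning steps can be sketched as follows. This assumes the usual OBD approximation, in which the cost of zeroing weight w_k is estimated as ½ · h_kk · w_k², with h_kk the diagonal Hessian entry for that weight. How the Hessian diagonal is obtained is left out, and the function names are hypothetical.

```python
def obd_saliencies(weights, hessian_diag):
    """Estimated increase in error if each weight were set to zero."""
    return [0.5 * h * w * w for w, h in zip(weights, hessian_diag)]


def prune_smallest(weights, hessian_diag, n_prune):
    """Zero the n_prune weights with the smallest saliency.

    Returns the pruned weights and a keep-mask (True = weight kept),
    which retraining would use to hold removed weights at zero.
    """
    s = obd_saliencies(weights, hessian_diag)
    order = sorted(range(len(weights)), key=lambda k: s[k])
    removed = set(order[:n_prune])
    pruned = [0.0 if k in removed else w for k, w in enumerate(weights)]
    mask = [k not in removed for k in range(len(weights))]
    return pruned, mask
```

Note that a large weight can still have low saliency when the error surface is flat along it, which is why the ranking uses curvature and not weight magnitude alone.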
