
Ch 5. Neural Networks (1/2) Pattern Recognition and Machine Learning, C. M. Bishop, 2006.




  1. Ch 5. Neural Networks (1/2) Pattern Recognition and Machine Learning, C. M. Bishop, 2006. Summarized by M.-O. Heo, Biointelligence Laboratory, Seoul National University, http://bi.snu.ac.kr/

  2. Contents • 5.1 Feed-forward Network Functions • 5.1.1 Weight-space symmetries • 5.2 Network Training • 5.2.1 Parameter optimization • 5.2.2 Local quadratic approximation • 5.2.3 Use of gradient information • 5.2.4 Gradient descent optimization • 5.3 Error Backpropagation • 5.3.1 Evaluation of error-function derivatives • 5.3.2 A simple example • 5.3.3 Efficiency of backpropagation • 5.3.4 The Jacobian matrix • 5.4 The Hessian Matrix • 5.4.1 Diagonal approximation • 5.4.2 Outer product approximation • 5.4.3 Inverse Hessian • 5.4.4 Finite differences • 5.4.5 Exact evaluation of the Hessian • 5.4.6 Fast multiplication by the Hessian (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  3. Feed-forward Network Functions • Goal: to extend the linear model by making the basis functions depend on parameters, and to allow these parameters to be adjusted during training. • First layer (weights, biases, activations): a_j = Σ_i w_ji^(1) x_i + w_j0^(1) • Each activation is transformed by a differentiable nonlinear activation function h(·): z_j = h(a_j) • Output units: y_k = σ(Σ_j w_kj^(2) z_j + w_k0^(2)) • Evaluating these equations in sequence is the forward propagation of information through the network. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
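A minimal numpy sketch of this forward pass, assuming tanh hidden units and linear (identity) output units; the function and variable names are illustrative, not from the slides:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Forward propagation through a two-layer network.

    x  : (D,) input vector
    W1 : (M, D) first-layer weights, b1 : (M,) first-layer biases
    W2 : (K, M) second-layer weights, b2 : (K,) second-layer biases
    Returns hidden activations a, hidden outputs z, and network outputs y.
    """
    a = W1 @ x + b1      # first-layer activations a_j
    z = np.tanh(a)       # hidden-unit outputs z_j = h(a_j)
    y = W2 @ z + b2      # linear output units (identity sigma, as in regression)
    return a, z, y
```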

  4. Feed-forward Network Functions (Cont’d) • A key difference between the perceptron and the multilayer perceptron (MLP) • MLP: continuous sigmoidal nonlinearities in the hidden units • Perceptron: step-function nonlinearities • Skip-layer connections • A network with sigmoidal hidden units can always mimic skip-layer connections by using a sufficiently small first-layer weight, so that the hidden unit is effectively linear over its operating range, and then compensating with a large weight value from the hidden unit to the output. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  5. Feed-forward Network Functions (Cont’d) • The approximation properties of feed-forward networks • Universal approximator: a two-layer network with linear outputs can uniformly approximate any continuous function on a compact input domain to arbitrary accuracy, provided the network has a sufficiently large number of hidden units. • Ex) N = 50 data points, a two-layer network with three tanh hidden units and a linear output. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
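A sketch of this kind of fit, assuming the target function f(x) = x**2 and plain batch gradient descent (both choices are mine; the slide only specifies N = 50 points, three tanh hidden units and a linear output, and the training loop here is deliberately simple and untuned):

```python
import numpy as np

# Fit N = 50 samples of f(x) = x**2 with a two-layer network
# (3 tanh hidden units, one linear output), trained by batch gradient descent.
rng = np.random.default_rng(0)
N, M = 50, 3
x = np.linspace(-1.0, 1.0, N)                   # inputs on a compact interval
t = x ** 2                                      # target function values
W1, b1 = rng.normal(size=(M, 1)), np.zeros(M)
W2, b2 = rng.normal(size=(1, M)), np.zeros(1)

eta = 0.05
for _ in range(20000):
    a = x[:, None] @ W1.T + b1                  # (N, M) first-layer activations
    z = np.tanh(a)                              # hidden outputs
    y = z @ W2.T + b2                           # (N, 1) linear outputs
    err = y[:, 0] - t                           # y_n - t_n
    # Gradients of the sum-of-squares error over the whole batch
    gW2 = err[None, :] @ z                      # (1, M)
    gb2 = np.array([err.sum()])
    delta1 = (err[:, None] * W2) * (1 - z ** 2) # (N, M) hidden deltas
    gW1 = delta1.T @ x[:, None]                 # (M, 1)
    gb1 = delta1.sum(axis=0)
    W1 -= eta / N * gW1
    b1 -= eta / N * gb1
    W2 -= eta / N * gW2
    b2 -= eta / N * gb2
```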

  6. Feed-forward Network Functions - Weight-space symmetries • Sign-flip symmetry • If we change the sign of all of the weights and the bias feeding into a particular hidden unit, then, for a given input pattern, the sign of the activation of that hidden unit is reversed. • This is compensated by also changing the sign of all of the weights leading out of that hidden unit, since tanh(-a) = -tanh(a). • With M hidden units there are M independent sign flips, so any given weight vector is one of a set of 2^M equivalent weight vectors. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  7. Feed-forward Network Functions - Weight-space symmetries (Cont’d) • Interchange symmetry • We can interchange the values of all of the weights (and the bias) leading both into and out of a particular hidden unit with the corresponding values of the weights (and bias) associated with a different hidden unit. • This clearly leaves the network input-output mapping function unchanged. • For M hidden units, permutations alone give a set of M! equivalent weight vectors; combined with the sign-flip symmetries, the overall weight-space symmetry factor is M!·2^M. A numerical check of both symmetries is sketched below. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
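A small self-contained numerical check of both symmetries (tanh hidden units, linear outputs; the network sizes and random weights are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(1)
D, M, K = 4, 3, 2
W1, b1 = rng.normal(size=(M, D)), rng.normal(size=M)
W2, b2 = rng.normal(size=(K, M)), rng.normal(size=K)
x = rng.normal(size=D)

def net(W1, b1, W2, b2):
    return W2 @ np.tanh(W1 @ x + b1) + b2

y = net(W1, b1, W2, b2)

# Sign-flip symmetry on hidden unit 0: negate its incoming weights/bias
# and its outgoing weights; the output is unchanged.
W1f, b1f, W2f = W1.copy(), b1.copy(), W2.copy()
W1f[0], b1f[0], W2f[:, 0] = -W1f[0], -b1f[0], -W2f[:, 0]
assert np.allclose(y, net(W1f, b1f, W2f, b2))

# Interchange symmetry: swap hidden units 0 and 1, incoming and outgoing.
perm = [1, 0, 2]
assert np.allclose(y, net(W1[perm], b1[perm], W2[:, perm], b2))
```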

  8. Network Training • For regression, assume the target t has a Gaussian distribution with an x-dependent mean: p(t|x, w) = N(t | y(x, w), β^{-1}) • Likelihood function for N i.i.d. observations: p(t|X, w, β) = Π_n N(t_n | y(x_n, w), β^{-1}) • Negative log-likelihood: (β/2) Σ_n {y(x_n, w) - t_n}^2 - (N/2) ln β + (N/2) ln 2π • Maximizing the likelihood function with respect to w is therefore equivalent to minimizing the sum-of-squares error function E(w) = (1/2) Σ_n {y(x_n, w) - t_n}^2 (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  9. Network Training (Cont’d) • The choice of output-unit activation function and matching error function • Standard regression problems: • Error function: sum-of-squares (negative Gaussian log-likelihood) • Output: identity • For (multiple independent) binary classification problems: • Error function: cross-entropy error function • Output: logistic sigmoid • For multiclass problems: • Error function: multiclass cross-entropy • Output: softmax activation function (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
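As a sketch of why these pairings are natural: with the matched softmax/cross-entropy combination, the derivative of the error with respect to the output activations reduces to y_k - t_k. The example values and the finite-difference check below are mine:

```python
import numpy as np

def softmax(a):
    a = a - a.max()                  # numerical stabilisation
    e = np.exp(a)
    return e / e.sum()

a = np.array([2.0, -1.0, 0.5])       # output-unit activations
t = np.array([1.0, 0.0, 0.0])        # one-hot target
y = softmax(a)

def ce(a):
    # Multiclass cross-entropy error E = -sum_k t_k ln y_k
    return -np.sum(t * np.log(softmax(a)))

# With the matched softmax/cross-entropy pair, dE/da_k = y_k - t_k.
eps = 1e-6
grad_fd = np.array([(ce(a + eps * np.eye(3)[k]) - ce(a - eps * np.eye(3)[k])) / (2 * eps)
                    for k in range(3)])
assert np.allclose(grad_fd, y - t, atol=1e-6)
```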

  10. Network Training - Parameter optimization • The error E(w) is a smooth continuous function of w, and its smallest value occurs at a point in weight space where the gradient vanishes. • Global minimum vs. local minima • Most techniques involve choosing some initial value w^(0) for the weight vector and then moving through weight space in a succession of steps of the form w^(τ+1) = w^(τ) + Δw^(τ) • Many algorithms make use of gradient information. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  11. Network Training - Local quadratic approximation • Taylor expansion of E(w) around some point ŵ in weight space: E(w) ≃ E(ŵ) + (w - ŵ)^T b + (1/2)(w - ŵ)^T H (w - ŵ), where b = ∇E|_{ŵ} is the gradient and H = ∇∇E|_{ŵ} is the Hessian matrix. • The corresponding local approximation to the gradient: ∇E ≃ b + H(w - ŵ) • In the particular case of a quadratic approximation around a minimum w*, the gradient term vanishes: E(w) ≃ E(w*) + (1/2)(w - w*)^T H (w - w*) (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
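A quick numerical illustration of the quadratic approximation, using a toy error surface of my own choosing (the discrepancy is third order in the displacement):

```python
import numpy as np

# Toy error surface E(w) = sum(w**4), expanded quadratically around w_hat.
def E(w):    return np.sum(w ** 4)
def grad(w): return 4 * w ** 3
def hess(w): return np.diag(12 * w ** 2)

w_hat = np.array([0.5, -1.0])
dw = np.array([0.01, -0.02])                 # a small displacement
E_quad = E(w_hat) + grad(w_hat) @ dw + 0.5 * dw @ hess(w_hat) @ dw
print(abs(E(w_hat + dw) - E_quad))           # third order in dw, roughly 3e-5 here
```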

  12. Network Training - Use of gradient information • In the quadratic approximation, the computational cost of locating the minimum without gradient information is O(W^3) • W is the dimensionality of w • we would perform O(W^2) function evaluations, each of which requires O(W) steps. • In an algorithm that makes use of gradient information, the computational cost is O(W^2) • By using error backpropagation, only O(W) gradient evaluations are needed, and each such evaluation takes only O(W) steps. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  13. Network Training - Gradient descent optimization • The weight update comprises a small step in the direction of the negative gradient: w^(τ+1) = w^(τ) - η∇E(w^(τ)), where η > 0 is the learning rate. • Batch methods • Techniques that use the whole data set at once. • The error function always decreases at each iteration unless the weight vector has arrived at a local or global minimum. • On-line version of gradient descent • Sequential gradient descent or stochastic gradient descent • Updates the weight vector based on one data point at a time. • Can handle redundancy in the data much more efficiently. • Offers the possibility of escaping from local minima. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
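A schematic comparison of the two update schemes on a toy linear least-squares problem (data, model and step size are illustrative only, not from the slides):

```python
import numpy as np

# Batch vs. stochastic (sequential) gradient descent for a sum-of-squares
# error E(w) = 1/2 * sum_n (x_n . w - t_n)^2 over a linear model.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
eta = 0.01

# Batch gradient descent: each step uses the gradient of the whole data set.
w = np.zeros(3)
for _ in range(200):
    grad = X.T @ (X @ w - t)             # sum of per-point gradients
    w -= eta * grad / len(X)

# Stochastic gradient descent: update after every single data point.
w_sgd = np.zeros(3)
for _ in range(20):                      # 20 passes through the data
    for n in rng.permutation(len(X)):
        grad_n = (X[n] @ w_sgd - t[n]) * X[n]
        w_sgd -= eta * grad_n
```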

  14. Error Backpropagation - Evaluation of error-function derivatives • Error backpropagation procedure: • Apply an input vector x_n to the network and forward propagate through the network using a_j = Σ_i w_ji z_i and z_j = h(a_j) to find the activations of all the hidden and output units. • Evaluate δ_k = y_k - t_k for all the output units. • Backpropagate the δ's using δ_j = h'(a_j) Σ_k w_kj δ_k to obtain δ_j for each hidden unit in the network. • Use ∂E_n/∂w_ji = δ_j z_i to evaluate the required derivatives. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
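A direct transcription of these four steps for a two-layer tanh network with linear outputs and sum-of-squares error (the function name and return convention are mine):

```python
import numpy as np

def backprop(x, t, W1, b1, W2, b2):
    """One backpropagation pass; returns the gradients of E_n with respect
    to both weight matrices (bias gradients equal the corresponding deltas)."""
    # 1. Forward propagation
    a1 = W1 @ x + b1
    z = np.tanh(a1)
    y = W2 @ z + b2
    # 2. Output-unit deltas
    delta_out = y - t
    # 3. Backpropagate to the hidden units: h'(a) = 1 - tanh(a)^2
    delta_hid = (1 - z ** 2) * (W2.T @ delta_out)
    # 4. Required derivatives: dE/dw_ji = delta_j * z_i
    gW2 = np.outer(delta_out, z)
    gW1 = np.outer(delta_hid, x)
    return gW1, delta_hid, gW2, delta_out
```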

  15. Error Backpropagation - A simple example • A two-layer network with tanh hidden units, linear outputs and sum-of-squares error E_n = (1/2) Σ_k (y_k - t_k)^2 (y_k: output unit k, t_k: the corresponding target). • Output-unit deltas: δ_k = y_k - t_k • Hidden-unit deltas, using h'(a) = 1 - tanh^2(a): δ_j = (1 - z_j^2) Σ_k w_kj δ_k • Derivatives: ∂E_n/∂w_ji^(1) = δ_j x_i and ∂E_n/∂w_kj^(2) = δ_k z_j (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  16. Error Backpropagation - Efficiency of backpropagation • Computational efficiency of backpropagation: O(W) per data point. • An alternative approach for computing derivatives of the error function: finite differences. • The accuracy of the approximation to the derivatives can be improved by making ε smaller, until numerical roundoff becomes the limiting factor. • Accuracy is improved further by using symmetrical central differences, but the overall cost remains O(W^2): each of the W derivatives needs forward propagations costing O(W). (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
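A sketch of such a central-difference check against backpropagation for a small bias-free two-layer network (the setup is my own; this is the standard way to verify a backpropagation implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
x, t = rng.normal(size=4), rng.normal(size=2)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))

def error(W1, W2):
    y = W2 @ np.tanh(W1 @ x)
    return 0.5 * np.sum((y - t) ** 2)

# Central differences: two error evaluations (each O(W)) per weight -> O(W^2).
eps = 1e-6
grad_fd = np.zeros_like(W1)
for j in range(W1.shape[0]):
    for i in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[j, i] += eps
        Wm[j, i] -= eps
        grad_fd[j, i] = (error(Wp, W2) - error(Wm, W2)) / (2 * eps)

# Backpropagation gives the same derivatives in O(W) per data point.
z = np.tanh(W1 @ x)
delta_hid = (1 - z ** 2) * (W2.T @ (W2 @ z - t))
assert np.allclose(grad_fd, np.outer(delta_hid, x), atol=1e-6)
```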

  17. Error Backpropagation - The Jacobian matrix • The technique of backpropagation can also be applied to the calculation of other derivatives. • The Jacobian matrix • Elements are given by the derivatives of the network outputs w.r.t. the inputs: J_ki = ∂y_k/∂x_i • Useful in systems built from several modules: when minimizing an error function E w.r.t. a parameter w of an earlier module, the derivative passes through the network via the Jacobian, ∂E/∂w = Σ_{k,j} (∂E/∂y_k)(∂y_k/∂z_j)(∂z_j/∂w). (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  18. Error Backpropagation - The Jacobian matrix (Cont’d) • A measure of the local sensitivity of the outputs to changes in each of the input variables: Δy_k ≃ Σ_i (∂y_k/∂x_i) Δx_i • In general, the network mapping is nonlinear, and so the elements J_ki will not be constants; the linear relation above is valid provided the |Δx_i| are small. • The evaluation of the Jacobian matrix by backpropagation: J_ki = Σ_j w_ji ∂y_k/∂a_j, with the recursion ∂y_k/∂a_j = h'(a_j) Σ_l w_lj ∂y_k/∂a_l started at the output units (for a sigmoidal output activation function, ∂y_k/∂a_l = δ_kl σ'(a_l)). (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
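A sketch of the Jacobian for a small tanh network with linear outputs, checked against central differences; for this particular architecture the Jacobian reduces to W2·diag(h'(a))·W1 (the network sizes and weights are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
D, M, K = 3, 5, 2
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(K, M))
x = rng.normal(size=D)

def outputs(x):
    return W2 @ np.tanh(W1 @ x)

# Analytic Jacobian J_ki = dy_k / dx_i for this two-layer architecture.
a = W1 @ x
J = W2 @ np.diag(1 - np.tanh(a) ** 2) @ W1

# Central-difference check, one column per input variable.
eps = 1e-6
J_fd = np.column_stack([
    (outputs(x + eps * np.eye(D)[i]) - outputs(x - eps * np.eye(D)[i])) / (2 * eps)
    for i in range(D)
])
assert np.allclose(J, J_fd, atol=1e-6)
```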

  19. The Hessian Matrix • The Hessian matrix H • The second derivatives of the error form the elements of H: H_ij = ∂^2E/∂w_i∂w_j • Important roles: • Several nonlinear optimization algorithms are based on the second-order properties of the error surface. • Forms the basis of a fast procedure for re-training a feed-forward network following a small change in the training data. • The inverse of the Hessian is used to identify the least significant weights in a network (pruning algorithms). • Appears in the Laplace approximation for a Bayesian neural network. • Various approximation schemes are used to evaluate the Hessian matrix for a neural network. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  20. The Hessian Matrix - Diagonal approximation • The inverse of the Hessian is what is usually needed, and the inverse of a diagonal matrix is trivial to evaluate. • Consider an error function that is a sum of terms, E = Σ_n E_n. • The approximation replaces the off-diagonal elements of the Hessian with zeros. • The diagonal elements are ∂^2E_n/∂w_ji^2 = (∂^2E_n/∂a_j^2) z_i^2, where the backpropagation recursion for ∂^2E_n/∂a_j^2 is itself simplified by neglecting its off-diagonal terms, giving an O(W) cost per pattern. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  21. The Hessian Matrix - Outer product approximation (Levenberg–Marquardt) • For the sum-of-squares error E = (1/2) Σ_n (y_n - t_n)^2, the Hessian is H = ∇∇E = Σ_n ∇y_n (∇y_n)^T + Σ_n (y_n - t_n) ∇∇y_n • If the outputs y_n happen to be very close to the target values t_n, the second term will be small and can be neglected. • Alternatively, if the values y_n - t_n are uncorrelated with the second-derivative term, the whole term averages to zero. • This gives the outer-product approximation H ≃ Σ_n b_n b_n^T with b_n = ∇y_n = ∇a_n. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
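A sketch of accumulating the outer-product approximation for a single-output network; for brevity the per-pattern output gradients b_n are obtained here by central differences rather than by backpropagation (the network and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
D, M, N = 3, 4, 20
W1, W2 = rng.normal(size=(M, D)), rng.normal(size=(1, M))
w = np.concatenate([W1.ravel(), W2.ravel()])   # flattened weight vector
X = rng.normal(size=(N, D))

def output(w, x):
    W1 = w[:M * D].reshape(M, D)
    W2 = w[M * D:].reshape(1, M)
    return (W2 @ np.tanh(W1 @ x))[0]

# H ~ sum_n b_n b_n^T, with b_n the gradient of the output for pattern n.
eps = 1e-6
H_op = np.zeros((w.size, w.size))
for x in X:
    b = np.array([
        (output(w + eps * np.eye(w.size)[i], x)
         - output(w - eps * np.eye(w.size)[i], x)) / (2 * eps)
        for i in range(w.size)
    ])
    H_op += np.outer(b, b)                     # accumulate b_n b_n^T
```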

  22. The Hessian Matrix - Inverse Hessian • A procedure for approximating the inverse of the Hessian • First, we write the outer-product approximation H_N = Σ_{n=1}^{N} b_n b_n^T • Derive a sequential procedure for building up the Hessian by including data points one at a time: H_{L+1} = H_L + b_{L+1} b_{L+1}^T, whose inverse follows from the matrix-inversion identity H_{L+1}^{-1} = H_L^{-1} - (H_L^{-1} b_{L+1} b_{L+1}^T H_L^{-1}) / (1 + b_{L+1}^T H_L^{-1} b_{L+1}) • The initial matrix is H_0 = αI, so the procedure actually computes the inverse of H + αI; the results are not sensitive to the precise value of the small quantity α. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
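A sketch of the sequential inverse update (a Sherman-Morrison rank-one update), using random vectors as stand-ins for the per-pattern gradients b_n:

```python
import numpy as np

rng = np.random.default_rng(6)
W, N, alpha = 5, 50, 1e-3
B = rng.normal(size=(N, W))                      # rows play the role of b_n

H_inv = np.eye(W) / alpha                        # inverse of H_0 = alpha * I
for b in B:
    Hb = H_inv @ b
    H_inv -= np.outer(Hb, Hb) / (1.0 + b @ Hb)   # rank-one inverse update

# The result matches the direct inverse of the regularised outer-product sum.
H_direct = B.T @ B + alpha * np.eye(W)
assert np.allclose(H_inv, np.linalg.inv(H_direct), atol=1e-6)
```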

  23. The Hessian Matrix - Finite differences • Applying a symmetrical central-differences formulation directly to the error function requires O(W^3) operations to evaluate the complete Hessian: there are O(W^2) elements, and each evaluation of E costs O(W). • Applying central differences instead to the first derivatives of the error function, obtained by backpropagation, reduces the cost to O(W^2). (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  24. The Hessian Matrix - Exact evaluation of the Hessian • The Hessian can also be evaluated exactly, using an extension of the backpropagation technique used to evaluate the first derivatives. • Separate expressions are obtained for the three cases: both weights in the second layer, both weights in the first layer, and one weight in each layer. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/

  25. The Hessian Matrix - Fast multiplication by the Hessian • For many applications only the product v^T H is needed, so we try to find an efficient approach to evaluating v^T H directly, in O(W) operations. • v^T H = v^T ∇(∇E) (∇: the gradient operator in weight space) • Introduce the notation R{·} to denote the operator v^T ∇, so that R{w} = v. • The implementation of this algorithm involves the introduction of additional variables (the R{·} counterparts of the activations, hidden-unit outputs and deltas) for the hidden units and for the output units. • For each input pattern, the values of these quantities can be found using forward and backward propagation equations analogous to those of standard backpropagation. (C) 2007, SNU Biointelligence Lab, http://bi.snu.ac.kr/
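A sketch of a Hessian-vector product obtained without forming H, here via central differences of the gradient rather than the exact R{·} propagation (the toy error function is my own; the point is that only gradient evaluations, i.e. O(W) work each, are needed):

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(6, 6))
A = A @ A.T                                   # a fixed symmetric matrix

def grad_E(w):
    # Gradient of the toy error E(w) = 0.5 * w^T A w + 0.25 * sum(w**4)
    return A @ w + w ** 3

w = rng.normal(size=6)
v = rng.normal(size=6)

# H v ~ [grad E(w + eps*v) - grad E(w - eps*v)] / (2*eps): two gradient passes.
eps = 1e-6
Hv = (grad_E(w + eps * v) - grad_E(w - eps * v)) / (2 * eps)

# For this toy E the exact Hessian is A + diag(3 w^2), so we can verify:
assert np.allclose(Hv, (A + np.diag(3 * w ** 2)) @ v, atol=1e-5)
```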
