CS 476: Networks of Neural Computation WK2 – Perceptron Dr. Stathis Kasderidis Dept. of Computer Science University of Crete Spring Semester, 2009
Contents • Elements of Optimisation Theory • Definitions • Properties of Quadratic Functions • Model Algorithm for Smooth Functions • Classical Optimisation Methods • 1st Derivative Methods • 2nd Derivative Methods • Methods for Quadratic functions • Other Methods Contents
Contents II • Perceptron Model • Convergence Theorem of Perceptron • Conclusions Contents
Definitions • Types of Optimisation Problems: • Unconstrained • Linear Constraints • Non-linear Constraints • General nonlinear constrained optimisation problem definition: • NCP: minimise F(x), x ∈ ℝ^m • subject to ci(x) = 0, i = 1..m' • ci(x) ≥ 0, i = m'+1..m Optimisation
Definitions II • Strong local minimum: A point x* is a SLM of NCP if there exists δ > 0 such that: • A1: F(x) is defined in N(x*, δ); and • A2: F(x*) < F(y) for all y ∈ N(x*, δ), y ≠ x* • Weak local minimum: A point x* is a WLM of NCP if there exists δ > 0 such that: • B1: F(x) is defined in N(x*, δ); • B2: F(x*) ≤ F(y) for all y ∈ N(x*, δ); and • B3: x* is not a strong local minimum Optimisation
Definitions III • UCP: minimise F(x), x ∈ ℝ^m • Necessary conditions for a minimum of UCP: • C1: ||g(x*)|| = 0, i.e. x* is a stationary point; • C2: G(x*) is positive semi-definite. • Sufficient conditions for a minimum of UCP: • D1: ||g(x*)|| = 0; • D2: G(x*) is positive definite. Optimisation
Definitions IV • Assume that we expand F in its Taylor series: • F(x + εp) = F(x) + ε g(x)^T p + (ε²/2) p^T G(x + εθp) p • where x, p ∈ ℝ^m, 0 ≤ θ ≤ 1, and ε is a positive scalar • Any vector p which satisfies: • g(x)^T p < 0 • is called a descent direction at x Optimisation
Properties of Quadratic Functions • Assume that a quadratic function is given by: • Φ(x) = c^T x + (1/2) x^T G x • for some constant vector c and a constant symmetric matrix G (the Hessian matrix of Φ). • The definition of Φ implies the following relation between Φ(x + αp) and Φ(x) for any vector x and scalar α: • Φ(x + αp) = Φ(x) + α (c + Gx)^T p + (α²/2) p^T G p Optimisation
Properties of Quadratic Functions I • The function Φ has a stationary point when: • ∇Φ(x*) = Gx* + c = 0 • Consequently a stationary point must satisfy the system of linear equations: • Gx* = -c • The system might have: • No solution (c is not a linear combination of the columns of G) • Many solutions (if G is singular) • A unique solution (if G is non-singular) Optimisation
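To make the stationary-point discussion concrete, here is a small NumPy sketch (not part of the original slides) that solves Gx* = -c for a made-up symmetric G and classifies the stationary point from the eigenvalues of G:

```python
import numpy as np

# Hypothetical quadratic Phi(x) = c^T x + 0.5 * x^T G x
G = np.array([[4.0, 1.0],
              [1.0, 3.0]])      # symmetric matrix
c = np.array([-1.0, 2.0])

# A stationary point satisfies G x* = -c
x_star = np.linalg.solve(G, -c)

# Classify the stationary point from the eigenvalues of G
eigvals = np.linalg.eigvalsh(G)
if np.all(eigvals > 0):
    kind = "global minimum"          # G positive definite
elif np.all(eigvals >= 0):
    kind = "weak local minimum"      # G positive semi-definite
elif np.all(eigvals < 0):
    kind = "global maximum"
else:
    kind = "saddle point"            # G indefinite

print("x* =", x_star, "->", kind)
```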
Properties of Quadratic Functions II • If x* is a stationary point it follows that: • Φ(x* + αp) = Φ(x*) + (α²/2) p^T G p • Hence the behaviour of Φ in a neighbourhood of x* is determined by the matrix G. Let λj and uj denote the j-th eigenvalue and eigenvector of G. By definition: • G uj = λj uj • The symmetry of G implies that the set {uj}, j = 1..m, is orthonormal. So, when p is equal to uj: • Φ(x* + αuj) = Φ(x*) + (α²/2) λj Optimisation
Properties of Quadratic Functions III • Thus the change in Φ when moving away from x* along the direction uj depends on the sign of λj: • If λj > 0, Φ strictly increases as |α| increases • If λj < 0, Φ is monotonically decreasing as |α| increases • If λj = 0, the value of Φ remains constant when moving along any direction parallel to uj • If G is positive definite, x* is the global minimum of Φ Optimisation
Properties of Quadratic Functions IV • If G is positive definite, x* is the global minimum of Φ • If G is positive semi-definite, a stationary point (if it exists) is a weak local minimum. • If G is indefinite and non-singular, x* is a saddle point and Φ is unbounded above and below Optimisation
Model Algorithm for Smooth Functions • Algorithm U (Model algorithm for m-dimensional unconstrained minimisation) • Let xk be the current estimate of x* • U1: [Test for convergence] If the conditions are satisfied, the algorithm terminates with xk as the solution; • U2: [Compute a search direction] Compute a non-zero m-vector pk, the direction of search Optimisation
Model Algorithm for Smooth Functions I • U3: [Compute a step length] Compute a positive scalar αk, the step length, for which it holds that: • F(xk + αk pk) < F(xk) • U4: [Update the estimate of the minimum] Set: • xk+1 ← xk + αk pk, • k ← k + 1 • and go back to step U1. • To satisfy the descent condition (Fk+1 < Fk), pk should be a descent direction: • gk^T pk < 0 Optimisation
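As an illustration of Algorithm U (a minimal sketch, not prescribed by the lecture), the loop below uses steepest descent as the search direction and simple backtracking as the step-length choice; the stopping test and the toy quadratic are assumptions for the example:

```python
import numpy as np

def model_algorithm_U(F, grad, x0, tol=1e-6, max_iter=200):
    """Generic descent loop: test convergence (U1), pick a search
    direction (U2), pick a step length (U3), update the estimate (U4)."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        # U1: test for convergence (here: small gradient norm)
        if np.linalg.norm(g) < tol:
            break
        # U2: compute a descent direction (steepest descent here)
        p = -g                              # satisfies g^T p < 0
        # U3: compute a step length by simple backtracking
        alpha = 1.0
        while F(x + alpha * p) >= F(x):
            alpha *= 0.5
            if alpha < 1e-12:
                return x                    # no further progress possible
        # U4: update the estimate and continue
        x = x + alpha * p
    return x

# Usage on a toy quadratic F(x) = 0.5 * x^T diag(1, 10) x
F = lambda x: 0.5 * x @ np.diag([1.0, 10.0]) @ x
grad = lambda x: np.diag([1.0, 10.0]) @ x
print(model_algorithm_U(F, grad, np.array([1.0, 1.0])))
```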
Classical Optimisation Methods: 1st derivative • 1st Derivative Methods: • A linear approximation to F about xk is: • F(xk + p) ≈ F(xk) + gk^T p • Steepest Descent Method: Select pk as: • pk = -gk • It can be shown that, for a quadratic F with exact line search: • F(xk+1) - F(x*) ≤ [(κ - 1)/(κ + 1)]² [F(xk) - F(x*)] Optimisation
Classical Optimisation Methods I: 1st derivative • where κ = λmax/λmin is the spectral condition number of G • The steepest descent method can be very slow if κ is large! • Other 1st derivative methods include: • Discrete Newton method: Approximates the Hessian, G, with finite differences of the gradient g • Quasi-Newton methods: Approximate the curvature: • sk^T G sk ≈ (g(xk + sk) - g(xk))^T sk Optimisation
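A hedged numerical illustration of the condition-number effect: steepest descent with the exact line search for quadratics, run on two made-up diagonal Hessians, one well conditioned and one ill conditioned:

```python
import numpy as np

def steepest_descent_quadratic(G, c, x0, iters=50):
    """Steepest descent on Phi(x) = c^T x + 0.5 x^T G x with the
    exact line search alpha = (g^T g) / (g^T G g)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        g = G @ x + c
        if np.allclose(g, 0):
            break
        alpha = (g @ g) / (g @ G @ g)
        x = x - alpha * g
    return x

for kappa in (2.0, 1000.0):                 # spectral condition numbers
    G = np.diag([1.0, kappa])
    c = np.zeros(2)
    x = steepest_descent_quadratic(G, c, np.array([1.0, 1.0]), iters=50)
    print(f"kappa={kappa:7.1f}  ||x - x*|| = {np.linalg.norm(x):.2e}")

# After the same number of iterations the error is far larger for the
# ill-conditioned problem, illustrating the (kappa-1)/(kappa+1) rate.
```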
Classical Optimisation Methods II: 2nd derivative • 2nd Derivative Methods: • A quadratic approximation to F about xk is: • F(xk + p) ≈ F(xk) + gk^T p + (1/2) p^T Gk p • The function is minimised by finding the minimum of this quadratic approximation. • This has a solution that satisfies the linear system: • Gk pk = -gk Optimisation
Classical Optimisation Methods III: 2nd derivative • The vector pk in the previous equation is called the Newton direction and the method is called the Newton method • Conjugate Gradient Descent: Another 2nd derivative method Optimisation
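A minimal sketch of a single Newton step, solving Gk pk = -gk; the toy quadratic objective is an assumption, and for a quadratic one Newton step reaches the minimiser exactly:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: solve G_k p_k = -g_k and move to x + p_k.
    Assumes the Hessian is positive definite at x."""
    g = grad(x)
    G = hess(x)
    p = np.linalg.solve(G, -g)      # Newton direction
    return x + p

# For a quadratic objective a single Newton step lands on the minimiser
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x + b          # gradient of 0.5 x^T A x + b^T x
hess = lambda x: A                  # constant Hessian
print(newton_step(grad, hess, np.array([5.0, -5.0])))
```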
Classical Optimisation Methods IV: Methods for Quadratics • In many problems the function F(x) is a sum of squares: • F(x) = (1/2) Σi fi(x)² = (1/2) ||f(x)||² • The i-th component of the m-vector f(x) is the function fi(x), and ||f(x)|| is called the residual at x. • Problems of this type appear in nonlinear parameter estimation. Assume that x ∈ ℝ^n is a parameter vector and ti is the set of independent variables. Then the least-squares problem is: • minimise F(x) = (1/2) Σi (Φ(x, ti) - yi)², x ∈ ℝ^n Optimisation
Classical Optimisation Methods V: Methods for Quadratics • where: • fi(x) = Φ(x, ti) - yi and • y = Φ(t) is the "true" function • yi are the desired responses • In the Least Squares Problem the gradient, g, and the Hessian, G, have a special structure. • Assume that the Jacobian matrix of f(x) is denoted by J(x) (an m×n matrix) and let the matrix Gi(x) denote the Hessian of fi(x). Then: Optimisation
Classical Optimisation Methods VI: Methods for Quadratics • g(x) = J(x)^T f(x) and • G(x) = J(x)^T J(x) + Q(x) • where Q is: • Q(x) = Σi fi(x) Gi(x) • We observe that the Hessian is a special combination of first and second order information. • Least-squares methods are based on the premise that eventually the first order term J(x)^T J(x) dominates the second order one, Q(x) Optimisation
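The structure g = J^T f and G = J^T J + Q can be checked numerically. The residual functions below are made up for illustration; Q(x) is assembled explicitly as Σi fi(x) Gi(x):

```python
import numpy as np

# Made-up residuals f(x) = (x0^2 - 1, x0*x1, x1 - 2) to illustrate
# g = J^T f and G = J^T J + Q, with Q = sum_i f_i(x) * G_i(x).
def f(x):
    return np.array([x[0]**2 - 1.0, x[0]*x[1], x[1] - 2.0])

def J(x):                           # Jacobian of f (m-by-n)
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    1.0]])

def Q(x):                           # sum_i f_i(x) * Hessian of f_i
    G1 = np.array([[2.0, 0.0], [0.0, 0.0]])   # Hessian of f_1
    G2 = np.array([[0.0, 1.0], [1.0, 0.0]])   # Hessian of f_2
    G3 = np.zeros((2, 2))                      # Hessian of f_3
    fi = f(x)
    return fi[0]*G1 + fi[1]*G2 + fi[2]*G3

x = np.array([1.5, 0.5])
g = J(x).T @ f(x)                   # gradient of 0.5 * ||f(x)||^2
G = J(x).T @ J(x) + Q(x)            # exact Hessian of 0.5 * ||f(x)||^2
print(g, "\n", G)
```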
Classical Optimisation Methods VII: Methods for Quadratics • The Gauss-Newton Method: • Let xk denote the current estimate of the solution; a quantity subscripted by k will denote that quantity evaluated at xk. From the Newton direction we get: • (Jk^T Jk + Qk) pk = -Jk^T fk • Let the vector pN denote the Newton direction. If ||fk|| → 0 as xk → x*, the matrix Qk → 0. Thus the Newton direction can be approximated by the solution of the equations: • Jk^T Jk pk = -Jk^T fk Optimisation
Classical Optimisation Methods VIII: Methods for Quadratics • The solution of the above problem is given by the solution to the linear least-squares problem: • minimise ||Jk p + fk||², p ∈ ℝ^n • The solution is unique if Jk has full rank. The vector pGN which solves the linear problem is called the Gauss-Newton direction. This vector approximates the Newton direction pN as ||Qk|| → 0 Optimisation
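A sketch of a Gauss-Newton iteration, assuming the direction is obtained from the linear least-squares problem min_p ||Jk p + fk|| (here via np.linalg.lstsq); the exponential model and the data are made-up examples:

```python
import numpy as np

def gauss_newton(residuals, jacobian, x0, iters=20):
    """Gauss-Newton: at each step solve the linear least-squares
    problem  min_p ||J_k p + f_k||  and set x <- x + p."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        f, J = residuals(x), jacobian(x)
        p, *_ = np.linalg.lstsq(J, -f, rcond=None)   # Gauss-Newton direction
        x = x + p
        if np.linalg.norm(p) < 1e-10:
            break
    return x

# Toy fit of phi(x, t) = x0 * exp(x1 * t) to made-up data (t_i, y_i)
t = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 1.6, 2.7, 4.4])
res = lambda x: x[0] * np.exp(x[1] * t) - y           # f_i(x) = phi(x, t_i) - y_i
jac = lambda x: np.column_stack([np.exp(x[1] * t),
                                 x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(res, jac, np.array([1.0, 0.1])))
```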
Classical Optimisation Methods IX: Methods for Quadratics • The Levenberg-Marquardt Method: • In this method the search direction is defined as the solution to the equations: • (Jk^T Jk + λk I) pk = -Jk^T fk • where λk is a non-negative scalar. A unit step is taken along pk, i.e. • xk+1 ← xk + pk • It can be shown that, for some scalar Δ related to λk, the vector pk is the solution to the constrained subproblem: Optimisation
Classical Optimisation Methods X: Methods for Quadratics • minimise ||Jk p + fk||², p ∈ ℝ^n • subject to ||p||₂ ≤ Δ • If λk = 0, pk is the Gauss-Newton direction; • If λk → ∞, ||pk|| → 0 and pk becomes parallel to the steepest descent direction Optimisation
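A small sketch of a single Levenberg-Marquardt step for an assumed Jacobian and residual, showing the two limits: λk = 0 recovers the Gauss-Newton direction, while a very large λk gives a short step roughly parallel to the steepest descent direction -J^T f:

```python
import numpy as np

def lm_step(J, f, lam):
    """One Levenberg-Marquardt step: solve (J^T J + lambda*I) p = -J^T f."""
    n = J.shape[1]
    return np.linalg.solve(J.T @ J + lam * np.eye(n), -J.T @ f)

# Illustration of the two limits on a made-up Jacobian and residual
J = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
f = np.array([0.5, -0.2, 0.1])

p_gn = lm_step(J, f, 0.0)       # lambda = 0: the Gauss-Newton direction
p_sd = lm_step(J, f, 1e6)       # large lambda: short step along -J^T f
print(p_gn)
print(p_sd, -J.T @ f)           # p_sd is (approximately) parallel to -J^T f
```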
Other Methods • Other methods are based on function values only. This category includes methods such as: • Genetic Algorithms • Simulated Annealing • Tabu Search • Guided Local Search • etc Optimisation
Additional References • Practical Optimisation, P. Gill, W. Murray, M. Wright, Academic Press, 1981. • Numerical Recipes in C/C++, Press et al., Cambridge University Press, 1988 Optimisation
Perceptron • The model was created by Rosenblatt • It uses the nonlinear McCulloch-Pitts neuron (with sgn as the transfer function) Perceptron
Perceptron Output • The output y is calculated by: • y = sgn(w^T x + b) = sgn(Σi wi xi + b) • where sgn(v) is defined as: • sgn(v) = +1 if v ≥ 0, -1 if v < 0 • The perceptron classifies an input vector x ∈ ℝ^m to one of two classes, C1 or C2 Perceptron
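A minimal sketch of the perceptron output computation (the weights, bias and input are made-up numbers):

```python
import numpy as np

def sgn(v):
    """Signum transfer function: +1 for v >= 0, -1 otherwise."""
    return 1 if v >= 0 else -1

def perceptron_output(w, b, x):
    """y = sgn(w^T x + b); classifies x into C1 (+1) or C2 (-1)."""
    return sgn(np.dot(w, x) + b)

# Illustrative weights and input
w = np.array([0.5, -1.0])
b = 0.2
print(perceptron_output(w, b, np.array([1.0, 0.3])))   # -> +1 (class C1)
```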
Decision Boundary • The decision boundary is the hyperplane w^T x + b = 0. The case presented above, in which a hyperplane separates the two classes, is called linearly separable classes Perceptron
Learning Rule • Assume that input vectors are drawn from two classes C1 and C2, i.e. • x1(1), x1(2), x1(3), … belong to C1; and • x2(1), x2(2), x2(3), … belong to C2 • Assume also that we redefine the vectors x(n) and w(n) so as to include the bias, i.e. • x(n) = [+1, x1(n), …, xm(n)]^T; and • w(n) = [b(n), w1(n), …, wm(n)]^T, with x, w ∈ ℝ^(m+1) Perceptron
Learning Rule II • Then there should exist a weight vector w such that: • w^T x > 0, when x belongs to class C1; and • w^T x ≤ 0, when x belongs to class C2 • We have arbitrarily chosen to include the case w^T x = 0 in class C2 • The algorithm for adapting the weights of the perceptron may be formulated as follows: Perceptron
Learning Rule III • If the nth member of the training set, x(n), is correctly classified by the current weight vector w(n), no correction is needed, i.e. • w(n+1) = w(n), if w^T x(n) > 0 and x(n) belongs to class C1; • w(n+1) = w(n), if w^T x(n) ≤ 0 and x(n) belongs to class C2 Perceptron
Learning Rule IV • Otherwise, the weight vector is updated according to the rule: • w(n+1) = w(n) - η(n)x(n), if w^T x(n) > 0 and x(n) belongs to class C2; • w(n+1) = w(n) + η(n)x(n), if w^T x(n) ≤ 0 and x(n) belongs to class C1 • The parameter η(n) is called the learning rate and controls the adjustment to the weight vector at iteration n. Perceptron
Learning Rule V • If we assume that the desired response is given by: • d(n) = +1 if x(n) belongs to class C1, -1 if x(n) belongs to class C2 • then we can re-write the adaptation rule in the form of an error-correction learning rule: • w(n+1) = w(n) + η[d(n) - y(n)]x(n) • where e(n) = d(n) - y(n) is the error signal • The learning rate η is a positive constant in the range 0 < η ≤ 1. Perceptron
Learning Rule VI • When we assign a value to η in the range (0,1] we must keep in mind two conflicting requirements: • Averaging of past inputs to provide stable weight estimates, which requires a small η • Fast adaptation with respect to real changes in the underlying distributions of the process responsible for the generation of the input vector x, which requires a large η Perceptron
Summary of the Perceptron Algorithm • Variables and Parameters: • x(n) = (m+1)-by-1 input vector = [+1, x1(n), …, xm(n)]^T • w(n) = (m+1)-by-1 weight vector = [b(n), w1(n), …, wm(n)]^T • b(n) = bias • y(n) = actual response • d(n) = desired response • η = learning rate in (0,1] Perceptron
Summary of the Perceptron Algorithm I • Initialisation: Set w(0) = 0. Then perform the following computations for time steps n = 1, 2, … • Activation: At time step n, activate the perceptron by applying the input vector x(n) and desired response d(n). • Computation of Actual Response: Compute the actual response of the perceptron by using: • y(n) = sgn(w^T(n) x(n)) • where sgn(·) is the signum function. Perceptron
Summary of the Perceptron Algorithm II • Adaptation of Weight Vector: Update the weight vector of the perceptron by using: • w(n+1) = w(n) + η[d(n) - y(n)]x(n) • where: • d(n) = +1 if x(n) belongs to class C1, -1 if x(n) belongs to class C2 • Continuation: Increment the time step n by one and go back to step 2. Perceptron
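Putting the summary together, a compact sketch of the full perceptron training loop (the toy linearly separable data, the epoch limit and the choice η = 1 are assumptions for the example):

```python
import numpy as np

def train_perceptron(X, d, eta=1.0, epochs=100):
    """Perceptron algorithm of the summary: augmented inputs
    [+1, x1..xm], w(0) = 0, update w <- w + eta*(d - y)*x on errors."""
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend +1 for the bias
    w = np.zeros(X_aug.shape[1])                        # initialisation w(0) = 0
    for _ in range(epochs):
        errors = 0
        for x, target in zip(X_aug, d):
            y = 1 if w @ x >= 0 else -1                 # actual response sgn(w^T x)
            if y != target:                             # adaptation only on error
                w = w + eta * (target - y) * x
                errors += 1
        if errors == 0:                                 # no misclassifications left
            break
    return w

# Toy linearly separable data: class C1 = +1, class C2 = -1
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
d = np.array([+1, +1, -1, -1])
print(train_perceptron(X, d))
```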
Perceptron Convergence Theorem • We present a proof of the fact that the perceptron needs only a finite number of steps in order to converge (i.e. to find the correct weight vector if this exists) • We assume that w(0) = 0. If this is not the case the proof still stands, but the number of steps that are needed for convergence is increased or decreased • Assume that the vectors which are drawn from the two classes C1 and C2 form two subsets, i.e. • H1 = {x1(1), x1(2), x1(3), …} belongs to C1; and • H2 = {x2(1), x2(2), x2(3), …} belongs to C2 Convergence
Perceptron Convergence Theorem I • Suppose that w(n)^T x(n) < 0 for n = 1, 2, … and the input vector belongs to H1 • Thus for η(n) = 1 we can write the weight update equation for these incorrectly classified inputs as: • w(n+1) = w(n) + x(n), for x(n) belonging to class C1 • Given the initial condition w(0) = 0, we can solve iteratively the above equation and obtain the result: • w(n+1) = x(1) + x(2) + … + x(n) (E.1) Convergence
Perceptron Convergence Theorem II • Since classes C1 and C2 are linearly separable, there exists a solution w0 such that w0^T x(n) > 0 for all vectors x(1), x(2), …, x(n) belonging to H1. For a fixed solution w0 we can define a positive number α as: • α = min_{x(n) ∈ H1} w0^T x(n) • Multiplying both sides of E.1 by w0^T we get: • w0^T w(n+1) = w0^T x(1) + w0^T x(2) + … + w0^T x(n) • So we have finally: • w0^T w(n+1) ≥ nα (E.2) Convergence
Perceptron Convergence Theorem III • We use the Cauchy-Schwarz inequality for two vectors, which states that: • ||w0||² ||w(n+1)||² ≥ [w0^T w(n+1)]² • where ||·|| denotes the Euclidean norm of the vector and the inner product w0^T w(n+1) is a scalar. Then from E.2 we get: • ||w0||² ||w(n+1)||² ≥ [w0^T w(n+1)]² ≥ n²α² • or alternatively: • ||w(n+1)||² ≥ n²α² / ||w0||² (E.3) Convergence
Perceptron Convergence Theorem IV • Now using: • w(k+1) = w(k) + x(k), for x(k) belonging to class C1, k = 1, …, n • and taking the Euclidean norm we get: • ||w(k+1)||² = ||w(k)||² + ||x(k)||² + 2 w(k)^T x(k) • which, under the assumption of wrong classification (i.e. w(k)^T x(k) < 0), leads to: • ||w(k+1)||² ≤ ||w(k)||² + ||x(k)||² Convergence
Perceptron Convergence Theorem V • or finally to: • ||w(k+1)||² - ||w(k)||² ≤ ||x(k)||² • Adding all these inequalities for k = 1, …, n and using the initial condition w(0) = 0 we get: • ||w(n+1)||² ≤ Σk ||x(k)||² ≤ nβ (E.4) • where β is a positive number defined as: • β = max_{x(k) ∈ H1} ||x(k)||² Convergence
Perceptron Convergence Theorem VI • E.4 states that the Euclidean norm of the vector w(n+1) grows at most linearly with the number of iterations n. • But this result is in conflict with E.3 for large enough n. • Thus n cannot be larger than some value nmax for which E.3 and E.4 are simultaneously satisfied with the equality sign. That is, nmax is the solution of the equation: • nmax² α² / ||w0||² = nmax β Convergence
Perceptron Convergence Theorem VII • Solving for nmax we get: • nmax = β ||w0||² / α² • This proves that the perceptron algorithm will terminate after a finite number of steps. However, observe that there exists no unique solution for nmax due to the non-uniqueness of w0 Convergence
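As a rough numerical illustration (not part of the proof), the bound nmax = β ||w0||² / α² can be evaluated for a made-up separable set H1 and one arbitrary separating vector w0; since w0 is not unique, neither is the bound:

```python
import numpy as np

# Made-up set of (augmented) input vectors, all taken to satisfy w0^T x > 0
H1 = np.array([[1.0, 2.0, 1.0],
               [1.0, 1.5, 2.0],
               [1.0, 3.0, 0.5]])
w0 = np.array([0.1, 1.0, 0.5])          # one (non-unique) separating weight vector

alpha = np.min(H1 @ w0)                 # alpha = min_n w0^T x(n)  (positive here)
beta = np.max(np.sum(H1**2, axis=1))    # beta = max_n ||x(n)||^2
n_max = beta * np.dot(w0, w0) / alpha**2
print(f"alpha={alpha:.3f}, beta={beta:.3f}, n_max={n_max:.1f}")
```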
Conclusions • There are many optimisation methods. We list the main classes in order of decreasing power: • 2nd derivative methods: e.g. Newton • 1st derivative methods: e.g. Quasi-Newton • Function value based methods: e.g. Genetic algorithms • The perceptron is a model which classifies an input vector to one of two exclusive classes C1 and C2 • The perceptron uses an error-correction style rule for weight update Conclusions
Conclusions I • The perceptron learning rule converges in a finite number of steps Conclusions