370 likes | 583 Views
2806 Neural Computation Single Layer Perceptron Lecture 3. 2005 Ari Visa. Agenda. Some historical notes Linear adaptive filters Rosenblatt’s perceptron C onclusion s. Some historical notes . Rosenblatt’s perceptron 1958: a single neuron with adjustable synaptic weights and bias
E N D
2806 Neural ComputationSingle Layer Perceptron Lecture 3 2005 Ari Visa
Agenda • Some historical notes • Linear adaptive filters • Rosenblatt’s perceptron • Conclusions
Some historical notes Rosenblatt’s perceptron 1958: a single neuron with adjustable synaptic weights and bias Perceptron convergence theorem 1962 (Rosenblatt): If patterns used to train the perceptron are drawn from two linearly separable classes, then the algorithm converges and positions the decision surface in the form of hyperplane between two classes.
Some historical notes • A single neuron -> an adaptive filter • Widrow & Hoff (1960): least-mean square algorithm (LMS) = delta rule • The problem of designing an optimum linear filter: Kolmogorov 1942, Wiener 1949, Zadeh 1953, Gabor 1954
Linear Adaptive Filters Consider a dynamic system with an unknown mathematical characterization A set of labeled input-output data generated by the system at discrete instants of time at some uniform rate is available
Linear Adaptive Filters The problem is also known as system identification (in control theory). An adaptive filter: 1) Filtering process (involves the computation of two signals; an output y(i) and an error signal e(i) = d(i)-y(i)) 2) Adaptive process (the automatic adjustment of the synaptic weights of the neuron in accordance with the error signal e(i)).
Linear Adaptive Filters • Note that y(i) = v(i) = kmwk(i)xk(i) is written into matrix form y(i) = xT(i)w(i) • The manner in which e(i) = d(i)-y(i) is used to control the adjusments to the synaptic weights is determined by the cost function used to derive the adaptive filtering algorithm.
Linear Adaptive Filters • unconstrained optimization problem: • Minimize the cost function E(w) with respect to the weight vector w. • Operation analysis -> the necessary condition for optimality is E(w*) = 0. • The strategy is to use iterative steepest descent: Starting with an initial guess w(0), generate a sequence of weight vectors w(1), w(2), ... , such that the cost function E(w) is reduced at each iteration of the algorithm as E(w(n+1) < Ew(n)).
Linear Adaptive Filters • 1) Method of Steepest Descent • The direction of steepest descent is in direction opposite to the gradient vector g = E(w) • w(n+1) = w(n) –g(n) • is the stepsize or learning-rate parameter
Linear Adaptive Filters • The method of steepest descent converges to the optimal solution w* slowly. • The learning-rate parameter has an influence on the convergence behaviour: - When is small, the transient response of the algorithm is overdamped (3.2a) - When is large, the transient response of the algorithm is underdamped (3.2b) • - When exceeds a certain critical value, the algorithm becomes unstable.
Linear Adaptive Filters • 2) Newton’s method • The idea is to minimize the quadratic approximation of the cost function E(w) around the current point w(n). • Using a second-order Taylor series expansion of the cost function around the point w(n). • Ew(n) gT(n)w(n) +½ wT(n) H(n) w(n) • g(n) is the m-by-1 gradient vector of the cost function E(w) evaluated at the point w(n). The matrix H(n) is the m-by-m Hessian matrix of E(w) (second derivative), H = ²E(w)
Linear Adaptive Filters • H = ²E(w) requires the cost function E(w) to be twice continuously differentiable with respect to the elements of w. • Differentiating Ew(n) gT(n)w(n) +½ wT(n) H(n) w(n) with respect to w, the change E(n) is minimized when • g(n) + H(n)w(n) = 0 →w(n) = H-1(n)g(n) • w(n+1) = w(n) + w(n) • w(n+1) = w(n)+H-1(n)g(n) • where H-1(n) is the inverse of the Hessian of E(w).
Linear Adaptive Filters • Newton’s method converges quickly asymtotically and does not exhibit the zigzagging behavior. • Newton’s method requires that the Hessian H(n) has to be a positive definite matrix for all n!
Linear Adaptive Filters • 3) Gauss-Newton Method • It is applicable to a cost function that is expressed as the sum of error squares. • E(w) = ½i=1ne²(i), note that all the error terms are calculated on the basis of a weight vector w that is fixed over the entire observation interval 1 i n. • The error signal e(i) is a function of the adjustable weight vector w. Given an operating point w(n), we linearize the dependence of e(i) on w by writing e’(i,w) = e(i) + [e(i)/w]Tw=w(n) (w-w(n)), i=1,2,...,n
Linear Adaptive Filters e’(n,w) = e(n) + J(n)(w-w(n)) where e(n) is the error vector e(n) = [e(1),e(2),...,e(n)]T and J(n) is the n-by-m Jacobian matrix of e(n) (The Jacobian J(n) is the transpose of the m-by-n gradient matrix e(n), where e(n) =[e(1), e(2), ...,e(n)]. w(n+1) = arg min w {½e’(n,w)²} = ½e(n)² +eT(n)J(n)(w-w(n)) + ½(w-w(n))TJT(n)J(n)(w-w(n)) Differentiating the expression with respect w and setting the result equal to zero
Linear Adaptive Filters JT(n)e(n) + JT(n)e(n) (w-w(n)) = 0 w(n+1) = w(n) – (JT(n)J(n))-1JT(n)e(n) The Gauss-Newton requires only the Jacobian matrix of the error function e(n). For the Gauss-Newton iteration to be computable, the matrix product JT(n)J(n) must be nonsigular. JT(n)J(n) is always nonnegative definite but to ensure that it is nonsingular, the Jacobian J(n) must have row rank n. → add the diagonal matrix I to the matrix JT(n)J(n), the parameter is a small positive constant.
Linear Adaptive Filters • JT(n)J(n)+ I ; positive definite for all n. • -> The Gauss-Newton method is implemented in the following form: w(n+1) = w(n) – (JT(n)J(n) + I )-1JT(n)e(n) • This is the solution to the modified cost function: • E(w) = ½{w-w(0)²+ i=1ne²(i)} • where w(0) is the initial value of w.
Linear Adaptive Filters Linear Least-Square Filter: - the single neuron is linear - the cost function e(w) used to design the filter consists of the sum of error squares e(n) = d(n) –X(n)w(n), where d(n) is the n-by-1 desired response vector and X(n) is the n-by-m data matrix.
Linear Adaptive Filters w(n+1) =w(n) + (XT(n)X(n))-1XT(n)(d(n) – X(n)w(n)) = (XT(n)X(n))-1XT(n)d(n) where (XT(n)X(n))-1XT(n) is recognized as the pseudoinverse of the data matrix X(n) The weight vector w(n+1) solves the linear least-squares problem defined over an observation interval of duration n.
Linear Adaptive Filters Wiener Filter: ergodic environment that is also stationary -> We may substitute long-term sample averages/time-averages for expectations/ensemble averages Correlation matrix of the input vector x(i): Rx Cross-correlation vector between the input vector x(i) and desired response d(i): rxd
Linear Adaptive Filters Rx = E[x(i)xT(i)] = lim n 1/n XT(n)X(n) rxd= E[x(i)d(i)] = lim n 1/n XT(n)d(n) wo = lim nw(n+1) = Rx-1rxd where Rx-1is the inverse of the correlation matrix Rx. The weight vector wo is called the Wiener solution to the linear optimum filtering problem.
Linear Adaptive Filters For an ergodic process, the linear least-squares filter asymtotically approaches the Wiener filter as the number of observations approaches infinity. • The least-mean square (LMS) algorithm is based on the use of instantaneous values for the cost function. • E(w) = ½e²(n), where e(n) is the error signal measured at time n.
Linear Adaptive Filters • e(n) = d(n) - xT(n)w(n) -> an estimate for the gradient vector g^(n) = -x(n)e(n) • -> w^(n+1) = w^(n) +x(n)e(n) • where is the learning -rate parameter
Linear Adaptive Filters The feedback loop around the weight vector w^(n) in the LMS algorithm behaves like a low-pass filter. The inverse of the learning-rate parameter is a measure of the memory of the LMS algorithm.
Linear Adaptive Filters w^(n+1) = w^(n) + x(n)[d(n) - xT(n) w^(n)] w^(n) = z-1[w^(n+1)] A stochastic feedback system The learning-rate parameter and the input vector x(n) that determine the transmittance of the feedback loop
Linear Adaptive Filters The convergence issue that matters is convergence in the mean square E[e²(n)] constant as n Assumptions: 1) the successive input vectors are statistically independent of each other 2) At time step n the input vector is statistically independent of all previous samples of the desired response.
Linear Adaptive Filters 3) At time step n, the desired response is dependent on the input, but statistically independent of all previous values of the desired response. 4) The input vector and the desired response are drawn from Gaussian distributed populations. The convergence issue is guaranted if 0 < < 2/(sum of mean-square values of the sensor inputs) The LMS algorithm is model independent and therefore robust, which means that small model uncertainy and small disturbances can only result in small estimation errors.
Linear Adaptive Filters LMS algorithm is slow in convergence and sensitive to variations in the eigenstructure of the input. The LMS algorithm typically requires a number of iteration equal to about ten times the dimensionality of the input space for it to reach a steady-state condition. The LMS algorithm is particularly sensitive to variations in the condition number (=eigenvalue spread) of the correlarion matrix Rx of the input vector x. (Rx)maxmin
Linear Adaptive Filters The learning curve is a plot of the mean-square value of the estimation error Eav(n) versus the number of iterations n. Rate of convergence; the number of iterations n required for Eav(n) to decrease to some arbitrarily chosen value Misadjustment M = E()/Emin -1
Linear Adaptive Filters Difficulties encountered with the LMS algorithm may be attributed to the fact that the learning-rate parameter is maintained constant. (n) = c/n where c is a conctant. Search-then-converge schedule (n) = 0/(1+(n/t)) where t is a search time constant
Perceptron • Perceptron is based on a nonlinear neuron. • v = i=1mwixi + b • In the simplest form of the perceptron there are two decision regions reparated by a hyperplane.
Perceptron x(n) = [+1,x1(n),x2(n),...,xm(n)]T w(n) = [b(n),w1(n),w2(n),...,xm(n)]T v(n) = wT(n)x(n) defines a hyperplane
1. If the nth member of the training set, x(n), is correctly classified by the weight vector w(n) computed at the nth iteration of the algorithm, no correction is made to the weight vector of the perceptron in accordance with the rule: w(n+1) =w(n) if wTx(n) >0 and x(n) belongs to class L1 w(n+1) =w(n) if wTx(n) 0 and x(n) belongs to class L1 Perceptron
2. Otherwise the weight vector of the perceptron is uppdated in accordance with the rule w(n+1) =w(n) -(n)x(n) if wT(n)x(n) >0 and x(n) belongs to class L2 w(n+1) =w(n) + (n)x(n) if wT(n)x(n) 0 and x(n) belongs to class L1 where the learning-rate parameter (n) controls the adjustment applied to the weight vector at iteration n. Perceptron
Perceptron Fixed-increment convergence theorem: Let the subsets of training vectors H1 and H2 be linearly separable. Let the inputs presented to the perceptron originate from these two subsets. The perceptron converges after some n0 iterations, in the sense that w(n0) =w(n0 +1)=w(n0 +2)=... is a solution vector for n0 nmax .
Perceptron • The Perceptron Convergence Algorithm
Summary The LMS algorithm uses a linear neuron. Continuous learning takes place. The perceptron uses the McCulloch-Pitts formal model of a neuron. The learning process is performed for a finite number of iterations and then stops.