
COMP/EECE 7/8740 Neural Networks


Presentation Transcript


  1. COMP/EECE 7/8740 Neural Networks Approximations by Perceptrons and Single-Layer NNs Robert Kozma February 11, 2003

2. Linear Discriminant Function
• Input: X = [x_1, x_2, …, x_d]^T, a point in d-dimensional input space (T denotes a transposed vector or matrix)
• Linear combination of the input data: Y(X) = W^T X + w_0
• W = [w_1, w_2, …, w_d]^T -- weight vector
• w_0 = constant -- bias (offset)
• Equivalently: Y(X) = Σ_i w_i x_i + w_0
• Discriminant function (threshold at Y(X) = 0):
• Class 1 if Y(X) > 0
• Class 2 if Y(X) < 0
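A quick sketch (not part of the original slides) of this two-class linear discriminant in NumPy; the weight vector, bias, and input below are arbitrary placeholder values.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Y(X) = W^T X + w_0 for a single d-dimensional input x."""
    return np.dot(w, x) + w0

# Hypothetical example values (d = 3); not taken from the lecture.
w = np.array([0.5, -1.0, 2.0])   # weight vector W
w0 = 0.1                         # bias / offset
x = np.array([1.0, 0.0, -0.5])

y = linear_discriminant(x, w, w0)
print("Class 1" if y > 0 else "Class 2")   # threshold decision at Y(X) = 0
```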

3. Meaning of the Linear Discriminant
• [Figure: in the (x_1, x_2) plane the decision boundary Y(X) = 0 is a straight line separating Class 1 (Y > 0) from Class 2 (Y < 0)]
• If w_0 = 0, the boundary passes through the origin
• The bias w_0 gives the boundary's offset from the origin
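A small numerical check of the geometric statement, assuming the standard result that the perpendicular (signed) distance of the boundary Y(X) = 0 from the origin is -w_0 / ||W||; the weight vector and bias here are made up.

```python
import numpy as np

w = np.array([3.0, 4.0])   # hypothetical weight vector, ||W|| = 5
w0 = -10.0                 # hypothetical bias

# Signed distance of the decision boundary Y(X) = 0 from the origin.
distance = -w0 / np.linalg.norm(w)
print(distance)            # 2.0; with w0 = 0 the boundary would pass through the origin
```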

4. Nodal Representation of the Discriminant as a Neural Net
• Linear combination Y(X) = W^T X + w_0; the sign of the output (Y > 0 or Y < 0) gives the discriminant decision
• [Figure: a single output node Y(X) connected to inputs x_1, x_2, …, x_d through weights w_1, …, w_d, plus a bias input x_0 with weight w_0]

5. Single-Layer Neural Net: Multi-Class Recognition Problem
• One linear discriminant per class; K = number of classes, k = 1, …, K
• Y_k(X) = W_k^T X + w_k0
• Choose the maximum discriminant: assign X to class k if Y_k(X) = max_i {Y_i(X)}
• Y_k(X) = Σ_i w_ki x_i + w_k0, k = 1, …, K, where the summation runs over all d input dimensions
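A minimal sketch of the multi-class decision rule above; the number of classes, weights, and input are hypothetical values chosen only for illustration.

```python
import numpy as np

def multiclass_discriminant(x, W, w0):
    """Y_k(X) = W_k^T X + w_k0 for all K classes at once.

    W  : (K, d) matrix whose rows are the weight vectors W_k
    w0 : (K,)   vector of biases w_k0
    """
    y = W @ x + w0
    return int(np.argmax(y)), y   # index of the maximum discriminant, plus all Y_k values

# Hypothetical setup: K = 3 classes, d = 2 inputs (not from the lecture).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
w0 = np.array([0.0, -0.2, 0.1])
x = np.array([0.3, 0.9])

k, y = multiclass_discriminant(x, W, w0)
print(k, y)   # chosen class and the discriminant values Y_1(X), ..., Y_K(X)
```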

6. Multi-Class Recognition
• Y_k(X) = Σ_i w_ki x_i + w_k0, k = 1, …, K; decide by max_k {Y_k(X)}
• [Figure: single-layer network with output nodes Y_1(X), …, Y_K(X), input nodes x_1, x_2, …, x_d plus a bias input x_0, and weights w_ki connecting input i to output k]

7. Include Nonlinearity in the Output
• Y_k(X) = g(Σ_i w_ki x_i + w_k0), k = 1, …, K
• [Figure: the same single-layer network, with a nonlinear transfer function g(·) applied at each output node]

8. Motivation for the Nonlinear Transfer Function
• NN outputs are interpreted as posterior probabilities, which are not necessarily linear in the inputs
• For normal class-conditional probabilities one can derive
• Y_k = P(C_k | X) = 1 / (1 + exp(-A)), where A = W^T X + w_0
• This is called the sigmoid nonlinearity
• See Fig. 3.5
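A short illustration of the sigmoid output unit described above, assuming the standard logistic form g(a) = 1 / (1 + exp(-a)); the weights and input are placeholders.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical weights and input (any d-dimensional values would do).
w = np.array([1.5, -0.7])
w0 = 0.2
x = np.array([0.4, 1.1])

a = np.dot(w, x) + w0      # A = W^T X + w_0
posterior = sigmoid(a)     # interpreted as P(C_k | X), a value in (0, 1)
print(posterior)
```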

9. Derivation of the Decision Criterion
• Goal: minimize the probability of misclassification
• Bayes rule: choose C_k if P(C_k|X) > P(C_j|X) for every j ≠ k
• Equivalently: p(X|C_k) P(C_k) / p(X) > p(X|C_j) P(C_j) / p(X), where p(X) is a common constant and can be dropped
• Interpretation: a threshold decision (two classes)
• [Sketch two normal class-conditional curves to illustrate where the decision threshold falls]
• Next: log transformation…

10. NN Decision Criterion (cont'd)
• The log is a monotonic function, so: log(p(x|C_k) P(C_k)) > log(p(x|C_j) P(C_j))
• i.e., log p(x|C_k) + log P(C_k) > log p(x|C_j) + log P(C_j)
• Rearranging and defining Y(x) as the difference of the two sides:
• Y(x) = log{p(x|C_k) / p(x|C_j)} + log{P(C_k) / P(C_j)} > 0
• This is an alternative form of the decision rule
• Why is it useful? Plug in the class-conditional and prior probabilities to obtain a concrete decision criterion
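A minimal sketch of this log-form decision rule, assuming hypothetical one-dimensional Gaussian class-conditionals; the means, variances, and priors are illustrative only.

```python
import numpy as np

def decide_k_over_j(x, p_x_given_Ck, p_x_given_Cj, P_Ck, P_Cj):
    """True if Y(x) = log(p(x|Ck)/p(x|Cj)) + log(P(Ck)/P(Cj)) > 0."""
    y = np.log(p_x_given_Ck(x) / p_x_given_Cj(x)) + np.log(P_Ck / P_Cj)
    return y > 0

# Hypothetical 1-D Gaussian class-conditional densities.
def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

p_k = lambda x: gauss(x, mu=0.0, var=1.0)
p_j = lambda x: gauss(x, mu=2.0, var=1.0)

print(decide_k_over_j(0.5, p_k, p_j, P_Ck=0.5, P_Cj=0.5))   # True: x = 0.5 lies closer to mean 0
```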

11. Actual Form of the Class-Conditional Probability
• Common choice: the normal/Gaussian distribution N(μ, σ²)
• p(x|C_k) = 1 / sqrt(2π σ_k²) · exp{-(x - μ_k)² / (2σ_k²)}
• Substitute it into the Bayes rule (dropping constant terms):
• Y_k(x) = -(x - μ_k)² / (2σ_k²) - (1/2) ln σ_k² + ln P(C_k)
• For normal class-conditional probabilities the decision criterion is quadratic in x!
• This keeps the analysis simple and gives a justification for the sum-of-squares (SSE) criterion later
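A small sketch of this quadratic discriminant for a scalar input; the class means, variances, and priors below are made up for illustration.

```python
import numpy as np

def gaussian_discriminant(x, mu_k, var_k, prior_k):
    """Y_k(x) = -(x - mu_k)^2 / (2 var_k) - 0.5 * ln(var_k) + ln P(C_k)."""
    return -(x - mu_k) ** 2 / (2 * var_k) - 0.5 * np.log(var_k) + np.log(prior_k)

# Hypothetical two-class comparison at x = 1.2.
y1 = gaussian_discriminant(x=1.2, mu_k=0.0, var_k=1.0, prior_k=0.6)
y2 = gaussian_discriminant(x=1.2, mu_k=3.0, var_k=2.0, prior_k=0.4)
print("Class 1" if y1 > y2 else "Class 2")
```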

12. Advantages of Normal Class-Conditional Probabilities
• Simplicity of treatment
• Quadratic (2nd-order) decision criterion
• Only two parameters: μ and σ
• A good approximation in many real-life problems
• CLT (central limit theorem): for i.i.d. random variables x_i, i = 1, …, n, the sample mean (x_1 + x_2 + … + x_n)/n approaches a normal distribution as n → ∞
• Linear transformations preserve normality; in particular, there is a linear transformation of the variables that diagonalizes the covariance
• Related to the maximum entropy principle
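An illustrative (not from the slides) NumPy experiment for the CLT bullet above: averages of i.i.d. uniform variables become approximately normal, with variance shrinking like 1/n.

```python
import numpy as np

rng = np.random.default_rng(0)

# Average n i.i.d. uniform(0, 1) variables, repeated over many trials.
n, trials = 50, 100_000
samples = rng.uniform(0.0, 1.0, size=(trials, n))
means = samples.mean(axis=1)

# By the CLT the sample mean is roughly N(mu, sigma^2/n) with mu = 0.5, sigma^2 = 1/12.
print(means.mean(), means.var(), (1.0 / 12.0) / n)
```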

13. Covariance Matrix
• Multivariate process X = [x_1, x_2, …, x_d]^T
• Multivariate normal density: p(x) = 1 / {(2π)^(d/2) |Σ|^(1/2)} · exp{-(1/2) (x - μ)^T Σ^(-1) (x - μ)}
• Here μ = [μ_1, μ_2, …, μ_d]^T is the mean vector, with μ_i = E(x_i) = ∫ x_i p(x_i) dx_i over (-∞, +∞)
• Σ is the d×d covariance matrix with elements Σ_ij, i, j = 1, …, d
• Σ_ij = E[(x_i - μ_i)(x_j - μ_j)] = ∫∫ (x_i - μ_i)(x_j - μ_j) p(x_i, x_j) dx_i dx_j
• Diagonalization means the matrix Σ is diagonal: no off-diagonal covariance components
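A minimal sketch of the multivariate normal density above, plus estimation of the mean vector and covariance matrix from data; the samples here are synthetic and purely illustrative.

```python
import numpy as np

def multivariate_normal_pdf(x, mu, Sigma):
    """p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp{-0.5 (x - mu)^T Sigma^(-1) (x - mu)}."""
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

# Synthetic data: estimate mu and Sigma from 1000 two-dimensional samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
mu_hat = X.mean(axis=0)                 # mu_i = E(x_i)
Sigma_hat = np.cov(X, rowvar=False)     # Sigma_ij = E[(x_i - mu_i)(x_j - mu_j)]

print(multivariate_normal_pdf(np.zeros(2), mu_hat, Sigma_hat))
```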
