Artificial Neural Networks

Artificial Neural Networks. Dr. Lahouari Ghouti, Information & Computer Science Department. Single-Layer Perceptron (SLP). Architecture. We consider the following architecture: a feed-forward neural network with one layer.

Presentation Transcript


  1. Artificial Neural Networks
     Dr. Lahouari Ghouti, Information & Computer Science Department

  2. Single-Layer Perceptron (SLP)

  3. Architecture
     • We consider the following architecture: a feed-forward neural network with one layer
     • It is sufficient to study single-layer perceptrons with just one neuron, because each output neuron in a single layer has its own weights and can be analyzed independently of the others

  4. Perceptron: Neuron Model
     • Uses a non-linear (McCulloch-Pitts) model of a neuron with inputs $x_1, \dots, x_m$, weights $w_1, \dots, w_m$, and bias $b$: $z = \sum_{i=1}^{m} w_i x_i + b$, $y = g(z)$
     • $g$ is the sign function: $g(z) = \mathrm{sign}(z) = +1$ if $z \ge 0$, and $-1$ if $z < 0$
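
As a minimal sketch of the neuron model on this slide, the following Python computes the weighted sum $z$ and the output $y = g(z)$ for one neuron. The function names `sign` and `neuron_output` and the sample numbers are illustrative assumptions, not part of the slides.

```python
def sign(z):
    """Sign activation from the slide: +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def neuron_output(x, w, b):
    """Weighted sum of the inputs plus the bias, passed through sign."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sign(z)

# Hypothetical numbers chosen only for illustration.
y = neuron_output(x=[1.0, -2.0], w=[0.5, 0.25], b=0.1)
print(y)  # 1, because z = 0.5 - 0.5 + 0.1 = 0.1 >= 0
```

Note that an input lying exactly on the boundary ($z = 0$) is mapped to +1 under this definition of the sign function.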

  5. Perceptron: Applications
     • The perceptron is used for classification: it must correctly classify a set of examples into one of two classes, $C_1$ or $C_2$
     • If the output of the perceptron is +1, the input is assigned to class $C_1$
     • If the output is -1, the input is assigned to class $C_2$

  6. Perceptron: Classification
     • The equation $w_1 x_1 + w_2 x_2 + b = 0$ describes a hyperplane in the input space. This hyperplane is the decision boundary used to separate the two classes $C_1$ and $C_2$
     • Decision region for $C_1$: $w_1 x_1 + w_2 x_2 + b > 0$
     • Decision region for $C_2$: $w_1 x_1 + w_2 x_2 + b \le 0$
     [Figure: the $(x_1, x_2)$ plane with the decision boundary and the decision regions for $C_1$ and $C_2$]

  7. Perceptron: Limitations
     • The perceptron can only model linearly separable functions.
     • The perceptron can be used to model the following Boolean functions:
       • AND
       • OR
       • COMPLEMENT
     • But it cannot model XOR. Why?
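
To make the Boolean cases concrete, here is a small sketch with hand-picked weights that realize AND, OR, and COMPLEMENT using a single sign-activation neuron. The 0/1 input encoding, the +1/-1 output encoding (true = +1), and the specific weight values are assumptions chosen for illustration; many other weight choices work equally well.

```python
def perceptron(weights, bias, inputs):
    """Single neuron with sign activation: +1 if w.x + b >= 0, else -1."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z >= 0 else -1

# Illustrative weights; inputs are 0/1, outputs are +1 (true) or -1 (false).
def AND(a, b):
    return perceptron([1, 1], -1.5, [a, b])

def OR(a, b):
    return perceptron([1, 1], -0.5, [a, b])

def COMPLEMENT(a):
    return perceptron([-1], 0.5, [a])

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
```

Each function corresponds to one line in the input plane: for AND the line $x_1 + x_2 = 1.5$ isolates the point (1, 1), and for OR the line $x_1 + x_2 = 0.5$ isolates the point (0, 0).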

  8. Perceptron: Limitations (Cont’d)
     • XOR is not a linearly separable problem
     • It is impossible to separate the classes $C_1$ and $C_2$ with only one line
     [Figure: the XOR input points in the $(x_1, x_2)$ plane, labeled $C_1$ and $C_2$]
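
A short argument for why no single line works, sketched under the assumption that the inputs are encoded as 0/1, that the points (0,1) and (1,0) belong to $C_1$ (output +1, so $z \ge 0$), and that (0,0) and (1,1) belong to $C_2$ (output -1, so $z < 0$). Any weights $w_1, w_2$ and bias $b$ would have to satisfy all four conditions at once:

```latex
\begin{align*}
(0,0) \in C_2 &: \quad b < 0 \\
(1,1) \in C_2 &: \quad w_1 + w_2 + b < 0 \\
(0,1) \in C_1 &: \quad w_2 + b \ge 0 \\
(1,0) \in C_1 &: \quad w_1 + b \ge 0 \\
\text{Adding the last two:} &\quad w_1 + w_2 + 2b \ge 0
  \;\Rightarrow\; w_1 + w_2 + b \ge -b > 0
\end{align*}
```

The final inequality contradicts the second condition, so no such $w_1$, $w_2$, $b$ exist.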

  9. Perceptron: Learning Algorithm
     • Variables and parameters:
       $\mathbf{x}(n)$ = input vector = $[+1, x_1(n), x_2(n), \dots, x_m(n)]^T$
       $\mathbf{w}(n)$ = weight vector = $[b(n), w_1(n), w_2(n), \dots, w_m(n)]^T$
       $b(n)$ = bias
       $y(n)$ = actual response
       $d(n)$ = desired response
       $\eta$ = learning rate parameter (more elaboration later)

  10. The Fixed-Increment Learning Algorithm
      • Initialization: set $\mathbf{w}(0) = \mathbf{0}$
      • Activation: activate the perceptron by applying an input example (vector $\mathbf{x}(n)$ and desired response $d(n)$)
      • Compute the actual response of the perceptron: $y(n) = \mathrm{sgn}[\mathbf{w}^T(n)\mathbf{x}(n)]$
      • Adapt the weight vector: if $d(n)$ and $y(n)$ are different, then $\mathbf{w}(n+1) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n)$, where $d(n) = +1$ if $\mathbf{x}(n) \in C_1$ and $d(n) = -1$ if $\mathbf{x}(n) \in C_2$
      • Continuation: increment the time index $n$ by 1 and go to the Activation step
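
The steps on this slide can be sketched in Python roughly as follows. The function name `train_perceptron`, the default learning rate, the epoch cap `max_epochs`, and the stopping test (one full pass with no misclassification) are assumptions added for illustration; the slide itself simply keeps incrementing $n$.

```python
def sgn(v):
    """Signum as used in the slides: +1 if v >= 0, else -1."""
    return 1 if v >= 0 else -1

def train_perceptron(samples, eta=0.5, max_epochs=100):
    """samples: list of (features, d) pairs with d in {+1, -1}.

    Returns the weight vector w = [b, w1, ..., wm].
    """
    m = len(samples[0][0])
    w = [0.0] * (m + 1)                      # Initialization: w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for features, d in samples:          # Activation
            x = [1.0] + list(features)       # leading +1 multiplies the bias
            y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
            if y != d:                       # Adaptation
                w = [wi + eta * (d - y) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:                      # assumed stopping test
            break
    return w

# Usage on the OR function (0/1 inputs, class +1 when the result is true).
or_samples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
w = train_perceptron(or_samples)
print(w)
```

Because OR is linearly separable, the loop stops after a few passes; on a non-separable set such as XOR it would run until `max_epochs` is reached.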

  11. A Learning Example
      • Consider a training set $C_1 \cup C_2$, where:
        • $C_1 = \{(1,1), (1,-1), (0,-1)\}$: elements of class 1
        • $C_2 = \{(-1,-1), (-1,1), (0,1)\}$: elements of class -1
      • Use the perceptron learning algorithm to classify these examples.
      • $\mathbf{w}(0) = [1, 0, 0]^T$, $\eta = 1$

  12. A Learning Example (Cont’d)
      • Decision boundary: $2x_1 - x_2 = 0$
      [Figure: the training points in the $(x_1, x_2)$ plane, marked + and -, with the regions for $C_1$ and $C_2$ on either side of the decision boundary]
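
The example can be traced with a short script. This is a sketch under stated assumptions: the samples are presented cyclically in the order listed (all of $C_1$, then all of $C_2$), the initial weights are $[1, 0, 0]^T$ with a learning rate of 1, and a misclassified sample is corrected by adding or subtracting $\eta\,\mathbf{x}(n)$ (the form given on the error-correction slide). Under those assumptions the run ends at the boundary $2x_1 - x_2 = 0$.

```python
def sgn(v):
    return 1 if v >= 0 else -1

C1 = [(1, 1), (1, -1), (0, -1)]     # desired response +1
C2 = [(-1, -1), (-1, 1), (0, 1)]    # desired response -1
samples = [(p, 1) for p in C1] + [(p, -1) for p in C2]

w = [1, 0, 0]    # [b, w1, w2], the initial weights from the slide
eta = 1

converged = False
while not converged:
    converged = True
    for (x1, x2), d in samples:
        x = [1, x1, x2]
        y = sgn(sum(wi * xi for wi, xi in zip(w, x)))
        if y != d:
            # add eta*x for a missed C1 sample, subtract it for a missed C2 sample
            w = [wi + eta * d * xi for wi, xi in zip(w, x)]
            converged = False

print(w)  # [0, 2, -1], i.e. 2*x1 - x2 = 0
```

Only three corrections occur: two in the first pass (at (-1,-1) and (-1,1)) and one in the second pass (at (0,-1)). A different presentation order or step size can end at a different, equally valid separating line.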

  13. The Learning Algorithm: Convergence
      • Let $X$ be the set of training samples, with $n$ indexing the samples
      • $X_1$ = set of training samples belonging to class $C_1$
      • $X_2$ = set of training samples belonging to class $C_2$
      • For a given sample $n$:
        $\mathbf{x}(n) = [+1, x_1(n), \dots, x_p(n)]^T$ = input vector
        $\mathbf{w}(n) = [b(n), w_1(n), \dots, w_p(n)]^T$ = weight vector
        Net activity level: $v(n) = \mathbf{w}^T(n)\mathbf{x}(n)$
        Output: $y(n) = +1$ if $v(n) \ge 0$, and $-1$ if $v(n) < 0$

  14. The Learning Algorithm: Convergence (Cont’d)
      • The decision hyperplane separates classes $C_1$ and $C_2$
      • If the two classes $C_1$ and $C_2$ are linearly separable, then there exists a weight vector $\mathbf{w}$ such that
        $\mathbf{w}^T\mathbf{x} \ge 0$ for all $\mathbf{x}$ belonging to class $C_1$
        $\mathbf{w}^T\mathbf{x} < 0$ for all $\mathbf{x}$ belonging to class $C_2$

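  15. Error-Correction Learning
      • Update rule: $\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n)$
      • Learning process
        • If $\mathbf{x}(n)$ is correctly classified by $\mathbf{w}(n)$, then $\mathbf{w}(n+1) = \mathbf{w}(n)$
        • Otherwise, the weight vector is updated as follows:
          $\mathbf{w}(n+1) = \mathbf{w}(n) - \eta(n)\mathbf{x}(n)$ if $\mathbf{w}^T(n)\mathbf{x}(n) \ge 0$ and $\mathbf{x}(n)$ belongs to $C_2$
          $\mathbf{w}(n+1) = \mathbf{w}(n) + \eta(n)\mathbf{x}(n)$ if $\mathbf{w}^T(n)\mathbf{x}(n) < 0$ and $\mathbf{x}(n)$ belongs to $C_1$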

  16. Perceptron Convergence Algorithm
      • Variables and parameters
        • $\mathbf{x}(n) = [+1, x_1(n), \dots, x_p(n)]^T$; $\mathbf{w}(n) = [b(n), w_1(n), \dots, w_p(n)]^T$
        • $y(n)$ = actual response (output); $d(n)$ = desired response
        • $\eta$ = learning rate, a positive number less than 1
      • Step 1: Initialization
        • Set $\mathbf{w}(0) = \mathbf{0}$, then do the following for $n = 1, 2, 3, \dots$
      • Step 2: Activation
        • Activate the perceptron by applying the input vector $\mathbf{x}(n)$ and the desired output $d(n)$

  17. Perceptron Convergence Algorithm (Cont’d)
      • Step 3: Computation of actual response
        • $y(n) = \mathrm{sgn}[\mathbf{w}^T(n)\mathbf{x}(n)]$, where $\mathrm{sgn}(\cdot)$ is the signum function
      • Step 4: Adaptation of weight vector
        • $\mathbf{w}(n+1) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n)$, where $d(n) = +1$ if $\mathbf{x}(n)$ belongs to $C_1$ and $d(n) = -1$ if $\mathbf{x}(n)$ belongs to $C_2$
      • Step 5
        • Increment $n$ by 1, and go back to step 2
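
To see how the Step 4 update relates to the error-correction rule, it helps to enumerate the possible values of $d(n) - y(n)$. Since both $d(n)$ and $y(n)$ take values in $\{+1, -1\}$, only three cases arise:

```latex
\begin{align*}
d(n) = y(n) &: \quad \eta[d(n) - y(n)]\mathbf{x}(n) = \mathbf{0} \\
d(n) = +1,\ y(n) = -1 &: \quad \eta[d(n) - y(n)]\mathbf{x}(n) = +2\eta\,\mathbf{x}(n) \\
d(n) = -1,\ y(n) = +1 &: \quad \eta[d(n) - y(n)]\mathbf{x}(n) = -2\eta\,\mathbf{x}(n)
\end{align*}
```

So a correctly classified sample leaves the weights unchanged, a missed $C_1$ sample adds a multiple of $\mathbf{x}(n)$, and a missed $C_2$ sample subtracts it. This is the error-correction rule with a constant step $\eta(n) = 2\eta$.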

  18. Learning: Performance Measure
      • A learning rule is designed to optimize a performance measure
      • However, in the development of the perceptron convergence algorithm we did not mention a performance measure
      • Intuitively, what would be an appropriate performance measure for a classification neural network?
      • Define the performance measure: $J = -E[e(n)v(n)]$

  19. Learning: Performance Measure
      • Or, as an instantaneous estimate: $J'(n) = -e(n)v(n)$
      • $e(n) = d(n) - y(n)$ = error at iteration $n$
      • $v(n)$ = linear combiner output at iteration $n$
      • $E[\cdot]$ = expectation operator

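  20. Learning: Performance Measure (Cont’d)
      • Can we derive our learning rule by minimizing this performance function [Haykin’s textbook]? Take a gradient step on the instantaneous estimate: $\mathbf{w}(n+1) = \mathbf{w}(n) - \eta\,\partial J'(n)/\partial\mathbf{w}(n)$
      • Now $v(n) = \mathbf{w}^T(n)\mathbf{x}(n)$, thus, treating $e(n)$ as fixed, $\partial J'(n)/\partial\mathbf{w}(n) = -e(n)\mathbf{x}(n)$
      • Learning rule: $\mathbf{w}(n+1) = \mathbf{w}(n) + \eta\,e(n)\mathbf{x}(n) = \mathbf{w}(n) + \eta[d(n) - y(n)]\mathbf{x}(n)$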

  21. Presentation of Training Examples
      • Presenting all training examples once to the ANN is called an epoch.
      • In incremental (stochastic) gradient descent, training examples can be presented in:
        • Fixed order (1, 2, 3, …, M)
        • Randomly permuted order (5, 2, 7, …, 3)
        • Completely random order, with repetition (4, 1, 7, 1, 5, 4, …)
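
The three presentation schemes can be sketched with Python's standard `random` module. The function names, the seed, and the choice of M = 7 are illustrative assumptions; examples are numbered 1 to M as on the slide.

```python
import random

def fixed_order(m):
    """Same order every epoch: 1, 2, ..., M."""
    return list(range(1, m + 1))

def permuted_order(m, rng):
    """Each example exactly once per epoch, in a shuffled order."""
    order = list(range(1, m + 1))
    rng.shuffle(order)
    return order

def completely_random(m, rng):
    """M independent draws; an example may repeat or be skipped."""
    return [rng.randint(1, m) for _ in range(m)]

rng = random.Random(0)   # seeded only to make runs repeatable
M = 7
print(fixed_order(M))
print(permuted_order(M, rng))
print(completely_random(M, rng))
```

Only the first two schemes guarantee that every example is seen once per epoch; with completely random draws, an "epoch" of M presentations is an epoch only on average.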

  22. Concluding Remarks
      • A single-layer perceptron can perform pattern classification only on linearly separable patterns, regardless of the type of nonlinearity (hard limiter, sigmoidal)
      • Minsky and Papert in 1969 elucidated the limitations of Rosenblatt’s single-layer perceptron (e.g. the requirement of linear separability, the inability to solve the XOR problem) and cast doubt on the viability of neural networks
      • However, the multilayer perceptron and the back-propagation algorithm overcome many of the shortcomings of the single-layer perceptron

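  23. Adaline: Adaptive Linear Element
      • The output $y$ is a linear combination of the input $\mathbf{x}$: $y = \sum_{i=1}^{m} w_i x_i$
      [Figure: inputs $x_1, \dots, x_m$ with weights $w_1, \dots, w_m$ feeding a summation unit that produces $y$]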

  24. Adaline: Adaptive Linear Element (Cont’d)
      • Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm
      • The idea: try to minimize the squared error, which is a function of the weights
      • We can find the minimum of the error function $E$ by means of the steepest descent method (an optimization procedure)

  25. Steepest Descent Method: Basics
      • Start with an arbitrary point
      • Find a direction in which $E$ is decreasing most rapidly
      • Make a small step in that direction
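
A minimal sketch of these three steps, assuming a hypothetical quadratic error surface $E(w_1, w_2) = (w_1 - 1)^2 + 2(w_2 + 2)^2$ chosen only because its minimum, at (1, -2), is known in advance. The step size `eta` and the number of steps are likewise assumptions.

```python
def grad_E(w):
    """Gradient of the assumed surface E = (w1 - 1)^2 + 2*(w2 + 2)^2."""
    w1, w2 = w
    return [2 * (w1 - 1), 4 * (w2 + 2)]

def steepest_descent(grad, w0, eta=0.1, steps=200):
    w = list(w0)                       # start with an arbitrary point
    for _ in range(steps):
        g = grad(w)                    # E decreases fastest along -g
        w = [wi - eta * gi for wi, gi in zip(w, g)]   # small step
    return w

w_min = steepest_descent(grad_E, [0.0, 0.0])
print(w_min)  # close to [1, -2]
```

On this surface every step shrinks the distance to the minimum by a fixed factor per coordinate, so the iterates converge; with a step size that is too large, the same loop would overshoot and diverge.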

  26. Steepest Descent Method: Basics (Cont’d)
      [Figure: one descent step from $(w_1, w_2)$ to $(w_1 + \Delta w_1, w_2 + \Delta w_2)$]

  27. Steepest Descent Method: Basics (Cont’d)
      [Figure: error surface with the labels "gradient?", "global min", and "local min"]

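  28. Least-Mean-Square Algorithm (Widrow-Hoff Algorithm)
      • Approximation of gradient(E): using the instantaneous squared error $E(n) = \frac{1}{2}e^2(n)$ with $e(n) = d(n) - \hat{\mathbf{w}}^T(n)\mathbf{x}(n)$, the gradient is estimated as $\partial E(n)/\partial\hat{\mathbf{w}}(n) = -\mathbf{x}(n)e(n)$
      • Update rule for the weights becomes: $\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)e(n)$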

  29. Summary of LMS Algorithm
      • Training sample: input signal vector $\mathbf{x}(n)$, desired response $d(n)$
      • User-selected parameter: $\eta > 0$
      • Initialization: set $\hat{\mathbf{w}}(1) = \mathbf{0}$
      • Computation: for $n = 1, 2, \dots$ compute
        $e(n) = d(n) - \hat{\mathbf{w}}^T(n)\mathbf{x}(n)$
        $\hat{\mathbf{w}}(n+1) = \hat{\mathbf{w}}(n) + \eta\,\mathbf{x}(n)e(n)$
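
The summary translates almost line for line into code. The sketch below assumes a small noise-free data set generated from hypothetical target weights [2, -1], a learning rate of 0.1, and repeated passes over the same samples (the slide simply runs over $n = 1, 2, \dots$); the name `lms` is also an assumption.

```python
def lms(samples, eta=0.1, epochs=50):
    """samples: list of (x, d) pairs; returns the weight estimate."""
    m = len(samples[0][0])
    w = [0.0] * m                                  # initialization: zero weights
    for _ in range(epochs):
        for x, d in samples:
            e = d - sum(wi * xi for wi, xi in zip(w, x))       # e(n)
            w = [wi + eta * xi * e for wi, xi in zip(w, x)]    # w(n+1)
    return w

# Hypothetical data: d = 2*x1 - 1*x2, with no noise.
inputs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, -1.0)]
samples = [(x, 2.0 * x[0] - 1.0 * x[1]) for x in inputs]

w_hat = lms(samples)
print(w_hat)  # approaches [2, -1]
```

Unlike the perceptron rule, the error here is computed from the linear output before any thresholding, so the weights keep adjusting until the linear fit itself is accurate.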

  30. Neuron with Sigmoid Function
      [Figure: inputs $x_1, \dots, x_m$ with weights $w_1, \dots, w_m$ feeding a summation unit, followed by a sigmoid activation that produces the output $y$]

  31. Multi-Layer Neural Networks
      [Figure: a network with an input layer, a hidden layer, and an output layer]

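  32. Backpropagation Principle
      • Forward step: propagate activation from the input layer to the output layer
      • Backward step: propagate errors from the output layer to the hidden layer
      [Figure: input $x_i$ connected through weight $w_{ki}$ to hidden unit $x_k$ (error $\delta_k$), which is connected through weight $w_{jk}$ to output $y_j$ (error $\delta_j$)]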

  33. Backpropagation Algorithm
      • Initialize each $w_i$ to some small random value
      • Until the termination condition is met, Do
        • For each training example $\langle (x_1, \dots, x_n), t \rangle$ Do
          • Input the instance $(x_1, \dots, x_n)$ to the network and compute the network outputs $y_k$
          • For each output unit $k$: $\delta_k = y_k(1 - y_k)(t_k - y_k)$
          • For each hidden unit $h$: $\delta_h = y_h(1 - y_h)\sum_k w_{h,k}\,\delta_k$
          • For each network weight $w_{i,j}$ Do: $w_{i,j} = w_{i,j} + \Delta w_{i,j}$, where $\Delta w_{i,j} = \eta\,\delta_j\,x_{i,j}$
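
The following is a sketch of one way to code this procedure for a network with a single hidden layer of sigmoid units, using only the standard library. Several details are assumptions not stated on the slide: each unit carries a bias weight (stored at index 0), initial weights are drawn uniformly from [-0.05, 0.05], and the network size, learning rate, seed, epoch count, and the XOR training data are illustrative choices.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_network(n_in, n_hidden, n_out, rng):
    """Each unit is a weight list whose first entry is a bias weight."""
    def small():
        return rng.uniform(-0.05, 0.05)
    hidden = [[small() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output = [[small() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    return hidden, output

def forward(net, x):
    hidden, output = net
    xh = [1.0] + list(x)                    # inputs to the hidden units
    yh = [sigmoid(sum(w * v for w, v in zip(u, xh))) for u in hidden]
    xo = [1.0] + yh                         # inputs to the output units
    yk = [sigmoid(sum(w * v for w, v in zip(u, xo))) for u in output]
    return xh, yh, xo, yk

def train_step(net, x, t, eta):
    hidden, output = net
    xh, yh, xo, yk = forward(net, x)
    # delta for each output unit k
    delta_k = [y * (1 - y) * (tk - y) for y, tk in zip(yk, t)]
    # delta for each hidden unit h; output[k][h + 1] is the weight from h to k
    delta_h = []
    for h in range(len(hidden)):
        back = sum(output[k][h + 1] * delta_k[k] for k in range(len(output)))
        delta_h.append(yh[h] * (1 - yh[h]) * back)
    # weight updates: w += eta * delta_j * (input carried by that weight)
    for k, unit in enumerate(output):
        for i in range(len(unit)):
            unit[i] += eta * delta_k[k] * xo[i]
    for h, unit in enumerate(hidden):
        for i in range(len(unit)):
            unit[i] += eta * delta_h[h] * xh[i]

def sum_squared_error(net, data):
    return sum((tk - yk) ** 2
               for x, t in data
               for tk, yk in zip(t, forward(net, x)[3]))

xor_data = [((0, 0), [0]), ((0, 1), [1]), ((1, 0), [1]), ((1, 1), [0])]
net = init_network(2, 3, 1, random.Random(0))
print("error before:", sum_squared_error(net, xor_data))
for _ in range(5000):                       # assumed termination condition
    for x, t in xor_data:
        train_step(net, x, t, eta=0.5)
print("error after:", sum_squared_error(net, xor_data))
```

All deltas are computed before any weight is changed, so the hidden-unit deltas use the output weights from the forward pass. How far the error falls depends on the initial weights and the number of epochs, so the printed values are best read as a diagnostic rather than a guarantee of convergence.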

  34. Backpropagation Algorithm (Cont’d)
      • Gradient descent over the entire network weight vector
      • Easily generalized to arbitrary directed graphs
      • Will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights)
      • Often includes a weight momentum term with coefficient $\alpha$: $\Delta w_{i,j}(n) = \eta\,\delta_j\,x_{i,j} + \alpha\,\Delta w_{i,j}(n-1)$
      • Minimizes error over the training examples
      • Will it generalize well to unseen instances (over-fitting)?
      • Training can be slow: typically 1000-10000 iterations (Levenberg-Marquardt can be used instead of gradient descent)
      • Using the network after training is fast

  35. Convergence of Backpropagation
      Gradient descent to some local minimum, perhaps not the global minimum
      • Add momentum term: $\Delta w_{ki}(n) = \alpha\,\delta_k(n)\,x_i(n) + \lambda\,\Delta w_{ki}(n-1)$ with $\lambda \in [0, 1]$ (here $\alpha$ is the learning rate and $\lambda$ the momentum coefficient)
      • Stochastic gradient descent
      • Train multiple nets with different initial weights
      Nature of convergence
      • Initialize weights near zero
      • Therefore, initial networks are near-linear
      • Increasingly non-linear functions become possible as training progresses

  36. Optimization Methods
      • There are other, more efficient (faster-converging) optimization methods than gradient descent:
        • Newton’s method uses a quadratic approximation (2nd-order Taylor expansion): $F(x + \Delta x) = F(x) + \nabla F(x)^T \Delta x + \frac{1}{2}\Delta x^T \nabla^2 F(x)\,\Delta x + \dots$
        • Conjugate gradients
        • Levenberg-Marquardt algorithm
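
For a one-dimensional function, minimizing the quadratic approximation gives the Newton step $\Delta x = -F'(x)/F''(x)$. The sketch below applies that step repeatedly; the function name, the stopping tolerance, and both test functions are assumptions chosen for illustration, and the code presumes $F''(x) > 0$ along the way.

```python
import math

def newton_minimize(dF, d2F, x0, steps=10, tol=1e-12):
    """Repeat the Newton step dx = -F'(x) / F''(x) from x0."""
    x = x0
    for _ in range(steps):
        step = -dF(x) / d2F(x)
        x += step
        if abs(step) < tol:
            break
    return x

# F(x) = (x - 2)^2 + 1 is exactly quadratic, so one step reaches the minimum.
x_quad = newton_minimize(lambda x: 2 * (x - 2), lambda x: 2.0, x0=10.0)

# F(x) = exp(x) - 2x is not quadratic; its minimum is at x = ln 2.
x_exp = newton_minimize(lambda x: math.exp(x) - 2, lambda x: math.exp(x), x0=0.0)

print(x_quad, x_exp)
```

The quadratic case shows why the method can be much faster than gradient descent: when the second-order model is exact, a single step suffices, whereas gradient descent needs many small steps on the same function.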

  37. Universal Approximation Property of ANN
      Boolean functions:
      • Every Boolean function can be represented by a network with a single hidden layer
      • But this might require a number of hidden units that is exponential in the number of inputs
      Continuous functions:
      • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]
      • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
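
As a small instance of the Boolean case, XOR can be represented with one hidden layer of two sign-activation units. The weights below are hand-picked assumptions (one hidden unit computes OR, the other NAND, and the output unit fires only when both are +1), with 0/1 inputs and +1/-1 unit outputs.

```python
def sgn(z):
    return 1 if z >= 0 else -1

def xor_network(x1, x2):
    """Two hidden sign units followed by one output sign unit."""
    h_or = sgn(x1 + x2 - 0.5)        # +1 unless both inputs are 0
    h_nand = sgn(-x1 - x2 + 1.5)     # +1 unless both inputs are 1
    return sgn(h_or + h_nand - 1.5)  # +1 only if both hidden units are +1

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

Each hidden unit draws one line in the input plane, and the output unit combines the two half-planes, which is exactly what a single neuron cannot do on its own.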

  38. Using Weight Derivatives
      • How often to update
        • After each training case?
        • After a full sweep through the training data?
      • How much to update
        • Use a fixed learning rate?
        • Adapt the learning rate?
        • Add momentum?
        • Don’t use steepest descent?

  39. What Next?
      • Bias Effect
      • Batch vs. Continuous Learning
      • Variable Learning Rate (Update Rule?)
      • Effect of Neurons/Layer
      • Effect of Hidden Layers
