
Ch. 2: Linear Discriminants. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC 2009. Based on slides from Stephen Marsland, Romain Thibaux (regression slides), and Moshe Sipper. Longin Jan Latecki, Temple University, latecki@temple.edu.


Presentation Transcript


  1. Ch. 2: Linear Discriminants. Stephen Marsland, Machine Learning: An Algorithmic Perspective, CRC 2009. Based on slides from Stephen Marsland, Romain Thibaux (regression slides), and Moshe Sipper. Longin Jan Latecki, Temple University, latecki@temple.edu

  2. McCulloch and Pitts Neurons • Greatly simplified biological neurons • Sum the inputs, weighted by w1, …, wm • If the total is greater than some threshold, the neuron fires • Otherwise it does not [figure: inputs x1, …, xm, weighted sum h, output o] Stephen Marsland

  3. McCulloch and Pitts Neurons • The weight wj can be positive or negative • Inhibitory or excitatory • Use only a linear sum of inputs • Use a simple output instead of a pulse (spike train): o = 1 if h > θ, and o = 0 otherwise, for some threshold θ Stephen Marsland
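The model above amounts to a weighted sum followed by a hard threshold. A minimal Python sketch (the course uses Matlab; the weights and the threshold θ = 0.5 are made-up example values):

```python
def mcp_neuron(inputs, weights, theta):
    """McCulloch-Pitts neuron: weighted sum of inputs, then a hard threshold."""
    h = sum(w * x for w, x in zip(weights, inputs))  # linear sum of inputs
    return 1 if h > theta else 0  # fire (1) only if the sum exceeds the threshold

# Two excitatory inputs with weight 0.4 each, threshold 0.5
print(mcp_neuron([1, 1], [0.4, 0.4], 0.5))  # sum 0.8 > 0.5: fires
print(mcp_neuron([1, 0], [0.4, 0.4], 0.5))  # sum 0.4 <= 0.5: does not fire
```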

  4. Neural Networks • Can put lots of McCulloch & Pitts neurons together • Connect them up in any way we like • In fact, assemblies of the neurons are capable of universal computation • Can perform any computation that a normal computer can • Just have to solve for all the weights wij Stephen Marsland

  5. Training Neurons • Adapting the weights is learning • How does the network know it is right? • How do we adapt the weights to make the network right more often? • Training set with target outputs • Learning rule Stephen Marsland

  6. 2.2 The Perceptron The perceptron is considered the simplest kind of feed-forward neural network. Definition from Wikipedia: the perceptron is a binary classifier which maps its input x (a real-valued vector) to an output value f(x) (a single binary value): f(x) = 1 if w·x + b > 0, and f(x) = 0 otherwise. In order not to write b explicitly, we extend the input vector x by one more dimension that is always set to -1, e.g., x = (-1, x_1, …, x_7) with x_0 = -1, and extend the weight vector to w = (w_0, w_1, …, w_7). Then adjusting w_0 corresponds to adjusting b.

  7. Bias Replaces Threshold [figure: a bias input fixed at -1 alongside the other inputs and the outputs] Stephen Marsland

  8. Perceptron Decision = Recall • Outputs are: y_j = sign( Σ_i w_ij x_i ). For example, y = (y_1, …, y_5) = (1, 0, 0, 1, 1) is a possible output. We may have a different function g in place of sign, as in (2.4) in the book. Stephen Marsland
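Recall with several output nodes can be sketched as a weight matrix with one column per output node (a Python sketch; the weight values below are hypothetical, chosen only for illustration):

```python
def recall(W, x):
    """Perceptron recall: y_j = 1 if sum_i W[i][j] * x[i] > 0, else 0.
    W is an (m+1) x n weight matrix, one column per output node;
    x is the augmented input vector with x[0] = -1 (the bias input)."""
    return [1 if sum(W[i][j] * x[i] for i in range(len(x))) > 0 else 0
            for j in range(len(W[0]))]

# Hypothetical 2-input, 2-output perceptron
W = [[0.5, 0.5],   # bias weights (multiplied by the fixed input -1)
     [1.0, 0.0],
     [0.0, 1.0]]
x = [-1, 1, 0]     # augmented input: bias, x1 = 1, x2 = 0
print(recall(W, x))  # -> [1, 0]
```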

  9. Perceptron Learning = Updating the Weights • We want to change the values of the weights • Aim: minimise the error at the output • If E = t - y, want E to be 0 • Use: w_ij ← w_ij + η (t_j - y_j) x_i, with learning rate η, error (t_j - y_j), and input x_i Stephen Marsland

  10. Example 1: The Logical OR Training table (x_1, x_2 → t): (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→1. Initial values: w_0(0) = -0.05, w_1(0) = -0.02, w_2(0) = 0.02, and η = 0.25. Take the first row of our training table: y_1 = sign( -0.05×(-1) + (-0.02)×0 + 0.02×0 ) = 1 w_0(1) = -0.05 + 0.25×(0-1)×(-1) = 0.2 w_1(1) = -0.02 + 0.25×(0-1)×0 = -0.02 w_2(1) = 0.02 + 0.25×(0-1)×0 = 0.02 We continue with the new weights and the second row, and so on. We make several passes over the training data.
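The worked OR example above can be replicated in code (a Python sketch rather than the course's Matlab, using the same initial weights and learning rate as the slide):

```python
def sign(h):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if h > 0 else 0

def train_perceptron(data, w, eta, epochs):
    """data: list of (augmented inputs, target); each input starts with the bias input -1."""
    for _ in range(epochs):
        for x, t in data:
            y = sign(sum(wi * xi for wi, xi in zip(w, x)))
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]  # w_i <- w_i + eta*(t-y)*x_i
    return w

# Logical OR with the slide's initial weights and learning rate
data = [([-1, 0, 0], 0), ([-1, 0, 1], 1), ([-1, 1, 0], 1), ([-1, 1, 1], 1)]
w = train_perceptron(data, [-0.05, -0.02, 0.02], 0.25, 10)
for x, t in data:
    y = sign(sum(wi * xi for wi, xi in zip(w, x)))
    print(x[1:], y == t)  # all four OR patterns end up classified correctly
```

A single update on the first row reproduces the slide's numbers: w_0 becomes 0.2 while w_1 and w_2 are unchanged.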

  11. Decision boundary for OR perceptron Stephen Marsland

  12. Perceptron Learning Applet • http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html

  13. Example 2: Obstacle Avoidance with the Perceptron [figure: inputs LS, RS (left and right sensors); outputs LM, RM (left and right motors); weights w1–w4; learning rate η = 0.3; bias weight -0.01] Stephen Marsland

  14. Obstacle Avoidance with the Perceptron Stephen Marsland

  15. Obstacle Avoidance with the Perceptron w1 = 0 + 0.3 × (1-1) × 0 = 0 Stephen Marsland

  16. Obstacle Avoidance with the Perceptron w2 = 0 + 0.3 × (1-1) × 0 = 0, and the same for w3, w4 Stephen Marsland

  17. Obstacle Avoidance with the Perceptron Stephen Marsland

  18. Obstacle Avoidance with the Perceptron w1 = 0 + 0.3 × (-1-1) × 0 = 0 Stephen Marsland

  19. Obstacle Avoidance with the Perceptron w1 = 0 + 0.3 × (-1-1) × 0 = 0, w2 = 0 + 0.3 × (1-1) × 0 = 0 Stephen Marsland

  20. Obstacle Avoidance with the Perceptron w1 = 0 + 0.3 × (-1-1) × 0 = 0, w2 = 0 + 0.3 × (1-1) × 0 = 0, w3 = 0 + 0.3 × (-1-1) × 1 = -0.6 Stephen Marsland

  21. Obstacle Avoidance with the Perceptron w1 = 0 + 0.3 × (-1-1) × 0 = 0, w2 = 0 + 0.3 × (1-1) × 0 = 0, w3 = 0 + 0.3 × (-1-1) × 1 = -0.6, w4 = 0 + 0.3 × (1-1) × 1 = 0 Stephen Marsland

  22. Obstacle Avoidance with the Perceptron Stephen Marsland

  23. Obstacle Avoidance with the Perceptron w1 = 0 + 0.3 × (1-1) × 1 = 0, w2 = 0 + 0.3 × (-1-1) × 1 = -0.6, w3 = -0.6 + 0.3 × (1-1) × 0 = -0.6, w4 = 0 + 0.3 × (-1-1) × 0 = 0 Stephen Marsland

  24. Obstacle Avoidance with the Perceptron [figure: final network, with w1 = 0, w2 = -0.6, w3 = -0.6, w4 = 0 and bias weights -0.01 into LM and RM] Stephen Marsland

  25. 2.3 Linear Separability • Outputs are: y = sign(w·x), where w·x = ‖w‖ ‖x‖ cos θ and θ is the angle between vectors x and w. Stephen Marsland

  26. Geometry of Linear Separability The equation of a line is w_0 + w_1·x + w_2·y = 0. It also means that the point (x, y) is on the line. This equation is equivalent to w·x = (w_0, w_1, w_2)·(1, x, y) = 0. If w·x > 0, then the angle between w and x is less than 90 degrees, which means that x lies on the side of the line that w points to. Each output node of the perceptron tries to separate the training data into two classes (fire or no-fire) with a linear decision boundary, i.e., a straight line in 2D, a plane in 3D, and a hyperplane in higher dimensions. Stephen Marsland
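The dot-product test described above can be checked numerically (a small Python sketch; the line and the two points are example values, not from the slides):

```python
import math

def angle_deg(w, x):
    """Angle between vectors w and x, from w.x = |w| |x| cos(theta)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norms = math.sqrt(sum(wi * wi for wi in w)) * math.sqrt(sum(xi * xi for xi in x))
    return math.degrees(math.acos(dot / norms))

# The line -1 + x + y = 0, i.e. w = (w0, w1, w2) = (-1, 1, 1)
w = [-1.0, 1.0, 1.0]
x_pos = [1.0, 2.0, 2.0]   # augmented point (1, x, y) with w.x = 3 > 0
x_neg = [1.0, 0.0, 0.0]   # augmented point with w.x = -1 < 0
print(angle_deg(w, x_pos) < 90)  # True: angle under 90 degrees, same side as w
print(angle_deg(w, x_neg) > 90)  # True: angle over 90 degrees, opposite side
```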

  27. Linear Separability The Binary AND Function Stephen Marsland

  28. Gradient Descent Learning Rule • Consider a linear unit without threshold and with continuous output y (not just -1, 1) • y = w0 + w1 x1 + … + wn xn • Train the wi's such that they minimise the squared error • E[w0,…,wn] = ½ Σ_{d∈D} (td - yd)², where D is the set of training examples

  29. Supervised Learning • Training and test data sets • Training set; input & target

  30. Gradient Descent D = {⟨(1,1),1⟩, ⟨(-1,-1),1⟩, ⟨(1,-1),-1⟩, ⟨(-1,1),-1⟩} Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn] Δw = -η ∇E[w] Δwi = -η ∂E/∂wi ∂E/∂wi = ∂/∂wi ½ Σd (td - yd)² = ∂/∂wi ½ Σd (td - Σi wi xi)² = Σd (td - yd)(-xi) [figure: error surface over the weights, with a step from (w1, w2) to (w1+Δw1, w2+Δw2)]

  31. Gradient Descent Δwi = -η ∂E/∂wi [figure: error curve with steps downhill along the gradient] Stephen Marsland

  32. Incremental Stochastic Gradient Descent • Batch mode: gradient descent w = w - η ∇ED[w] over the entire data D, with ED[w] = ½ Σ_{d∈D} (td - yd)² • Incremental mode: gradient descent w = w - η ∇Ed[w] over individual training examples d, with Ed[w] = ½ (td - yd)² Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is small enough

  33. Gradient Descent Perceptron Learning Gradient-Descent(training_examples, η) Each training example is a pair of the form ⟨(x1,…,xn), t⟩, where (x1,…,xn) is the vector of input values, t is the target output value, and η is the learning rate (e.g. 0.1) • Initialize each wi to some small random value • Until the termination condition is met, Do • For each ⟨(x1,…,xn), t⟩ in training_examples Do • Input the instance (x1,…,xn) to the linear unit and compute the output y • For each linear unit weight wi Do • Δwi = η (t - y) xi • For each linear unit weight wi Do • wi = wi + Δwi
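The pseudocode above can be written out as follows. This is a Python sketch rather than the course's Matlab, and the tiny noise-free dataset fitting y = 1 + 2x is a made-up example, not from the slides:

```python
def gradient_descent(training_examples, eta=0.1, epochs=500):
    """Incremental gradient descent for a linear unit y = w0 + w1*x1 + ... + wn*xn.
    training_examples: list of ((x1, ..., xn), t) pairs; eta is the learning rate."""
    n = len(training_examples[0][0])
    w = [0.0] * (n + 1)                        # w[0] is the bias weight w0
    for _ in range(epochs):
        for x, t in training_examples:
            xs = (1.0,) + tuple(x)             # constant input 1 for the bias weight
            y = sum(wi * xi for wi, xi in zip(w, xs))
            # delta rule: w_i <- w_i + eta * (t - y) * x_i
            w = [wi + eta * (t - y) * xi for wi, xi in zip(w, xs)]
    return w

# Noise-free data generated from y = 1 + 2*x
examples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0)]
w = gradient_descent(examples)
print(w)  # converges towards [1.0, 2.0]
```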

  34. Limitations of the Perceptron The Exclusive Or (XOR) function. Linear Separability Stephen Marsland

  35. Limitations of the Perceptron For XOR we would need w1 > 0 and w2 > 0, yet w1 + w2 < 0 ? Stephen Marsland

  36. Limitations of the Perceptron? [figure: three inputs In1, In2, In3; with a third input the XOR points can become linearly separable in 3D] Stephen Marsland

  37. 2.4 Linear regression [figure: 3D scatter plot of temperature against two input variables] Given examples (x_i, y_i), i = 1, …, n; given a new point x, predict its y

  38. Linear regression [figure: the same temperature data with a fitted plane; the prediction at a new point is read off the plane]

  39. Ordinary Least Squares (OLS) Sum squared error: Σ_i (observation_i - prediction_i)², where each error or “residual” is the vertical distance between an observation and its prediction [figure: fitted line with residuals]

  40. Minimize the sum squared error Sum squared error: Σ_i (y_i - w·x_i)². Setting its partial derivative with respect to each weight to zero gives one linear equation per weight; together these form a linear system in w.

  41. Alternative derivation Stack the n examples as the rows of an n×d matrix X; then minimizing ‖Xw - y‖² leads to the normal equations XᵀX w = Xᵀy. Solve the system (it's better not to invert the matrix)

  42. Beyond lines and planes [figure: quadratic curve fitted to 1D data] Fit y = Σ_j w_j φ_j(x): still linear in the weights w, so everything is the same with X_ij = φ_j(x_i)
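As a small sketch with numpy (the quadratic data are invented for illustration), fitting the basis functions 1, x, x² is the same least-squares problem with X_ij = φ_j(x_i):

```python
import numpy as np

# Noise-free data generated from y = 1 - 2*x + 0.5*x^2, i.e. w = (1, -2, 0.5)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 - 2.0 * x + 0.5 * x**2

# Design matrix of basis functions phi(x) = (1, x, x^2)
Phi = np.column_stack([np.ones_like(x), x, x**2])

# Same normal equations as before, just with Phi in place of X
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)  # recovers approximately [ 1.  -2.   0.5]
```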

  43. Geometric interpretation [figure: 3D plot] [Matlab demo]

  44. Ordinary Least Squares [summary] Given examples (x_i, y_i), i = 1, …, n, with x_i ∈ R^d. Let X be the n×d matrix whose rows are the x_i, and y the vector of targets. Minimize ‖Xw - y‖² by solving the linear system XᵀX w = Xᵀy. Predict ŷ = w·x for a new point x
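The summary above maps directly onto a few lines of numpy (a sketch; the data matrix and targets are made-up values, generated exactly from w = (1, 2, -1) so the recovered weights can be checked):

```python
import numpy as np

X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])    # first column of ones plays the role of the bias
y = np.array([1.0, 3.0, 0.0, 2.0])  # generated from w = (1, 2, -1)

# Normal equations X^T X w = X^T y; solve rather than invert, as the slide advises
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)  # -> approximately [ 1.  2. -1.]

x_new = np.array([1.0, 2.0, 1.0])
print(x_new @ w)  # prediction: 1 + 2*2 - 1 = 4
```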

  45. Probabilistic interpretation [figure: Gaussian noise distributions around the regression line] Likelihood: if y_i = w·x_i + ε_i with Gaussian noise ε_i ~ N(0, σ²), then maximizing the likelihood of the data is equivalent to minimizing the sum squared error

  46. Summary • Perceptron and regression optimize the same target function • In both cases we compute the gradient (the vector of partial derivatives) • In the case of regression, we set the gradient to zero and solve for the vector w. As the solution we obtain a closed formula for w such that the target function attains its global minimum. • In the case of the perceptron, we iteratively move toward the minimum by going in the direction of minus the gradient. We do this incrementally, making small steps for each data point.

  47. Homework 1 • (Ch. 2.3.3) Implement the perceptron in Matlab and test it on the Pima Indians Diabetes dataset from the UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/ • (Ch. 2.4.1) Implement linear regression in Matlab and apply it to the auto-mpg dataset.

  48. From Ch. 3: Testing • How do we evaluate our trained network? • Can’t just compute the error on the training data - unfair, can’t see overfitting • Keep a separate testing set • After training, evaluate on this test set • How do we check for overfitting? • Can’t use training or testing sets Stephen Marsland

  49. Validation • Keep a third set of data for this • Train the network on the training data • Periodically, stop and evaluate on the validation set • After training has finished, test on the test set • This is getting expensive on data! Stephen Marsland
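The three-way split above can be sketched as follows (a minimal Python sketch; the 60/20/20 proportions are an assumed example, not prescribed by the slides):

```python
import random

def split_data(data, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle and split data into training, validation, and test sets."""
    data = list(data)
    random.Random(seed).shuffle(data)          # fixed seed for reproducibility
    n_train = int(train_frac * len(data))
    n_val = int(val_frac * len(data))
    return (data[:n_train],                    # train on this
            data[n_train:n_train + n_val],     # periodically evaluate on this
            data[n_train + n_val:])            # touch this only once, at the end

train, val, test = split_data(range(100))
print(len(train), len(val), len(test))  # -> 60 20 20
```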
