
Presentation Transcript


  1. Classification: Numerical Classifiers. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Presented by: Ding-Ying Chiu. Date: 2008/10/17

  2. Introduction: Applications
  • Training data are used to build a classifier.
  • Applications of classification: handwritten digit detection, human face recognition.

  3. Training Data: One Dimension
  • Training data: a training datum consists of an input vector and a class label.
  • Feature: a feature denotes a piece of information about the objects. Each feature is a dimension.
  [Figure: one-dimensional training data on a Math-score axis (20, 45, 60, 80, 85, 95), each point labeled appropriate or inappropriate]

  4. Two Steps (One Dimension)
  • Training step: based on the training data, we compute a boundary that separates the two classes.
  • Classifying step: a test datum has only the input vector; its class label is decided by the learned boundary.
  [Figure: the same Math-score axis with a test point at 65 classified by the boundary]
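A minimal Python sketch of these two steps on the one-dimensional Math-score data (the midpoint boundary rule below is an illustrative assumption, not a method prescribed by the slides):

```python
# Sketch of the training and classifying steps for one-dimensional data.
# The midpoint boundary rule is an illustrative assumption.

def train_boundary(training_data):
    """Training step: compute a threshold separating the two classes."""
    inappropriate = [x for x, label in training_data if label == "inappropriate"]
    appropriate = [x for x, label in training_data if label == "appropriate"]
    # Place the boundary halfway between the classes (assumes they are separable
    # and that "appropriate" scores are the larger ones).
    return (max(inappropriate) + min(appropriate)) / 2.0

def classify(x, boundary):
    """Classifying step: a test datum has only the input value."""
    return "appropriate" if x >= boundary else "inappropriate"

training_data = [(20, "inappropriate"), (45, "inappropriate"), (60, "inappropriate"),
                 (80, "appropriate"), (85, "appropriate"), (95, "appropriate")]
boundary = train_boundary(training_data)
print(boundary)                # 70.0
print(classify(65, boundary))  # "inappropriate", matching the figure on this slide
```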

  5. Preliminary: Linear Programming
  • From the senior-high-school math course, for the line X + Y - 8 = 0 (through (0, 8) and (8, 0)):
  • If substituting a point into the function gives zero, the point lies on the line (e.g., A = (4, 4)).
  • If the result is greater than zero, the point lies above the line (e.g., B = (6, 7)).
  • If the result is less than zero, the point lies below the line (e.g., C = (2, 2)).
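A short sketch of this sign test for the line X + Y - 8 = 0, using the points A, B, C from the slide:

```python
# Which side of the line X + Y - 8 = 0 does a point lie on?
def side_of_line(x, y):
    value = x + y - 8
    if value == 0:
        return "on the line"
    return "above the line" if value > 0 else "below the line"

for name, (x, y) in [("A", (4, 4)), ("B", (6, 7)), ("C", (2, 2))]:
    print(name, side_of_line(x, y))
# A on the line, B above the line, C below the line
```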

  6. Two Dimensions: Simple Case (Linear Programming)
  • With two features (Math and English scores), the training points can be separated by a single line a1·X + a2·Y + a3 = 0.
  [Figure: scatter plot of (Math, English) scores labeled appropriate or inappropriate, with a separating line]

  7. Two Dimensions: Complex Case
  • When one line is not enough, several lines are combined: a1X + a2Y + a3 = 0, b1X + b2Y + b3 = 0, c1X + c2Y + c3 = 0.
  [Figure: each line drawn as a unit that sums the weighted inputs X and Y]

  8. Two Dimensions: Neural Networks (Perceptron, Pages 247, 250-251)
  • A single perceptron computes one line; a multilayer neural network combines several of them through an input layer, a hidden layer, and an output layer.
  [Figure: a perceptron and a multilayer network built from the weights a, b, c of the previous slide, with inputs X and Y]

  9. Preliminary: Dot Product (Page 248, Eq. 5.23)
  • The perceptron output is based on the dot product of the weight vector and the input vector, with a constant input x0 = 1 for the bias term:
  W · X = (wd, wd-1, …, w1, w0) · (xd, xd-1, …, x1, x0) = wd·xd + wd-1·xd-1 + … + w1·x1 + w0·x0
  [Figure: a perceptron with inputs x1, …, xd and weights w0, w1, …, wd feeding a summation node]
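A minimal sketch of the perceptron output as this dot product followed by a sign function (the example weights reuse the line from the next slide and are illustrative):

```python
# Perceptron output: sign of the dot product W . X, with the bias w0
# attached to a constant input x0 = 1 as on the slide.
def perceptron_output(w, x):
    """w = (w0, w1, ..., wd), x = (x1, ..., xd); returns +1 or -1."""
    activation = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if activation > 0 else -1

# The boundary 9*x1 + 13*x2 - 117 = 0 written as weights (w0, w1, w2).
w = (-117, 9, 13)
print(perceptron_output(w, (7, 8)))   # 1
print(perceptron_output(w, (4, 2)))   # -1
```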

  10. Learning the ANN Model: Error Function (Page 253, Eq. 5.25)
  • Six training points in the (x1, x2) plane are labeled y = 1 or y = -1.
  • The line 9x1 + 13x2 - 117 = 0, i.e., w1 = 9, w2 = 13, w0 = -117, separates the two classes.
  [Figure: scatter plot of the points (3, 9), (7, 8), (2, 6), (9, 5), (7, 3), (4, 2) with the separating line]

  11. Learning the ANN Model: Bad Line (Error Function, Page 253, Eq. 5.25)
  • For the bad line 11x1 + 2x2 - 66 = 0, two of the six training points fall on the wrong side, so the squared error is
  E(w) = 0.5 * {(1-(-1))² + (1-1)² + (1-1)² + (1-1)² + (1-1)² + (-1-1)²} = 4
  [Figure: the same six points with the bad line 11x1 + 2x2 - 66 = 0]
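A sketch of this error computation. The class assignment used below (y = 1 for (3, 9), (7, 8), (9, 5) and y = -1 for (2, 6), (7, 3), (4, 2)) is an assumption, chosen because it reproduces the E(w) = 4 shown on this slide and E(w) = 0 for the line of slide 10:

```python
# Squared-error function E(w) = 0.5 * sum_i (y_i - yhat_i)^2 for a unit with
# a sign activation. The label assignment is an assumption (see lead-in).
def predict(w1, w2, w0, x1, x2):
    return 1 if w1 * x1 + w2 * x2 + w0 > 0 else -1

def squared_error(w1, w2, w0, data):
    return 0.5 * sum((y - predict(w1, w2, w0, x1, x2)) ** 2 for x1, x2, y in data)

data = [(3, 9, 1), (7, 8, 1), (9, 5, 1), (2, 6, -1), (7, 3, -1), (4, 2, -1)]
print(squared_error(9, 13, -117, data))  # 0.0 for the line of slide 10
print(squared_error(11, 2, -66, data))   # 4.0 for the bad line above
```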

  12. Learning the ANN Model: Goal (Page 253)
  • The goal of the ANN learning algorithm is to determine a set of weights w that minimizes the total sum of squared errors, E(w) = 0.5 · Σi (yi - ŷi)².
  • A trick for finding the minimum value: the gradient descent method.
  [Figure: the six training points from the previous slides]

  13. f(x+x) • Taylor theorem x (new)x’=(old)x’+x + + Gradient descent methodMain idea Start with a point (random) f(x) Repeat Determine a descent direction Choose a step (λ> 0) Update Until stopping criterion is satisfied random x’ • A trick for finding minimum value

  14. f(x+x) • Taylor theorem x + + Gradient descent methodMain idea Start with a point (random) f(x) Repeat Determine a descent direction Choose a step (λ> 0) Update Until stopping criterion is satisfied • A trick for finding minimum value

  15. Gradient Descent Method: Example
  • f(x) = (x+1)² + 2, f'(x) = 2(x+1), λ = 0.25.
  • At x = 3: Δx = -0.25 * 8 = -2, so (new) x' = 3 - 2 = 1.
  [Figure: the parabola f(x) with the step from x = 3 to x = 1]

  16. Gradient Descent Method: Example (continued)
  • f(x) = (x+1)² + 2, f'(x) = 2(x+1), λ = 0.25.
  • x = 3: Δx = -0.25 * 8 = -2; x = 1: Δx = -0.25 * 4 = -1; x = 0: Δx = -0.25 * 2 = -0.5.
  • The iterates 3, 1, 0, -0.5, … approach the minimum at x = -1.
  [Figure: the parabola with the successive steps marked]
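A short Python sketch of this example, reproducing the steps shown on slides 15 and 16:

```python
# Gradient descent on f(x) = (x + 1)^2 + 2 with step size lambda = 0.25.
def f(x):
    return (x + 1) ** 2 + 2

def f_prime(x):
    return 2 * (x + 1)

lam = 0.25
x = 3.0                          # starting point used on the slide
for _ in range(5):
    delta_x = -lam * f_prime(x)  # descent direction scaled by the step size
    x = x + delta_x              # (new) x' = (old) x' + delta x
    print(x)
# 1.0, 0.0, -0.5, -0.75, -0.875: the iterates approach the minimum at x = -1
```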

  17. Gradient Descent Method: Two Dimensions (from a MATLAB demo)
  [Figure: gradient descent on a two-dimensional surface, from the MATLAB demo]

  18. f(x+x) f(x) x random Page: 254 – 5.26 (new)x’ =(old)x’+x x’ Gradient descent methodError function • Minimum target : • Adjust value : • 

  19. Sigmoid Function
  [Figure: unit output plotted against the weighted sum, with output values 1 and -1 and the threshold value 117 marked]

  20. Sigmoid Function: Nice Property
  [Figure: the sigmoid formula and its derivative, shown as images on the slide]
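A hedged reconstruction of the formulas behind these two slides (they appear only as images in the transcript): the standard sigmoid function and the "nice property" that its derivative can be written in terms of its own value:

\[
\sigma(x) = \frac{1}{1 + e^{-x}},
\qquad
\frac{d\sigma(x)}{dx} = \sigma(x)\,\bigl(1 - \sigma(x)\bigr).
\]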

  21. Derivation for Two Cases (from Machine Learning [6])
  [Figure: the setup of the derivation, shown as equations on the slide]

  22. Output Unit (1) (from Machine Learning [6])
  • There is no direct relationship between Ed and netj; Ed depends on netj only through the unit's output, so the chain rule is applied.
  [Figure: the first steps of the derivation, shown as equations on the slide]

  23. Output Unit (2) (from Machine Learning [6])
  [Figure: the remaining steps of the output-unit derivation, shown as equations on the slide]
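A hedged reconstruction of the output-unit derivation these two slides walk through, following the standard presentation in Machine Learning [6] (with E_d = ½ Σ_k (t_k - o_k)², o_j = σ(net_j), and net_j = Σ_i w_ji x_ji):

\[
\frac{\partial E_d}{\partial w_{ji}}
= \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}
= \frac{\partial E_d}{\partial net_j}\,x_{ji},
\qquad
\frac{\partial E_d}{\partial net_j}
= \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}
= -(t_j - o_j)\,o_j\,(1 - o_j).
\]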

  24. Learning the ANN Model: Adjust (Error Function, Page 253, Eq. 5.25)
  • Starting from the bad line 11x1 + 2x2 - 66 = 0 with E(w) = 0.5 * {(1-(-1))² + (1-1)² + (1-1)² + (1-1)² + (1-1)² + (-1-1)²} = 4, the weights are adjusted step by step to reduce the error.
  [Figure: the same six points and the bad line from slide 11]

  25. The Advantages of Neural Networks: Approximating Any Function
  • Multilayer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target function. (Page 255)
  • Feedforward networks containing three layers of units are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer. (From Machine Learning [6], page 122)

  26. The Advantages of Neural Networks: Handling Redundant Features
  • ANNs can handle redundant features because the weights are learned automatically during the training step; the weights for redundant features tend to be very small. (Page 256)
  • This acts as a form of feature selection, e.g., the learned boundary 1*x + 0*y - 60 = 0 ignores the redundant feature y.
  [Figure: points plotted against Math and Height, separated by the vertical line Math = 60]

  27. The Disadvantages of Neural Networks: Sensitivity to Noise
  • Neural networks are quite sensitive to the presence of noise in the training data.
  [Figure: the Math/Height example from slide 26 with noisy training points]

  28. f(x) x x The disadvantages of neural networkLocal minimum • The gradient descent method used for learning the weights of an ANN often converges to some local minimum. f(x)

  29. The Disadvantages of Neural Networks: Time-Consuming Training
  • Training an ANN is a time-consuming process, especially when the number of hidden nodes is large.
  [Figure: the six training points and the line 11x1 + 2x2 - 66 = 0 being adjusted over many iterations]

  30. Real Dataset: MNIST
  • Each handwritten digit image is 28 × 28 pixels, i.e., 784 dimensions.
  [Figure: a sample MNIST digit]

  31. Feature Selection: Wavelet Transform
  • A multimedia object can be represented by low-level features.
  • These features are high-dimensional, so a wavelet transform is used to reduce the dimensionality.

  32. Introduction: Dimensionality Reduction by Wavelet Transform
  • Each level of the transform reduces the number of features by roughly a factor of four: 90000 → 22500 → 5625 → 1406.
  [Figure: the image at each level of the transform]
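A minimal sketch of how a repeated 2-D low-pass (averaging) step shrinks the feature count by roughly a factor of four per level, as in the 90000 → 22500 → 5625 → 1406 sequence above; the simple 2×2 block average is an illustrative stand-in, not the exact wavelet transform used in the presentation:

```python
# Each level averages non-overlapping 2x2 blocks, halving both image sides.
import numpy as np

def lowpass_level(image):
    h, w = image.shape
    h, w = h - h % 2, w - w % 2                    # truncate an odd row/column
    blocks = image[:h, :w].reshape(h // 2, 2, w // 2, 2)
    return blocks.mean(axis=(1, 3))

image = np.random.rand(300, 300)                   # 300 * 300 = 90000 features
for _ in range(3):
    image = lowpass_level(image)
    print(image.size)
# 22500, 5625, 1369 (the slide's 1406 is roughly 5625 / 4; the exact count
# depends on how the odd 75-pixel side is handled)
```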

  33. Result of wavelet transform

  34. Support Vector Machines: Which Hyperplane?
  [Figure: two classes of points (y = 1 and y = -1) that can be separated by more than one hyperplane]

  35. Support Vector Machines: Margin
  • Margin = |d+| + |d-|, the distance from the decision boundary to the closest point of each class.
  [Figure: two classes with a separating hyperplane and the distances d+ and d-]

  36. Support Vector Machines: Maximum Margin
  • SVM chooses the separating hyperplane with the maximum margin; the training points that lie on the margin boundaries are the support vectors.
  [Figure: two candidate hyperplanes with distances d+ and d-; the support vectors touch the margin]

  37. Support Vector Machines: Classifier of Two Classes
  • Training data (x1, y1), …, (xn, yn) with yi ∈ {1, -1}.
  • Find a separating hyperplane, e.g., 9x1 + 13x2 - 117 = 0.
  [Figure: the six labeled training points x1, …, x6 in the plane]

  38. Support Vector Machines: Rescale
  • We can rescale the parameters w and b of the decision boundary so that the two parallel margin hyperplanes bi1 and bi2 can be expressed as w · x + b = 1 and w · x + b = -1. (Page 261, Eqs. 5.32 and 5.33)
  [Figure: the decision boundary with the parallel hyperplanes bi1 and bi2 and the distances d+ and d-]

  39. Support Vector Machines: Margin (Page 261, Eqs. 5.32-5.34)
  • Taking a point x1 on bi1 and a point x2 on bi2 and subtracting the two hyperplane equations gives w · (x1 - x2) = 2, from which the margin width d is obtained.
  [Figure: the two margin hyperplanes bi1 and bi2 with the points x1, x2 and the margin width d]
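A hedged reconstruction of the margin computation the slide points to (standard SVM algebra, consistent with the cited equations):

\[
\mathbf{w}\cdot\mathbf{x}_1 + b = 1,
\quad
\mathbf{w}\cdot\mathbf{x}_2 + b = -1
\;\Rightarrow\;
\mathbf{w}\cdot(\mathbf{x}_1 - \mathbf{x}_2) = 2
\;\Rightarrow\;
d = \frac{2}{\|\mathbf{w}\|}.
\]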

  40. Support Vector Machines: Objective Function (Page 262, Definition 5.1)
  • The learning task in SVM can be formalized as the following constrained optimization problem (reconstructed below).
  [Figure: the two classes with the maximum-margin hyperplane and margin width d]
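A hedged reconstruction of that constrained optimization problem (the standard maximum-margin formulation; the slide shows it only as an image):

\[
\min_{\mathbf{w},\,b}\;\frac{\|\mathbf{w}\|^{2}}{2}
\quad\text{subject to}\quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\quad i = 1,\dots,N.
\]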

  41. Lagrange Multipliers: Problem
  • The learning task in SVM can be formalized as the constrained optimization problem above.
  • Such a problem has two parts: a function to maximize or minimize, and a constraint.
  • Example: a T-shirt costs Px dollars and a skirt costs Py dollars. With happiness function U(X, Y) and income A, maximize U(X, Y) subject to Px·X + Py·Y ≤ A.

  42. Lagrange Multipliers: Concept
  • For the example above (maximize U(X, Y) subject to Px·X + Py·Y ≤ A), introduce a Lagrange multiplier λ ≥ 0 and form L(X, Y, λ) = U(X, Y) - λ(Px·X + Py·Y - A).
  • If (Px·X + Py·Y - A) > 0, i.e., the constraint is violated, the subtracted term is positive.
  • This transforms the constrained maximization problem into an unconstrained maximization problem.

  43. Lagrange Multipliers: Example
  • U(X, Y) = X·Y with Px = 2, Py = 4, income A = 40: maximize L(X, Y, λ) = X·Y - λ(2X + 4Y - 40).
  • ∂L/∂X = Y - 2λ = 0 … (1); ∂L/∂Y = X - 4λ = 0 … (2); ∂L/∂λ = 40 - 2X - 4Y = 0 … (3).
  • From (1) and (2), X = 2Y … (4); from (3) and (4), 40 - 8Y = 0, so Y = 5, X = 10, λ = 2.5.
  • Lagrange multipliers transform the constrained maximization problem (a function plus a constraint) into an unconstrained one.
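A small check of this worked example, solving the same three stationarity equations symbolically (sympy is used here purely for illustration):

```python
# Verify the Lagrange-multiplier example: U(X, Y) = X*Y, constraint 2X + 4Y = 40.
import sympy as sp

X, Y, lam = sp.symbols("X Y lam")
L = X * Y - lam * (2 * X + 4 * Y - 40)
solution = sp.solve([sp.diff(L, X), sp.diff(L, Y), sp.diff(L, lam)],
                    [X, Y, lam], dict=True)
print(solution)   # [{X: 10, Y: 5, lam: 5/2}], matching X = 10, Y = 5, lambda = 2.5
```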

  44. If (yi(wxi+b)<1), the result is negative. Page: 262 5.39 5.40 Don’t forget it Support Vector MachinesLagrange multipliers • Constrained optimization problem • Lagrange Multipliers:

  45. Support Vector Machines: Non-Separable Case (Simple Case)
  • Non-linear mapping function Φ(x) = x²: one-dimensional points (at -2, -1, 0, 1, 2, 3) that a single threshold cannot separate become separable after the mapping.
  [Figure: the points on the original axis and after the mapping]

  46. (X) (X) (X) (X) (X) (X) (X) (X) (O) (X) (O) (X) (O) (X) (O) (O) (O) (X) (O) (X) (X) Support Vector MachinesNon-separable Case Non-linear mapping function  Hard

  47. (X) (X) (X) (X) (X) (X) (X) (X) (O) Page: 273 5.59 (X) (O) (X) (O) (X) (O) (O) (O) (X) (O) (X) (X) Support Vector MachinesObservation Non-linear mapping function 

  48. Support Vector Machines: Kernel Trick (Page 273)
  • We do not have to know the exact form of the mapping function Φ, because the kernel functions used in nonlinear SVM must satisfy a mathematical principle known as Mercer's theorem. (Theorem 5.1)
  • This principle ensures that the kernel functions can always be expressed as the dot product between two input vectors in some high-dimensional space.

  49. Kernel Trick: Example (R² → R³)
  • We don't have to know the mapping function Φ: the dot product in R³ can be computed directly from the original vectors in R².
  [Figure: the mapping and the kernel computation, shown as equations on the slide]
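A sketch of this example with an assumed mapping Φ(x) = (x1², √2·x1·x2, x2²) from R² to R³ (a common choice for this illustration; the slide's own equations are not in the transcript), showing that the dot product in R³ can be computed as a kernel in R²:

```python
# The dot product of the mapped vectors equals (u . v)^2, so the mapping
# phi never has to be applied explicitly. phi is an assumed example mapping.
import math

def phi(x):
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

u, v = (3.0, 1.0), (2.0, 4.0)
print(dot(phi(u), phi(v)))   # 100.0, computed in R^3
print(dot(u, v) ** 2)        # 100.0, computed directly in R^2 as K(u, v) = (u . v)^2
```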

  50. (X) (X) (X) (X) (X) (X) (X) (X) Polynomial (O) (X) (O) Page:275 5.63 5.64 5.65 (X) (O) (X) (O) (O) Gaussian (O) (X) (O) (X) (X) Sigmoidal Support Vector MachinesKernel Trick
