Classification: Numerical classifiers. Introduction to Data Mining, Pang-Ning Tan, Michael Steinbach, Vipin Kumar. Presented by: Ding-Ying Chiu. Date: 2008/10/17
Introduction: Applications • Training • Classifier • Applications of classification: handwritten digit recognition, human face recognition
Training data: One dimension • Training data: a training datum consists of an input vector and a class label. • Feature: a feature denotes a piece of information about the objects; each feature is one dimension. • [figure: one-dimensional example with Math scores 20, 45, 60, 80, 85, 95 as input vectors, each labeled appropriate or inappropriate]
Two steps: One dimension • Training step: based on the training data, we compute a boundary that separates the two classes. • Classifying step: a test datum has only the input vector; its class label is decided by which side of the boundary it falls on. A minimal sketch follows. • [figure: the one-dimensional Math example with a decision boundary separating the two classes]
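Below is a minimal Python sketch of the two steps for the one-dimensional example; the midpoint rule and the particular labeling of the Math scores are illustrative assumptions, not the slides' method.

```python
# Minimal sketch of the two steps for one-dimensional data.
# The midpoint boundary and the labeling below are illustrative assumptions.

def train(scores, labels):
    """Training step: compute a boundary that separates the two classes."""
    pos = [s for s, y in zip(scores, labels) if y == "appropriate"]
    neg = [s for s, y in zip(scores, labels) if y == "inappropriate"]
    # Put the boundary halfway between the two class means.
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def classify(score, boundary):
    """Classifying step: decide the label from the side of the boundary."""
    return "appropriate" if score >= boundary else "inappropriate"

scores = [20, 45, 60, 80, 85, 95]                      # Math scores from the slide
labels = ["inappropriate"] * 3 + ["appropriate"] * 3   # hypothetical class labels
boundary = train(scores, labels)                       # about 64.2 with this labeling
print(boundary, classify(72, boundary))
```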
Preliminary: Linear equations (senior-high-school math) • If we substitute a vector's coordinates for x and y and the result of the function is zero, the vector lies on the line; if the result is greater than zero, the vector is above the line; if the result is less than zero, the vector is below the line. • [figure: the line X + Y - 8 = 0 through (0, 8) and (8, 0); B (6, 7) lies above it, A (4, 4) lies on it, and C (2, 2) lies below it]
Two dimensions: Simple case • With two features (Math and English), the boundary is a line a1*X + a2*Y + a3 = 0, found by linear programming. • [figure: scatter plot of (Math, English) points such as (10, 90), (65, 100), (80, 80), (85, 70), (30, 60), (50, 60), (70, 60), (95, 50), (60, 45), (80, 20), labeled appropriate or inappropriate and separated by a single line]
Two dimensions: Complex case • When a single line cannot separate the two classes, several lines (units) are combined. • [figure: three linear units, each taking inputs X and Y with weights (a1, a2, a3), (b1, b2, b3), (c1, c2, c3) and computing a1*X + a2*Y + a3 = 0, b1*X + b2*Y + b3 = 0, c1*X + c2*Y + c3 = 0]
Two dimensions: Neural networks (Pages 247, 250, 251) • [figure: left, a single perceptron with inputs X and Y and weights a1, a2, a3; right, a multilayer neural network with an input layer (X, Y), a hidden layer of units with weights (a1, a2, a3), (b1, b2, b3), (c1, c2, c3), and an output layer]
Preliminary: Dot product (Page 248, Eq. 5.23) • The output of a unit is the dot product of the weight vector and the input vector: W·X = (wd, wd-1, ..., w1, w0) · (xd, xd-1, ..., x1, x0) = wd*xd + wd-1*xd-1 + ... + w1*x1 + w0*x0, where x0 = 1 so that w0 acts as the bias. • [figure: a unit with inputs x1, ..., xd plus x0 = 1 and weights w1, ..., wd, w0; compare the two-dimensional unit a1*X + a2*Y + a3 = 0]
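A minimal Python sketch of the dot-product form W·X, using the weights w1 = 9, w2 = 13, w0 = -117 that appear on the following slide; the variable names are illustrative.

```python
# Dot-product form of a single unit: W.X = w1*x1 + w2*x2 + w0*x0 with x0 = 1 as the bias input.
import numpy as np

w = np.array([9, 13, -117])   # (w1, w2, w0) of the line 9*x1 + 13*x2 - 117 = 0
x = np.array([3, 9, 1])       # (x1, x2, x0) for the point (3, 9), with x0 = 1
print(np.dot(w, x))           # 27 + 117 - 117 = 27 > 0: the point is on the positive side
```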
Learning the ANN Model: Error function (Page 247; error function on Page 253, Eq. 5.25) • E(w) = 0.5 * Σi (yi - ŷi)^2 • A trained boundary: w1 = 9, w2 = 13, w0 = -117, i.e. the line 9x1 + 13x2 - 117 = 0. • [figure: six training points (3, 9), (7, 8), (2, 6), (9, 5), (7, 3), (4, 2) in the (x1, x2) plane, three labeled y = 1 and three labeled y = -1, separated by the line 9x1 + 13x2 - 117 = 0]
Learning the ANN Model: Bad line • Error function (Page 253, Eq. 5.25): E(w) = 0.5 * Σi (yi - ŷi)^2 • For the line 11x1 + 2x2 - 66 = 0, two of the six training points are misclassified, so E(w) = 0.5 * {(1-(-1))^2 + (1-1)^2 + (1-1)^2 + (1-1)^2 + (1-1)^2 + (-1-1)^2} = 4. • [figure: the same six points with the line 11x1 + 2x2 - 66 = 0, which fails to separate the two classes]
Learning the ANN Model: Goal • Page 253: the goal of the ANN learning algorithm is to determine a set of weights w that minimizes the total sum of squared errors E(w) = 0.5 * Σi (yi - ŷi)^2. • A trick for finding the minimum value: the gradient descent method. A sketch of the error computation is given below. • [figure: the six training points (3, 9), (7, 8), (2, 6), (9, 5), (7, 3), (4, 2)]
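A minimal Python sketch of the squared-error objective; the sign activation used for ŷ and the class assignment below are illustrative assumptions (the assignment is chosen to be consistent with the slide's E(w) = 4 for the bad line).

```python
# Squared-error objective E(w) = 0.5 * sum_i (y_i - yhat_i)^2 for a linear unit
# with a sign activation (an illustrative reading of Eq. 5.25).
import numpy as np

def predict(w, X):
    """yhat_i = sign(w1*x1 + w2*x2 + w0) for each row of X."""
    return np.sign(X @ w[:2] + w[2])

def squared_error(w, X, y):
    return 0.5 * np.sum((y - predict(w, X)) ** 2)

X = np.array([[3, 9], [7, 8], [9, 5], [2, 6], [7, 3], [4, 2]])
y = np.array([1, 1, 1, -1, -1, -1])                   # hypothetical class assignment
print(squared_error(np.array([11, 2, -66]), X, y))    # 4.0 for the slide's "bad line"
```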
Gradient descent method: Main idea • A trick for finding the minimum value. • By Taylor's theorem, f(x + Δx) ≈ f(x) + f'(x)Δx, so a small step against the gradient decreases f. • Algorithm: start with a random point x'; repeat {determine a descent direction; choose a step size λ > 0; update (new) x' = (old) x' + Δx} until a stopping criterion is satisfied. • [figure: a curve f(x) with a randomly chosen starting point x' moving downhill toward the minimum]
x 3 (new)x’=3-2=1 Gradient descent methodExample f(x)=(x+1)2+2 λ= 0.25 f’(x)=2(x+1) 15 x=3, x = -0.25*8 = -2 10 5
-0.5 0 1 Gradient descent methodExample f(x)=(x+1)2+2 λ= 0.25 f’(x)=2(x+1) 15 x=3, x = -0.25*8 = -2 10 x=1,x = -0.25*4 = -1 5 x=0,x = -0.25*2 = -0.5
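A minimal Python sketch reproducing the slide's worked example, f(x) = (x + 1)^2 + 2 with λ = 0.25.

```python
# Gradient descent on f(x) = (x + 1)^2 + 2 with step size lambda = 0.25.
def f(x):
    return (x + 1) ** 2 + 2

def f_prime(x):
    return 2 * (x + 1)

lam = 0.25
x = 3.0                        # starting point from the slide
for _ in range(10):
    dx = -lam * f_prime(x)     # -2, -1, -0.5, ... as on the slide
    x += dx                    # 3 -> 1 -> 0 -> -0.5 -> ... approaching the minimum at -1
    print(x, f(x))
```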
Gradient descent method: Error function (Page 254, Eq. 5.26) • Minimum target: E(w) = 0.5 * Σi (yi - ŷi)^2. • Adjust value: Δwj = -λ ∂E(w)/∂wj, i.e. (new) wj = (old) wj - λ ∂E(w)/∂wj. • [figure: the same descent picture, starting from a random point and updating (new) x' = (old) x' + Δx, now applied in weight space]
Sigmoid function • A smooth, differentiable activation used in place of the hard sign threshold, so that the gradient of E(w) can be computed; a small sketch follows. • [figure: an S-shaped curve with outputs between -1 and 1, transitioning around the decision value 117 of the earlier line]
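A minimal Python sketch of two common smooth activations; the slide only shows the curve, so the exact functional form used here is an assumption.

```python
# Two common smooth activations that replace the hard sign threshold.
import math

def logistic(x):
    """Logistic sigmoid, output in (0, 1); its derivative is s * (1 - s)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_activation(x):
    """tanh variant, output in (-1, 1), matching the -1/+1 class labels."""
    return math.tanh(x)

print(logistic(0.0), tanh_activation(0.0))   # 0.5 0.0
```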
Derivation for the two cases (from Machine Learning [6]): the weight-update rule is derived separately for output units and for hidden units.
Output unit (1) (from Machine Learning [6]) • There is no direct relationship between Ed and netj: Ed depends on netj only through the unit's output oj, so the chain rule is applied, as sketched below.
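A hedged reconstruction of the output-unit chain-rule steps, following the derivation in Machine Learning [6] with Ed = ½ Σk (tk - ok)^2 and a sigmoid output oj = σ(netj); the slide's own notation may differ slightly.

```latex
\frac{\partial E_d}{\partial w_{ji}}
  = \frac{\partial E_d}{\partial net_j}\,\frac{\partial net_j}{\partial w_{ji}}
  = \frac{\partial E_d}{\partial o_j}\,\frac{\partial o_j}{\partial net_j}\,x_{ji}
  = -(t_j - o_j)\,o_j(1 - o_j)\,x_{ji}
```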
Learning the ANN Model: Adjust • Error function (Page 253, Eq. 5.25): E(w) = 0.5 * Σi (yi - ŷi)^2 • The weights of the bad line 11x1 + 2x2 - 66 = 0 are adjusted by gradient descent; before adjustment, E(w) = 0.5 * {(1-(-1))^2 + (1-1)^2 + (1-1)^2 + (1-1)^2 + (1-1)^2 + (-1-1)^2} = 4. • [figure: the six training points with the line 11x1 + 2x2 - 66 = 0]
The advantages of neural networks: To approximate any function • Multilayer neural networks with at least one hidden layer are universal approximators; i.e., they can be used to approximate any target function. (Page 255) • Feedforward networks containing three layers of units are able to approximate any function to arbitrary accuracy, given a sufficient (potentially very large) number of units in each layer. (From Machine Learning [6], Page 122)
The advantages of neural networks: Handle redundant features • ANNs can handle redundant features because the weights are learned automatically during the training step; the weights of redundant features tend to be very small. (Page 256) • This acts as a form of feature selection: the learned boundary 1*x + 0*y - 60 = 0 gives the redundant feature (Height) a zero weight and keeps only Math. • [figure: Math vs. Height axes with the boundary Math = 60]
The disadvantages of neural networks: Sensitive to the presence of noise • Neural networks are quite sensitive to the presence of noise in the training data. • [figure: the Math/Height example with noisy training points]
The disadvantages of neural networks: Local minimum • The gradient descent method used for learning the weights of an ANN often converges to some local minimum. • [figure: a function f(x) with a local minimum distinct from the global minimum]
The disadvantages of neural networks: Time consuming • Training an ANN is a time-consuming process, especially when the number of hidden nodes is large. • [figure: the six-point training example with the line 11x1 + 2x2 - 66 = 0]
Real dataset: MNIST • Each handwritten digit image is 28 x 28 pixels, i.e. 784 dimensions.
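A minimal Python sketch of flattening one 28 x 28 image into a 784-dimensional input vector; the random image stands in for a real MNIST digit, since loading the dataset is outside the slides' scope.

```python
# Flatten a 28 x 28 image into a 784-dimensional feature vector.
import numpy as np

image = np.random.randint(0, 256, size=(28, 28))   # stand-in for one MNIST digit
x = image.reshape(-1)                              # 28 * 28 = 784 values
print(x.shape)                                     # (784,)
```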
Feature selection: Wavelet transform • A multimedia object can be represented by low-level features. • The resulting dimensionality is high. • The wavelet transform is used to reduce it.
Introduction • Each level of the wavelet transform keeps roughly one quarter of the coefficients: 90000, 22500, 5625, 1406. A sketch of one averaging level is given below.
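A minimal sketch of one reduction level, assuming the simplest Haar-style approximation (averaging 2 x 2 pixel blocks); the slides do not state which wavelet or boundary handling is used, which is why the last level below gives 1369 rather than the slide's 1406.

```python
# One level of a Haar-style approximation: average each 2 x 2 block,
# keeping roughly a quarter of the coefficients per level.
import numpy as np

def haar_approximation_level(img):
    h, w = img.shape
    img = img[: h - h % 2, : w - w % 2]   # crop an odd row/column if present
    return 0.25 * (img[0::2, 0::2] + img[0::2, 1::2]
                   + img[1::2, 0::2] + img[1::2, 1::2])

img = np.random.rand(300, 300)            # 300 * 300 = 90000 values
for _ in range(3):
    img = haar_approximation_level(img)
    print(img.size)                       # 22500, 5625, 1369 (cropping instead of padding)
```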
Support Vector Machines: Margin • Margin = |d+| + |d-|, the sum of the distances from the decision boundary to the closest point of each class. • [figure: two classes y = 1 and y = -1 separated by a boundary, with distances d+ and d- to the nearest point on each side]
Support Vector Machines: Maximum margin • Among the separating boundaries, SVM chooses the one with the maximum margin; the training points closest to it are the support vectors. • [figure: two candidate boundaries for the classes y = 1 and y = -1; the one with the larger |d+| + |d-| is preferred, and its support vectors are highlighted]
Support Vector Machines: Classifier of two classes • Training data (x1, y1), ..., (xn, yn) with yi ∈ {1, -1}. • Find a separating hyperplane, e.g. 9x1 + 13x2 - 117 = 0. • [figure: the six training points (3, 9), (7, 8), (2, 6), (9, 5), (7, 3), (4, 2), three in each class, separated by the line 9x1 + 13x2 - 117 = 0]
Support Vector Machines: Rescale • We can rescale the parameters w and b of the decision boundary so that the two parallel margin hyperplanes bi1 and bi2 can be expressed as (Page 261, Eqs. 5.32 and 5.33): bi1: w·x + b = 1 and bi2: w·x + b = -1. • [figure: the boundary with the parallel hyperplanes bi1 and bi2 at distances d+ and d- on the y = 1 and y = -1 sides]
Support Vector Machines: Margin d (Page 261, Eqs. 5.32, 5.33, 5.34) • Subtracting Eq. 5.33 from Eq. 5.32 for a point x1 on bi1 and a point x2 on bi2 gives w·(x1 - x2) = 2, so the margin is d = 2 / ||w||. • [figure: points x1 and x2 on the hyperplanes bi1 and bi2 and the vector x1 - x2 spanning the margin d]
Support Vector Machines: Objective function (Page 262, Definition 5.1) • The learning task in SVM can be formalized as the following constrained optimization problem: minimize ||w||^2 / 2 subject to yi(w·xi + b) ≥ 1 for i = 1, ..., n; maximizing the margin d = 2/||w|| is equivalent to minimizing ||w||^2 / 2. • [figure: the maximum-margin boundary between the classes y = 1 and y = -1]
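A minimal Python sketch of fitting a linear maximum-margin classifier to the six points; scikit-learn and the specific class assignment are assumptions added for illustration, not part of the slides.

```python
# Fit a (nearly) hard-margin linear SVM to the slides' six points.
import numpy as np
from sklearn.svm import SVC

X = np.array([[3, 9], [7, 8], [9, 5], [2, 6], [7, 3], [4, 2]])
y = np.array([1, 1, 1, -1, -1, -1])      # hypothetical labeling of the two classes

clf = SVC(kernel="linear", C=1e6)        # a large C approximates the hard-margin problem
clf.fit(X, y)
print(clf.coef_, clf.intercept_)         # learned w and b of the maximum-margin boundary
print(clf.support_vectors_)              # the support vectors
```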
Lagrange multipliers: Problem • The SVM learning task is a constrained optimization problem: a function to maximize or minimize, plus a constraint. • Example: a T-shirt costs Px dollars and a skirt costs Py dollars. With a happiness function U(X, Y) and income A, maximize U(X, Y) subject to Px*X + Py*Y ≤ A.
Lagrange multipliers: Concept • Introduce a Lagrange multiplier λ ≥ 0 and form L(X, Y, λ) = U(X, Y) - λ(Px*X + Py*Y - A); if the constraint is violated, (Px*X + Py*Y - A) > 0 and the subtracted term is positive, penalizing the objective. • This transforms the constrained maximization problem into an unconstrained maximization problem. • Example: a T-shirt costs Px dollars and a skirt costs Py dollars; happiness function U(X, Y), income A; maximize U(X, Y) subject to Px*X + Py*Y ≤ A.
Lagrange multipliers: Example • U(X, Y) = X*Y with Px = 2, Py = 4, A = 40. • Maximize L(X, Y, λ) = X*Y - λ(2X + 4Y - 40): ∂L/∂X = Y - 2λ = 0 ...(1); ∂L/∂Y = X - 4λ = 0 ...(2); ∂L/∂λ = 40 - 2X - 4Y = 0 ...(3). • From (1) and (2): X = 2Y ...(4). From (3) and (4): 40 - 8Y = 0, so Y = 5, X = 10, λ = 2.5. • Recall: the problem is a maximum or minimum function plus a constraint; Lagrange multipliers transform the constrained maximization problem into an unconstrained one.
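A minimal Python sketch that verifies the example symbolically; sympy is an assumed dependency, since the slide solves the system by hand.

```python
# Verify the Lagrange example for U = X*Y, Px = 2, Py = 4, A = 40.
import sympy as sp

X, Y, lam = sp.symbols("X Y lam")
L = X * Y - lam * (2 * X + 4 * Y - 40)   # the Lagrangian L(X, Y, lambda)

solution = sp.solve([sp.diff(L, X), sp.diff(L, Y), sp.diff(L, lam)], [X, Y, lam])
print(solution)                           # X = 10, Y = 5, lambda = 5/2
```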
Support Vector Machines: Lagrange multipliers (Page 262, Eqs. 5.39 and 5.40) • Constrained optimization problem: minimize ||w||^2 / 2 subject to yi(w·xi + b) ≥ 1. • Lagrange multipliers λi ≥ 0: L = ||w||^2 / 2 - Σi λi (yi(w·xi + b) - 1). • Don't forget: if yi(w·xi + b) < 1, the term (yi(w·xi + b) - 1) is negative, so the subtracted penalty increases the Lagrangian.
Support Vector Machines: Non-separable case, simple example • Non-linear mapping function Φ(x) = x^2. • [figure: one-dimensional points -2, -1, 0, 1, 2, 3; after applying Φ(x) = x^2 the two classes become linearly separable]
Support Vector Machines: Non-separable case • In general, finding and computing a suitable non-linear mapping function directly is hard. • [figure: a two-dimensional dataset where points of class O are surrounded by points of class X, so no straight line separates them]
Support Vector Machines: Observation (Page 273, Eq. 5.59) • In the dual formulation, the non-linear mapping function Φ appears only inside dot products Φ(xi)·Φ(xj). • [figure: the same O-inside-X dataset]
Support Vector Machines: Kernel trick (Page 273) • We do not have to know the exact form of the mapping function, because the kernel functions used in nonlinear SVM must satisfy a mathematical principle known as Mercer's theorem (Theorem 5.1). • This principle ensures that the kernel functions can always be expressed as the dot product between two input vectors in some high-dimensional space. • [figure: the O-inside-X dataset]
Kernel trick: Example (R^2 to R^3) • A kernel computed in R^2 corresponds to a dot product after a mapping into R^3; we do not have to know the mapping function itself. A sketch with one standard choice of kernel and mapping follows.
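A minimal Python sketch of one common R^2 to R^3 instance of the trick (the slide does not say which kernel it uses): the kernel K(u, v) = (u·v)^2 equals the dot product of the mapped vectors Φ(x) = (x1^2, √2*x1*x2, x2^2).

```python
# Kernel trick sketch: K(u, v) = (u.v)^2 equals phi(u).phi(v) for
# phi(x) = (x1^2, sqrt(2)*x1*x2, x2^2), so the R^3 mapping never has to be computed.
import math
import numpy as np

def kernel(u, v):
    return np.dot(u, v) ** 2

def phi(x):
    return np.array([x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2])

u, v = np.array([3.0, 9.0]), np.array([7.0, 8.0])
print(kernel(u, v), np.dot(phi(u), phi(v)))   # both give 8649.0
```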
Support Vector Machines: Kernel trick, common kernels (Page 275, Eqs. 5.63, 5.64, 5.65) • Polynomial, Gaussian, and sigmoidal kernel functions. • [figure: the O-inside-X dataset, separated by a nonlinear boundary]
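A minimal Python sketch of the three kernel families named on the slide, written in commonly used forms; the exact parameterizations of Eqs. 5.63 to 5.65 may differ, and the parameter names p, sigma, kappa, delta are illustrative.

```python
# Commonly used forms of the polynomial, Gaussian, and sigmoidal kernels.
import numpy as np

def polynomial_kernel(u, v, p=2):
    return (np.dot(u, v) + 1) ** p

def gaussian_kernel(u, v, sigma=1.0):
    diff = np.asarray(u) - np.asarray(v)
    return np.exp(-np.dot(diff, diff) / (2 * sigma ** 2))

def sigmoidal_kernel(u, v, kappa=1.0, delta=0.0):
    return np.tanh(kappa * np.dot(u, v) - delta)
```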