Linear Discriminators • Chapter 20, only relevant parts
Concerns • Generalization Accuracy • Efficiency • Noise • Irrelevant features • Generality: when does this work?
Linear Model • Let f1, …, fn be the feature values of an example. • Let the class be denoted {+1, -1}. • Define f0 = -1 (the bias feature). • A linear model defines weights w0, w1, …, wn; with f0 = -1, the weight w0 plays the role of the threshold. • Classification rule: • If w0*f0 + w1*f1 + … + wn*fn > 0, predict class +, else predict class -. • Briefly: W*F > 0, where * is the inner product of the weight vector and the feature vector, and F has been augmented with the extra -1.
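A minimal sketch of this rule in Python (the function name and the use of NumPy are illustrative, not from the chapter):

```python
import numpy as np

def predict(w, f):
    """Classify one example.
    w: weights [w0, w1, ..., wn]; f: features [f1, ..., fn].
    The bias feature f0 = -1 is prepended, so w0 acts as the threshold."""
    f_aug = np.concatenate(([-1.0], f))   # augment with f0 = -1
    return +1 if np.dot(w, f_aug) > 0 else -1
```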
Augmentation Trick • Suppose the data has features f1 and f2, and 2*f1 + 3*f2 > 4 is the classifier. • Equivalently: <4,2,3> * <-1,f1,f2> > 0. • Mapping the data <f1,f2> to <-1,f1,f2> lets us learn/represent the threshold as just another feature weight. • Mapping data into higher dimensions is the key idea behind SVMs.
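A quick numeric check of that equivalence (the values and helper below are made up for illustration):

```python
import numpy as np

w = np.array([4.0, 2.0, 3.0])             # <w0, w1, w2>, with w0 = 4 the old threshold

def classify(f1, f2):
    f_aug = np.array([-1.0, f1, f2])       # map <f1, f2> -> <-1, f1, f2>
    return +1 if np.dot(w, f_aug) > 0 else -1

print(classify(3.0, 1.0))   # 2*3 + 3*1 = 9 > 4        -> +1
print(classify(0.5, 0.5))   # 2*0.5 + 3*0.5 = 2.5 <= 4 -> -1
```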
Mapping to Enable Linear Separation • Let x1, …, xm be m vectors in R^N. • Map each xi into R^{N+m} by xi -> <xi, 0, …, 0, 1, 0, …, 0>, with the 1 in position N+i. • For any labelling of the xi by classes +/-, this embedding makes the data linearly separable: • Define wj = 0 for j <= N. • Define w(N+i) = +1 if xi is positive, and w(N+i) = -1 if xi is negative. • Then W applied to the embedded xi gives +1 on positives and -1 on negatives, so threshold 0 separates them.
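A sketch of that embedding with made-up data (array shapes and names are illustrative):

```python
import numpy as np

def embed(X):
    """Append an identity block: example i gets a 1 in position N+i."""
    m, N = X.shape
    return np.hstack([X, np.eye(m)])

X = np.random.randn(5, 3)                  # m = 5 examples in R^3
y = np.array([+1, -1, +1, +1, -1])         # an arbitrary labelling
X_emb = embed(X)                           # now in R^{3+5}
w = np.concatenate([np.zeros(3), y])       # wj = 0 on original coords, w(N+i) = label of xi
print(np.sign(X_emb @ w))                  # recovers y exactly
```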
Representational Power • For features in {0,1}: • “Or” of n features: wi = 1, threshold = 0 • “And” of n features: wi = 1, threshold = n-1 • K of n features (prototype): wi = 1, threshold = k-1 • Can’t do XOR with a single unit. • Combining linear threshold units yields any boolean function.
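A quick check of these constructions on 0/1 features (a sketch; the helper name is made up):

```python
import itertools

def threshold_unit(weights, threshold, features):
    """Predict 1 iff the weighted sum strictly exceeds the threshold."""
    return int(sum(w * f for w, f in zip(weights, features)) > threshold)

n, k = 4, 2
for f in itertools.product([0, 1], repeat=n):
    assert threshold_unit([1]*n, 0,     f) == int(any(f))       # OR
    assert threshold_unit([1]*n, n - 1, f) == int(all(f))       # AND
    assert threshold_unit([1]*n, k - 1, f) == int(sum(f) >= k)  # k of n
print("all 2^n cases check out")
```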
Classical Perceptron • Goal: find any W which separates the data. • Algorithm (each X augmented with the bias feature -1): • W = 0 • Repeat, for each example X: • If X is positive and W*X is wrong (≤ 0), W = W + X; • Else if X is negative and W*X is wrong (> 0), W = W - X. • Until no errors, or a very large number of passes.
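A minimal sketch of this training loop in NumPy (the function name, epoch cap, and the update written as W += y*X are illustrative restatements of the rule above):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """X: (m, n) feature matrix, y: labels in {+1, -1}.
    Returns augmented weights [w0, w1, ..., wn]."""
    X_aug = np.hstack([-np.ones((X.shape[0], 1)), X])  # prepend bias feature -1
    w = np.zeros(X_aug.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X_aug, y):
            if yi * np.dot(w, xi) <= 0:   # misclassified (or on the boundary)
                w += yi * xi              # W = W + X for positives, W = W - X for negatives
                errors += 1
        if errors == 0:                   # an epoch with no mistakes: separation found
            break
    return w
```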
Classical Perceptron • Theorem: if the concept is linearly separable, then the algorithm finds a solution. • Training time can be exponential in the number of features. • An epoch is a single pass through the entire data set. • Convergence can take exponentially many epochs, but is guaranteed to work. • Mistake bound: if |xi| < R and the margin is m, then the number of mistakes is at most R^2/m^2.
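Written out, the mistake bound says (this is Novikoff's perceptron convergence theorem; the unit-norm separating vector W* is the assumption behind the slide's "margin = m"):

```latex
% Perceptron mistake bound (Novikoff's theorem), assuming a unit-norm
% separator W* with margin m exists.
\[
\|x_i\| \le R
\quad\text{and}\quad
y_i \,(W^{*}\!\cdot x_i) \ge m \;\;\text{for all } i,\;\; \|W^{*}\| = 1
\;\;\Longrightarrow\;\;
\#\{\text{mistakes}\} \;\le\; \frac{R^{2}}{m^{2}}.
\]
```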
Hill-Climbing Search • This is an optimization problem. • The solution is found by hill-climbing, so there is no guarantee of reaching the optimal solution. • While the derivatives tell you the direction to move (the negative gradient), they do not tell you how much to change each weight Wi. • On the plus side, it is fast. • On the minus side, there is no guarantee of separation.
Hill-climbing View • Goal: minimize the squared error. • Let class yi be +1 or -1. • Squared error = sum over examples of Erri^2, where Erri = W*Xi - yi and Xi is the ith example. • This is a function of the weights only. • Use calculus: take partial derivatives with respect to each weight Wj. • To move to a lower value, move in the direction of the negative gradient, i.e. • the change in Wj is proportional to -2*Erri*Xij, summed over the examples (Xij is the jth feature of Xi).
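A minimal sketch of this step as batch gradient descent (the learning rate, epoch count, and names are illustrative):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, epochs=100):
    """X: (m, n) examples, y: labels in {+1, -1}. Returns augmented weights."""
    X_aug = np.hstack([-np.ones((X.shape[0], 1)), X])   # bias feature -1
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        err = X_aug @ w - y          # Err_i = W*X_i - y_i for every example
        grad = 2 * X_aug.T @ err     # partial derivatives of the squared error wrt each W_j
        w -= lr * grad               # step in the direction of the negative gradient
    return w
```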
Support Vector Machine • Goal: maximize the margin. • Assuming the line separates the data, the margin is the distance from the line to the closest example, positive or negative. • Good news: this can be solved by a quadratic program. • Implemented in Weka as SMO. • If the data is not linearly separable, the SVM will add more features.
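For concreteness, a maximum-margin linear SVM fit with scikit-learn rather than Weka's SMO (the library choice, toy data, and the large-C approximation of a hard margin are all illustrative assumptions, not from the chapter):

```python
import numpy as np
from sklearn.svm import SVC

# toy, linearly separable data
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])

clf = SVC(kernel='linear', C=1e6)    # large C approximates the hard-margin SVM
clf.fit(X, y)
print(clf.coef_, clf.intercept_)     # the separating hyperplane found by the QP solver
print(clf.predict([[1.5, 1.5]]))     # -> [1]
```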
If Not Linearly Separable • Add more nodes: neural nets • Can represent any boolean function (why?) • No guarantees about learning • Slow • Incomprehensible • Add more features: SVM • Can represent any boolean function • Learning guarantees • Fast • Semi-comprehensible
Adding Features • Suppose a point (x,y) is positive if it lies in the unit disk, and negative otherwise. • Clearly not linearly separable in the plane. • Map (x,y) -> (x, y, x^2 + y^2). • Now, in 3-space, the data is easily separable (e.g. by the plane z = 1). • This works for any learning algorithm, but an SVM will almost do it for you (by setting its kernel parameters).
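A sketch of this mapping on randomly generated points (the data generation and the explicit plane z = 1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(200, 2))
labels = np.where(pts[:, 0]**2 + pts[:, 1]**2 <= 1.0, +1, -1)   # positive inside the unit disk

# map (x, y) -> (x, y, x^2 + y^2)
pts3d = np.hstack([pts, (pts[:, 0]**2 + pts[:, 1]**2)[:, None]])

# in 3-space the plane z = 1 separates the classes
w = np.array([0.0, 0.0, -1.0])       # weights on (x, y, z)
b = 1.0                              # predict + exactly when z < 1
preds = np.where(pts3d @ w + b > 0, +1, -1)
print((preds == labels).mean())      # 1.0 -- linearly separable after the mapping
```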