Linear Discriminators


Presentation Transcript


  1. Linear Discriminators • Chapter 20 (only the relevant parts)

  2. Concerns • Generalization Accuracy • Efficiency • Noise • Irrelevant features • Generality: when does this work?

  3. Linear Model • Let f1, …, fn be the feature values of an example. • Let the class be denoted {+1, -1}. • Define f0 = -1 (so w0 acts as the bias weight). • A linear model defines weights w0, w1, .., wn. • w0 is the threshold (since f0 = -1). • Classification rule: • If w0*f0 + w1*f1 + .. + wn*fn > 0, predict class +; else predict class -. • Briefly: W*F > 0, where * is the inner product of the weight vector W and the feature vector F, and F has been augmented with the extra entry f0 = -1.
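A minimal Python sketch of this classification rule (the language choice and the example numbers are mine, not from the slides):

```python
# Classification rule W*F > 0, with F already augmented by f0 = -1
# so the threshold w0 is handled as just another weight.

def predict(weights, features):
    """Return +1 if the inner product W*F is positive, else -1."""
    score = sum(w * f for w, f in zip(weights, features))
    return +1 if score > 0 else -1

# Example: W = <w0, w1, w2> = <0.5, 1.0, -2.0>, raw example <f1, f2> = <3.0, 0.5>
print(predict([0.5, 1.0, -2.0], [-1, 3.0, 0.5]))   # -0.5 + 3.0 - 1.0 = 1.5 > 0, prints 1
```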

  4. Augmentation Trick • Suppose the data have features f1 and f2. • The classifier is 2*f1 + 3*f2 > 4. • Equivalently: <4,2,3> * <-1,f1,f2> > 0. • Mapping the data <f1,f2> to <-1,f1,f2> lets the threshold be learned and represented as just another feature weight. • Mapping data into higher dimensions is the key idea behind SVMs.
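A quick check of the augmentation trick in Python, using the slide's <4,2,3> example (the particular test points are my own):

```python
# 2*f1 + 3*f2 > 4  is the same test as  <4,2,3> . <-1,f1,f2> > 0.

def raw_rule(f1, f2):
    return 2 * f1 + 3 * f2 > 4

def augmented_rule(f1, f2):
    w = [4, 2, 3]       # the threshold 4 folded in as w0
    f = [-1, f1, f2]    # the example mapped to <-1, f1, f2>
    return sum(wi * fi for wi, fi in zip(w, f)) > 0

for f1, f2 in [(0, 0), (1, 1), (2, 0), (0, 2)]:
    assert raw_rule(f1, f2) == augmented_rule(f1, f2)
```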

  5. Mapping to enable Linear Separation • Let x1, …, xm be m vectors in R^n. • Map each xi into R^(n+m) by xi -> <xi, 0, .., 1, .., 0>, with the 1 in position n+i. • For any labelling of the xi by classes +/-, this embedding makes the data linearly separable. • Define wi = 0 for i = 1, .., n. • Define w(n+i) = +1 if xi is positive, and w(n+i) = -1 if xi is negative.
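A small numpy sketch of this embedding and of the separating weights it guarantees (numpy and the sample points are my own choices):

```python
import numpy as np

def embed(X):
    """Append an identity block: point i gets a 1 in position n+i, 0 elsewhere."""
    m, n = X.shape
    return np.hstack([X, np.eye(m)])

def separating_weights(X, y):
    """w = 0 on the first n coordinates, w[n+i] = label (+1/-1) of point i."""
    m, n = X.shape
    return np.concatenate([np.zeros(n), y.astype(float)])

# Any labelling of the points is separable after the embedding:
X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
y = np.array([+1, -1, +1])
print(np.sign(embed(X) @ separating_weights(X, y)))   # [ 1. -1.  1.], matching y
```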

  6. Representational Power • “Or” of n features: • wi = 1, threshold = 0. • “And” of n features: • wi = 1, threshold = n - 1. • K of n features (prototype): • wi = 1, threshold = k - 1. • Can’t do XOR. • Combining linear threshold units yields any boolean function.
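These threshold units are easy to write down explicitly; a short Python sketch over 0/1 features (the test cases are my own):

```python
# OR, AND, and "at least k of n" as single threshold units:
# all weights are 1, only the threshold changes.

def threshold_unit(features, threshold):
    return sum(features) > threshold        # all weights equal 1

def f_or(features):
    return threshold_unit(features, 0)      # threshold 0

def f_and(features):
    return threshold_unit(features, len(features) - 1)   # threshold n - 1

def k_of_n(features, k):
    return threshold_unit(features, k - 1)  # threshold k - 1

assert f_or([0, 0, 1]) and not f_or([0, 0, 0])
assert f_and([1, 1, 1]) and not f_and([1, 0, 1])
assert k_of_n([1, 1, 0, 0], 2) and not k_of_n([1, 0, 0, 0], 2)
```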

  7. Classical Perceptron • Goal: any W which separates the data. • Algorithm (each X is augmented with the bias feature): • W = 0 • Repeat: • If X is positive and W*X is wrong, W = W + X; • else if X is negative and W*X is wrong, W = W - X. • Until there are no errors, or a very large number of passes have been made.
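A minimal numpy sketch of this perceptron loop (the OR data used for the demo, and the epoch cap, are my own choices):

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """X: examples already augmented with the bias feature; y: labels in {+1, -1}."""
    w = np.zeros(X.shape[1])                    # W = 0
    for _ in range(max_epochs):                 # repeat ...
        errors = 0
        for xi, yi in zip(X, y):
            pred = 1 if w @ xi > 0 else -1
            if pred != yi:                      # W*X wrong on this example
                w = w + yi * xi                 # +X for positives, -X for negatives
                errors += 1
        if errors == 0:                         # ... until no errors
            break
    return w

# Usage on the OR function, bias feature -1 prepended to each example:
X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])
w = perceptron(X, y)
print([1 if w @ xi > 0 else -1 for xi in X])    # [-1, 1, 1, 1]
```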

  8. Classical Perceptron • Theorem: if the concept is linearly separable, then the algorithm finds a solution. • Training time can be exponential in the number of features. • An epoch is a single pass through the entire data set. • Convergence can take exponentially many epochs, but it is guaranteed to work. • If |xi| < R and the margin is m, then the number of mistakes is < R^2/m^2.
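A worked instance of the bound, with illustrative numbers of my own:

```python
# If every example satisfies |xi| <= R = 10 and the separator has margin m = 0.5,
# the perceptron makes fewer than R^2 / m^2 = (10 / 0.5)^2 = 400 mistakes.
R, m = 10.0, 0.5
print(R**2 / m**2)   # 400.0
```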

  9. Hill-Climbing Search • This is an optimization problem. • The solution is found by hill-climbing, so there is no guarantee of finding the optimal solution. • While the derivatives tell you the direction to move (the negative gradient), they do not tell you how much to change each weight wj (the step size). • On the plus side, it is fast. • On the minus side, there is no guarantee of separation.

  10. Hill-climbing View • Goal: minimize the squared error. • Let the class yi be +1 or -1. • Let Erri = W*Xi - yi, where Xi is the ith example; the squared error is the sum of Erri^2 over the examples. • This is a function only of the weights. • Use calculus: take partial derivatives with respect to wj. • To move to a lower value, move in the direction of the negative gradient, i.e. • the change in wj is proportional to -2*Erri*Xij, summed over the examples.
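A short numpy sketch of this gradient-descent view (the step size eta, the step count, and the OR data are my own choices):

```python
import numpy as np

def gradient_descent(X, y, eta=0.01, steps=1000):
    """Minimize sum_i (W.Xi - yi)^2 by repeated steps along the negative gradient."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        err = X @ w - y           # per-example errors W.Xi - yi
        grad = 2 * X.T @ err      # partial derivatives of the squared error wrt each wj
        w -= eta * grad           # move in the direction of the negative gradient
    return w

X = np.array([[-1, 0, 0], [-1, 0, 1], [-1, 1, 0], [-1, 1, 1]], dtype=float)
y = np.array([-1.0, 1.0, 1.0, 1.0])
w = gradient_descent(X, y)
print(np.sign(X @ w))             # [-1.  1.  1.  1.]: the fit separates the OR data
```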

  11. Support Vector Machine • Goal: maximize the margin. • Assuming a line separates the data, the margin is the distance from the line to the closest example, whether positive or negative. • Good news: this can be solved by a quadratic program. • Implemented in Weka as SMO. • If the data are not linearly separable, the SVM will add more features.
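A sketch of a maximum-margin linear classifier using scikit-learn's SVC (my own substitution for Weka's SMO; a large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C: (nearly) hard margin on separable data
clf.fit(X, y)
print(clf.predict(X))               # [-1  1  1  1]
print(clf.support_vectors_)         # the examples closest to the line fix the margin
```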

  12. If not Linearly Separable • Add more nodes: neural nets • Can represent any boolean function: why? • No guarantees about learning • Slow • Incomprehensible • Add more features: SVM • Can represent any boolean function • Learning guarantees • Fast • Semi-comprehensible

  13. Adding features • Suppose a point (x,y) is positive if it lies in the unit disk, and negative otherwise. • Clearly not linearly separable. • Map (x,y) -> (x, y, x^2 + y^2). • Now, in 3-space, the data are easily separable. • This works for any learning algorithm, but an SVM will almost do it for you (set the kernel parameters).
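A numpy sketch of the unit-disk example (the random sample and the particular separating plane are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.uniform(-2, 2, size=(200, 2))
labels = np.where(pts[:, 0]**2 + pts[:, 1]**2 < 1, 1, -1)   # + inside the unit disk

# Map (x, y) -> (x, y, x^2 + y^2): the extra feature makes the disk a linear test.
mapped = np.hstack([pts, (pts**2).sum(axis=1, keepdims=True)])

# In 3-space the plane z = 1 separates the classes: predict + when 1 - z > 0.
w = np.array([0.0, 0.0, -1.0])
preds = np.where(mapped @ w + 1.0 > 0, 1, -1)
print((preds == labels).all())   # True
```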
