Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology
Mark Hasegawa-Johnson, jhasegaw@uiuc.edu
University of Illinois at Urbana-Champaign, USA
Lecture 4: Hyperplanes, Perceptrons, and Kernel-Based Classifiers
• Definition: Hyperplane Classifier
• Minimum Classification Error Training Methods
  • Empirical risk
  • Differentiable estimates of the 0-1 loss function
  • Error backpropagation
• Kernel Methods
  • Nonparametric expression of a hyperplane
  • Mathematical properties of a dot product
  • Kernel-based classifier
  • The implied high-dimensional space
  • Error backpropagation for a kernel-based classifier
• Useful kernels
  • Polynomial kernel
  • RBF kernel
Hyperplane Classifier
[Figure: points from two classes on either side of the class boundary ("separatrix"), the plane wᵀx = b; w is the normal vector, and b is the distance from the origin (x = 0) to the plane.]
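As a minimal sketch of this decision rule (the weight vector, offset, and test points below are made-up values, not from the lecture):

```python
import numpy as np

# Hyperplane classifier: the label is the side of the separatrix
# (the plane w.x = b) on which the point falls, i.e., sign(w.x - b).
w = np.array([1.0, 2.0])   # normal vector (made-up values)
b = 0.5                    # offset from the origin (made-up value)

def classify(x):
    """Return +1 or -1 according to the side of the plane w.x = b."""
    return 1 if np.dot(w, x) - b > 0 else -1

print(classify(np.array([1.0, 1.0])))    # +1
print(classify(np.array([-1.0, -1.0])))  # -1
```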
Empirical Risk with 0-1 Loss Function = Error Rate on Training Data
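In symbols: with the 0-1 loss ℓ(h(x), y) = 1 when h(x) ≠ y and 0 otherwise, the empirical risk is R_emp = (1/n) Σᵢ ℓ(h(xᵢ), yᵢ), i.e., the fraction of training tokens the classifier gets wrong. A minimal sketch with toy data (made-up values):

```python
import numpy as np

def empirical_risk_01(h, X, y):
    """Average 0-1 loss of hypothesis h on training pairs (X, y):
    the fraction of training tokens that h labels incorrectly."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Toy data: the hyperplane classifier from the previous sketch.
w, b = np.array([1.0, 2.0]), 0.5
h = lambda x: 1 if np.dot(w, x) - b > 0 else -1
X = np.array([[1.0, 1.0], [-1.0, -1.0], [0.0, 0.0]])
y = np.array([1, -1, 1])
print(empirical_risk_01(h, X, y))  # 1/3: one of three tokens is wrong
```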
Differentiable Approximations of the 0-1 Loss Function: Hinge Loss
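As a minimal sketch of the hinge loss and how it upper-bounds the 0-1 loss (here y is the ±1 label and g = wᵀx − b is the hyperplane output; the values are illustrative):

```python
def loss_01(y, g):
    """0-1 loss: 1 if the sign of g disagrees with the label y (+1/-1)."""
    return float(y * g <= 0)

def hinge_loss(y, g):
    """Hinge loss max(0, 1 - y*g): zero only when the point lies on the
    correct side of the plane with margin at least 1; otherwise it grows
    linearly, so it has a usable (sub)gradient for training."""
    return max(0.0, 1.0 - y * g)

for g in [-2.0, -0.5, 0.5, 2.0]:
    print(g, loss_01(+1, g), hinge_loss(+1, g))
# As g crosses 0 the 0-1 loss drops abruptly from 1 to 0, while the
# hinge loss decreases linearly and reaches 0 at g = +1.
```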
Error Backpropagation: Hyperplane Classifier with Sigmoidal Loss
Sigmoidal Classifier = Hyperplane Classifier with Fuzzy Boundaries
[Figure: the same scatter of points, but the hard boundary is replaced by a gradient; the class score fades smoothly from "more red" through "less red" and "less blue" to "more blue" across the boundary.]
Error Backpropagation: Sigmoidal Classifier with Absolute Loss
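As a minimal sketch of error backpropagation for the single-layer sigmoidal classifier: gradient descent on the absolute loss |y − h(x)| from this slide, with the gradient passed back through the sigmoid by the chain rule. The training data, learning rate, and epoch count are made-up values:

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

# Toy 2-D training data with targets y in {0, 1} (made-up values).
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.zeros(2)   # connection weights
b = 0.0           # offset
eta = 0.5         # learning rate (made-up value)

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        g = np.dot(w, x_i) - b     # forward pass: sigmoid input g(x)
        h = sigmoid(g)             # forward pass: hypothesis h(x)
        # Backward pass (chain rule):
        # d|y-h|/dh = -sign(y-h);  dh/dg = h(1-h);  dg/dw = x;  dg/db = -1
        dL_dg = -np.sign(y_i - h) * h * (1.0 - h)
        w -= eta * dL_dg * x_i
        b -= eta * dL_dg * (-1.0)

print(w, b)
# Outputs should approach the targets 1, 1, 0, 0:
print([round(sigmoid(np.dot(w, x) - b), 2) for x in X])
```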
Sigmoidal Classifier: Signal Flow Diagram
[Figure: signal flow diagram. The input x = (x1, x2, x3) is weighted by the connection weights w = (w1, w2, w3) and summed to give the sigmoid input g(x); the sigmoid's output is the hypothesis h(x).]
Multilayer Perceptron
[Figure: two-layer signal flow diagram. The input h0(x) ≡ x = (x1, x2, x3) feeds a first layer of connection weights and biases b11, b12, b13, giving sigmoid inputs g1(x) and sigmoid outputs h1(x); these feed a second layer of connection weights and bias b21, giving the sigmoid input g2(x) and the hypothesis h2(x).]
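As a minimal sketch of the forward pass this diagram depicts, with the same shapes as the figure (three inputs, three first-layer sigmoid units, one output); all weight, bias, and input values are made up:

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

# Layer 1: W1[j, k] connects input k to hidden unit j (made-up values).
W1 = np.array([[ 0.5, -0.2,  0.1],
               [ 0.3,  0.8, -0.5],
               [-0.6,  0.4,  0.7]])
b1 = np.array([0.1, -0.2, 0.0])   # biases b11, b12, b13 (made-up)

# Layer 2: one output unit fed by the three hidden outputs.
W2 = np.array([1.0, -1.5, 0.5])   # made-up connection weights
b2 = 0.2                          # bias b21 (made-up)

def mlp(x):
    h0 = x                 # input layer: h0(x) ≡ x
    g1 = W1 @ h0 + b1      # sigmoid inputs g1(x)
    h1 = sigmoid(g1)       # sigmoid outputs h1(x)
    g2 = W2 @ h1 + b2      # sigmoid input g2(x)
    return sigmoid(g2)     # hypothesis h2(x)

print(mlp(np.array([1.0, 2.0, 3.0])))
```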
Output of Multilayer Perceptron is an Approximation of Posterior Probability
When the MLP is trained with 0/1 targets (e.g., by minimizing mean-squared error against the labels), its output h(x) approximates the posterior probability p(y|x).
Polynomial Kernel: Separatrix (Boundary Between Two Classes) is a Polynomial Surface
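As a minimal sketch of a kernel-based classifier with the polynomial kernel K(x, z) = (1 + xᵀz)^d: the hyperplane's normal vector is expressed nonparametrically as a weighted sum over training points, so h(x) = Σᵢ aᵢ yᵢ K(xᵢ, x) − b, and the separatrix h(x) = 0 is a polynomial surface of order d. The dual coefficients below are made-up values, not a trained model:

```python
import numpy as np

def poly_kernel(x, z, d=2):
    """Polynomial kernel of order d: K(x, z) = (1 + x.z)^d."""
    return (1.0 + np.dot(x, z)) ** d

# Nonparametric hyperplane: h(x) = sum_i a_i y_i K(x_i, x) - b.
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
y_train = np.array([1, 1, -1, -1])
alpha = np.array([0.5, 0.5, 0.5, 0.5])   # made-up dual coefficients
b = 0.0

def classify(x, d=2):
    s = sum(a * yi * poly_kernel(xi, x, d)
            for a, yi, xi in zip(alpha, y_train, X_train))
    return np.sign(s - b)

print(classify(np.array([0.5, 0.5])))   # +1: nearer the positive points
```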
Classification Boundaries Available from a Polynomial Kernel (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
RBF Classifier Can Represent Any Classifier Boundary (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
[Figure: two RBF decision boundaries. One panel shows more training corpus errors but a smoother boundary; the other shows fewer training corpus errors but a wigglier boundary.]
In these figures, C was adjusted, not g, but a similar effect can be achieved by setting N << M and adjusting g.
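As a minimal sketch of the RBF kernel K(x, z) = exp(−g·‖x − z‖²) and the effect described above: a small g makes each center's influence broad (smoother separatrix), while a large g makes it local (wigglier separatrix). The values below are made up:

```python
import numpy as np

def rbf_kernel(x, z, g):
    """RBF kernel: K(x, z) = exp(-g * ||x - z||^2)."""
    return np.exp(-g * np.sum((x - z) ** 2))

center = np.array([0.0, 0.0])
for g in [0.1, 1.0, 10.0]:
    # Small g: the kernel decays slowly with distance (smooth boundary).
    # Large g: nearly zero away from the center (local influence, so a
    # classifier built from N < M such centers can be made wigglier).
    print(g, [round(rbf_kernel(center, np.array([d, 0.0]), g), 3)
              for d in [0.5, 1.0, 2.0]])
```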
Summary
• Classifier definitions
  • Classifier = a function from x into y
  • Loss = the cost of a mistake
  • Risk = the expected loss
  • Empirical Risk = the average loss on training data
• Multilayer Perceptrons
  • Sigmoidal classifier is similar to a hyperplane classifier with a sigmoidal loss function
  • Train using error backpropagation
  • With two hidden layers, an MLP can model any boundary (it is a "universal approximator")
  • MLP output is an estimate of p(y|x)
• Kernel Classifiers
  • Equivalent to: (1) project into f(x), (2) apply a hyperplane classifier
  • Polynomial kernel: separatrix is a polynomial surface of order d
  • RBF kernel: separatrix can be any surface (the RBF classifier is also a "universal approximator")
  • RBF kernel: if N < M, g can adjust the "wiggliness" of the separatrix