Classification: Linear Models Prof. Navneet Goyal CS & IS BITS, Pilani
Classification • By now, you are well aware of the classification problem • Assign an input vector x to one of K discrete, disjoint classes Ck • Overlapping classes lead to multi-label classification, which has many applications: • Classifying news articles • Classifying research articles • Medical diagnosis • Input space is partitioned into decision regions whose boundaries are called decision boundaries or decision surfaces • Linear models for classification: • Decision surfaces are linear functions of the input vector x • They are (D−1)-dimensional hyperplanes within the D-dimensional input space • Data sets whose classes can be separated exactly by linear decision surfaces are said to be linearly separable
Classification • In regression, the target variable t was simply a vector of real numbers whose values we wish to predict • In classification, there are various ways of using t to represent class labels • Binary representation is most convenient for probabilistic models • Two classes: t ∈ {0, 1}, where t = 1 represents C1 and t = 0 represents C2 • Interpret the value of t as the probability that the class is C1, with the probabilities taking only the extreme values 0 and 1 • For K > 2, use a 1-of-K coding scheme • E.g., for K = 5, a pattern of class 2 has target vector t = (0, 1, 0, 0, 0)T • Value of tk is interpreted as the probability of class Ck
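A minimal sketch of the 1-of-K (one-hot) coding described above, assuming NumPy and zero-based class indices (the function name `one_hot` is just illustrative):

```python
import numpy as np

def one_hot(labels, K):
    """Encode integer class labels 0..K-1 as 1-of-K target vectors."""
    labels = np.asarray(labels)
    T = np.zeros((labels.size, K))
    T[np.arange(labels.size), labels] = 1.0
    return T

# K = 5: a pattern of the second class (index 1) gets t = (0, 1, 0, 0, 0)^T
print(one_hot([1], K=5))   # [[0. 1. 0. 0. 0.]]
```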
Classification • Three different approaches: • Simplest: construct a discriminant function that directly assigns each vector x to a specific class • Model the conditional probability p(Ck|x) in an inference stage, and then use this distribution to make optimal decisions • What are the benefits of separating inference & decision? • Two ways of doing this: • Option 1: Find p(x|Ck) and the prior class probabilities p(Ck), and apply Bayes’ theorem • Option 2: Find the joint probability distribution p(x, Ck) directly and then normalize to obtain the posterior probabilities
Classification Three different approaches • Discriminative Approach • Direct mapping from inputs x to one of the classes • No direct attempt to model either the class-conditional or posterior class probabilities • Examples include perceptrons, discriminant functions, SVMs • Regression Approach • Class-conditional Approach
Classification Three different approaches • Discriminative Approach • Regression Approach • Posterior class probabilities p(Ck|x) are modeled explicitly • For prediction, the maximum of these probabilities (possibly weighted by a cost function) is chosen • Logistic regression • Decision trees • Discriminative – if the tree only provides the predicted class at each leaf • Regression – if, in addition, the tree provides a posterior class probability distribution at each leaf • Class-conditional Approach
Classification Three different approaches • Discriminative Approach • Regression Approach • Class-conditional Approach • Class-conditional distributions p(x|Ck, θk) are modeled explicitly and, along with estimates of p(Ck), are inverted via Bayes’ rule to arrive at p(Ck|x) for each class Ck; the maximum is picked • θk are unknown parameters governing the characteristics of class Ck
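A minimal sketch of the class-conditional route, assuming Gaussian class-conditional densities (the slides do not fix a particular form; the function name `posteriors` is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, means, covs, priors):
    """Class-conditional approach: model p(x|Ck) (here assumed Gaussian) and
    the priors p(Ck), then invert via Bayes' rule to obtain p(Ck|x)."""
    likelihoods = np.array([multivariate_normal.pdf(x, mean=m, cov=c)
                            for m, c in zip(means, covs)])
    joint = likelihoods * priors        # p(x, Ck) = p(x|Ck) p(Ck)
    return joint / joint.sum()          # normalize to get p(Ck|x)

# Toy example: two 2-D Gaussian classes with equal priors
means  = [np.zeros(2), np.array([2.0, 2.0])]
covs   = [np.eye(2), np.eye(2)]
priors = np.array([0.5, 0.5])
p = posteriors(np.array([1.8, 1.9]), means, covs, priors)
print(p, "-> predicted class:", p.argmax())
```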
Discriminant Functions • Definition • Linear discriminant functions • 2-class problem & extension to K>2 classes • Methods to learn parameters • Least squares classification • Fisher’s linear discriminant • Perceptrons
Discriminant Functions • Definition: A discriminant is a function that takes an input vector x and assigns it to one of K classes • Linear discriminant functions: decision surfaces are hyperplanes • Searches for the linear combination of the variables that best separates the classes • A discriminative approach, since it does not explicitly estimate either the posterior class probabilities or the class-conditional distributions
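For reference, the two-class linear discriminant underlying the figures that follow has the standard form (following Bishop's notation), with w the weight vector and w0 the bias; the decision surface y(x) = 0 is a (D−1)-dimensional hyperplane:

```latex
y(\mathbf{x}) = \mathbf{w}^{\mathsf{T}}\mathbf{x} + w_0,
\qquad
\text{assign } \mathbf{x} \text{ to } C_1 \text{ if } y(\mathbf{x}) \ge 0,
\text{ and to } C_2 \text{ otherwise.}
```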
Discriminant Functions Linear Discriminant Function with K=2 Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Discriminant Functions 2-class Linear Discriminant Functions for K=3 (2 binary DFs): 1 vs. rest approach Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Discriminant Functions 2-class Linear Discriminant Functions for K>2 (K(K−1)/2 binary DFs): 1 vs. 1 approach Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Discriminant Functions • Consider a single K-class DF comprising K linear functions of the form yk(x) = wkᵀx + wk0 • Assign a point x to class Ck if yk(x) > yj(x) for all j ≠ k • Decision boundary between Ck & Cj is therefore given by yk(x) = yj(x), and hence corresponds to a (D−1)-dimensional hyperplane: (wk − wj)ᵀx + (wk0 − wj0) = 0, which has the same form as the decision surface for the 2-class problem
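A minimal NumPy sketch of this decision rule, assuming the K weight vectors (with their biases in the first row) are stacked into a (D+1) × K matrix W; the function name is illustrative:

```python
import numpy as np

def predict_classes(X, W):
    """K-class linear discriminant: y_k(x) = w_k^T x + w_k0.
    X is (N, D); W is (D+1, K) with the biases w_k0 in the first row.
    Each x is assigned to the class with the largest y_k(x)."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend dummy input x0 = 1
    Y = Xb @ W                                      # (N, K) discriminant values
    return Y.argmax(axis=1)
```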
Discriminant Functions Decision regions are always singly connected and convex! Decision Regions for a Multi-class Linear DF Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Parameter Learning in Linear DFs • Three Methods • Least Squares • Fisher’s Linear Discriminant • Perceptrons • Simple but have several disadvantages
Least Squares for Classification • In linear regression we saw that minimization of a sum-of-squares error function leads to a simple closed-form solution for the parameter values • Can we apply the same formalism to classification problems? • K-class classification problem • 1-of-K binary coding for target vector t
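A hedged sketch of the closed-form least-squares solution for classification, using the pseudo-inverse of the bias-augmented input matrix and 1-of-K targets (function names are illustrative; this is the approach whose outlier sensitivity is shown on the next slides):

```python
import numpy as np

def fit_least_squares(X, labels, K):
    """Fit W by minimizing the sum-of-squares error against 1-of-K targets:
    W = pinv(X~) T, where X~ is the input matrix augmented with a bias column."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # augmented inputs
    T = np.eye(K)[np.asarray(labels)]               # 1-of-K target matrix
    return np.linalg.pinv(Xb) @ T                   # closed-form solution

def classify(X, W):
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (Xb @ W).argmax(axis=1)                  # pick the largest output
```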
Least Squares Estimation Least squares is highly sensitive to outliers (compare the least-squares and logistic-regression decision boundaries) Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Least Squares – 3 Classes Least squares vs. logistic regression on a three-class problem Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Dimensionality Reduction • Reducing the number of random variables under consideration. • A technique for simplifying a high-dimensional data set by reducing its dimension for analysis. • Projection of high-dimensional data to a low-dimensional space that preserves the “important” characteristics of the data.
Dimensionality Reduction • One approach to deal with high dimensional data is by reducing their dimensionality. • Project high dimensional data onto a lower dimensional sub-space using linear or non-linear transformations.
Why Reduce Dimensionality? • In most learning algorithms, the complexity depends on • Dimensionality • Size of the data sample • To reduce memory and computation requirements, we are interested in reducing the dimensionality of the problem • We need to guard against loss of information!
Why Reduce Dimensionality? • Reduces time complexity: less computation • Reduces space complexity: fewer parameters • Saves the cost of observing the features • Simpler models are more robust on small datasets • More interpretable; simpler explanation • Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions
Methods for Dimensionality Reduction • Two main methods • Feature Selection • Feature Extraction
Methods for Dimensionality Reduction • Feature selection: choosing k < d important features, ignoring the remaining d − k (subset selection algorithms) • Forward Selection (+) • Backward Selection (−) • Feature extraction: project the original d dimensions xi, i = 1,...,d, onto new k < d dimensions zj, j = 1,...,k (linear projection methods) • Supervised • Fisher’s Linear Discriminant • Hidden Layers of NN • Unsupervised • PCA • SVD
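A minimal sketch of greedy forward selection, assuming a user-supplied `score(X_subset, y)` function (e.g., cross-validated accuracy, higher is better); the helper name is illustrative:

```python
def forward_selection(X, y, k, score):
    """Greedy forward selection: start from the empty set and repeatedly add
    the single feature whose inclusion gives the best score, until k features
    are chosen. Backward selection works analogously by removing features."""
    selected, remaining = [], list(range(X.shape[1]))
    for _ in range(k):
        best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
        selected.append(best)
        remaining.remove(best)
    return selected
```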
Methods for Dimensionality Reduction • Principal Component Analysis (PCA) (wait till Friday) • Best at representing the data • Linear Discriminant Analysis (Fisher’s) (today) • Best at discriminating the data • Singular Value Decomposition (SVD) (self study) • Factor Analysis (self study)
Principal Component Analysis (PCA) • Dimensionality reduction implies information loss!! • Each dimensionality reduction technique finds an appropriate transformation by satisfying certain criteria (e.g., information loss, data discrimination, etc.) • PCA preserves as much of the data’s variance (information) as possible
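A minimal PCA sketch via eigendecomposition of the sample covariance matrix (one of several equivalent routes; an SVD of the centred data matrix would do the same job):

```python
import numpy as np

def pca(X, k):
    """Project the centred data onto the k directions of largest variance."""
    Xc = X - X.mean(axis=0)                          # centre the data
    cov = np.cov(Xc, rowvar=False)                   # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)           # ascending eigenvalues
    W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]    # top-k principal directions
    return Xc @ W                                    # (N, k) projected data
```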
Linear Discriminant Analysis (LDA) • What is the goal of LDA? • Perform dimensionality reduction “while preserving as much of the class discriminatory information as possible” • Seeks to find directions along which the classes are best separated • Takes into consideration not only the scatter within classes but also the scatter between classes
Fisher’s Linear Discriminant: Basic Idea Histograms resulting from projection onto the line joining the class means (left) vs. Fisher’s linear discriminant (right plot), showing greatly improved class separation Figure taken from Bishop CM’s book – Pattern recognition & ML, Springer 2006.
Fisher’s Linear Discriminant • Linear classification model can be viewed in terms of dimensionality reduction • 2-class problem in D-dimensional space • Projection of a vector x onto a unit vector w: y = wᵀx • Geometric interpretation: from the training set we want to find a direction w along which the separation between the projections of the class means is large and the overlap between the projected classes is small
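A minimal two-class sketch of Fisher's criterion: choose w proportional to S_W⁻¹(m2 − m1), where S_W is the within-class scatter matrix (function and variable names are illustrative):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant for two classes: maximize the separation
    of the projected class means relative to the within-class scatter,
    giving w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)  # within-class scatter
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)        # unit direction; project with y = w^T x
```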
Fisher’s Linear Discriminant • For a 2-class problem FLD is a special case of least squares • Show it!