
Review


Presentation Transcript


  1. Review Rong Jin

  2. Comparison of Different Classification Models • The goal of all classifiers • Predicting class label y for an input x • Estimate p(y|x)

  3. K Nearest Neighbor (kNN) Approach • Probability interpretation: estimate p(y|x) as the fraction of the k nearest neighbors of x that carry label y • (Figure: neighborhoods for k=1 and k=4)
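
The estimate described on this slide can be written out directly. Below is a minimal NumPy sketch (not from the original slides): p(y|x) is taken as the fraction of the k nearest training points that carry label y; the Euclidean distance and the toy data are assumptions made for the example.

import numpy as np

def knn_posterior(x, X_train, y_train, k):
    """Estimate p(y|x) as the fraction of the k nearest neighbors carrying label y."""
    dists = np.linalg.norm(X_train - x, axis=1)      # Euclidean distance to every training point
    nn_labels = y_train[np.argsort(dists)[:k]]       # labels of the k closest points
    return {c: float(np.mean(nn_labels == c)) for c in np.unique(y_train)}

# toy usage with two well-separated 2-D classes
X = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_posterior(np.array([0.5, 0.5]), X, y, k=3))   # {0: 1.0, 1: 0.0}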

  4. K Nearest Neighbor Approach (KNN) • What is the appropriate size for the neighborhood N(x)? • Leave-one-out approach • Weighted K nearest neighbor • The neighborhood is defined through a weight function • Estimate p(y|x) • How to estimate the appropriate value of the width σ²?
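
The slide leaves the weight function unspecified; a Gaussian (RBF) weight with width parameter σ² is a common choice and is assumed in the sketch below, which turns the hard k-neighborhood into a soft, weighted vote.

import numpy as np

def weighted_knn_posterior(x, X_train, y_train, sigma2):
    """Soft kNN: estimate p(y|x) with weights w_i = exp(-||x - x_i||^2 / (2 * sigma2))."""
    w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2.0 * sigma2))
    # normalized weighted vote, so the estimates sum to one over the classes
    return {c: float(w[y_train == c].sum() / w.sum()) for c in np.unique(y_train)}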

  7. Weighted K Nearest Neighbor • Leave-one-out + maximum likelihood • Estimate the leave-one-out probability • Leave-one-out likelihood of the training data • Search for the optimal σ² by maximizing the leave-one-out likelihood
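
A sketch of the leave-one-out maximum-likelihood selection of σ² described above. A plain grid search is used here instead of a gradient-based search, and the Gaussian weight function is the same assumption as before.

import numpy as np

def loo_log_likelihood(X, y, sigma2, eps=1e-12):
    """Sum over i of log p(y_i | x_i), where p is estimated from all points except x_i."""
    ll = 0.0
    for i in range(len(X)):
        w = np.exp(-np.sum((X - X[i]) ** 2, axis=1) / (2.0 * sigma2))
        w[i] = 0.0                                    # leave x_i out
        ll += np.log(w[y == y[i]].sum() / (w.sum() + eps) + eps)
    return ll

def select_sigma2(X, y, grid=(0.01, 0.1, 0.5, 1.0, 5.0, 10.0)):
    """Pick the width that maximizes the leave-one-out likelihood of the training data."""
    return max(grid, key=lambda s2: loo_log_likelihood(X, y, s2))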

  9. Gaussian Generative Model • p(y|x) ∝ p(x|y) p(y): posterior = likelihood × prior • Estimate p(x|y) and p(y) • Allocate a separate set of parameters for each class: θ = {θ1, θ2, …, θc} • p(x|y; θ) = p(x; θy) • Maximum likelihood estimation
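
A sketch of the model on this slide: fit one Gaussian p(x|y) = N(μy, Σy) per class together with the prior p(y) by maximum likelihood, then obtain p(y|x) by Bayes' rule. The small ridge added to each covariance is an assumption made for numerical stability.

import numpy as np

def fit_gaussian_generative(X, y):
    """MLE of the class prior p(y) and the class-conditional Gaussian N(mu_y, Sigma_y)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),                       # prior p(y = c)
                     Xc.mean(axis=0),                        # mu_c
                     np.cov(Xc, rowvar=False, bias=True)     # Sigma_c (MLE: divide by N_c)
                     + 1e-6 * np.eye(X.shape[1]))            # small ridge for invertibility
    return params

def gaussian_posterior(x, params):
    """p(y|x) ∝ p(x|y) p(y), computed in log space for stability."""
    def log_joint(prior, mu, Sigma):
        d = x - mu
        return (np.log(prior) - 0.5 * np.linalg.slogdet(Sigma)[1]
                - 0.5 * d @ np.linalg.solve(Sigma, d))
    logs = {c: log_joint(*p) for c, p in params.items()}
    m = max(logs.values())
    Z = sum(np.exp(v - m) for v in logs.values())
    return {c: float(np.exp(v - m) / Z) for c, v in logs.items()}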

  11. Gaussian Generative Model • Difficult to estimate p(x|y) if x is of high dimensionality • Naïve Bayes: essentially a linear model • How to make a Gaussian generative model discriminative? • The (μm, Σm) of each class are estimated only from the data belonging to that class → lack of discriminative power

  12. Gaussian Generative Model • Maximum likelihood estimation • How to optimize this objective function?

  13. Gaussian Generative Model • Bound optimization algorithm

  14. Gaussian Generative Model • We have decomposed the interaction of parameters between different classes • Question: how to handle x with multiple features?

  15. Logistic Regression Model • A linear decision boundary: wx+b • A probabilistic model p(y|x) • Maximum likelihood approach for estimating weights w and threshold b
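
A minimal sketch of the maximum-likelihood estimation mentioned on this slide, using batch gradient ascent on the log-likelihood; the learning rate, the iteration count, and the y ∈ {0, 1} label encoding are assumptions of the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, iters=1000):
    """Fit p(y=1|x) = sigmoid(w·x + b) by gradient ascent on the average log-likelihood."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = sigmoid(X @ w + b)
        w += lr * X.T @ (y - p) / len(y)   # gradient of the average log-likelihood w.r.t. w
        b += lr * np.mean(y - p)           # ... and w.r.t. b
    return w, b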

  16. Logistic Regression Model • Overfitting issue • Example: text classification • Words that appear in only one document will be assigned infinitely large weights • Solution: add a regularization term
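
The usual regularization term is an L2 penalty on the weights; the sketch below repeats the trainer above with that penalty, so a rare feature (e.g. a word seen in a single document) can no longer receive an unbounded weight. The penalty strength lam and the other hyperparameters are assumptions.

import numpy as np

def train_logistic_l2(X, y, lam=1.0, lr=0.1, iters=1000):
    """Maximize  sum_i log p(y_i|x_i) - (lam/2) * ||w||^2  by gradient ascent (y in {0, 1})."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * (X.T @ (y - p) / len(y) - lam * w)   # the extra -lam*w comes from the penalty
        b += lr * np.mean(y - p)
    return w, b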

  17. Non-linear Logistic Regression Model • Kernelize logistic regression model
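
One way to kernelize the model is to assume the representer form w = Σ_i α_i φ(x_i), so the prediction becomes p(y=1|x) = sigmoid(Σ_i α_i K(x_i, x) + b). The sketch below uses an RBF kernel, gradient ascent, and a simple penalty on α; all of these are assumptions of the example, not prescriptions from the slide.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """K[i, j] = exp(-gamma * ||A_i - B_j||^2)."""
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def train_kernel_logistic(X, y, gamma=1.0, lam=0.1, lr=0.1, iters=500):
    """p(y=1|x) = sigmoid(sum_i alpha_i K(x_i, x) + b), with y in {0, 1}."""
    K = rbf_kernel(X, X, gamma)
    alpha, b = np.zeros(len(X)), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(K @ alpha + b)))
        alpha += lr * (K @ (y - p) / len(y) - lam * alpha)   # penalized log-likelihood gradient
        b += lr * np.mean(y - p)
    return alpha, b   # predict: sigmoid(rbf_kernel(X, X_new, gamma).T @ alpha + b)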

  18. Non-linear Logistic Regression Model • Hierarchical Mixture Expert Model • Group linear classifiers into a tree structure • Products of the gating and expert outputs generate nonlinearity in the prediction function • (Figure: a gating node r(x) over a group layer with g1(x), g2(x) and an expert layer with m1,1(x), m1,2(x), m2,1(x), m2,2(x))
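
A sketch of the prediction function only (training by EM is not shown) for the two-group, two-experts-per-group tree drawn on the slide. The sigmoid gates and logistic experts are the usual choices, and every parameter name here is illustrative rather than taken from the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hme_predict(x, root_w, group_w, expert_w):
    """p(y=1|x) for a two-level HME with gate r(x), group gates g_i(x), experts m_{i,j}(x).

    root_w: weights of the root gate r(x); group_w[i]: gate weights inside group i;
    expert_w[i][j]: weights of expert j in group i.  The sum of products of gating
    probabilities and expert outputs is nonlinear in x even though every node is linear.
    """
    r = sigmoid(root_w @ x)                          # probability of routing x to group 1
    p = 0.0
    for i, gate in enumerate((r, 1.0 - r)):
        g = sigmoid(group_w[i] @ x)                  # probability of expert 1 within group i
        for j, sub in enumerate((g, 1.0 - g)):
            p += gate * sub * sigmoid(expert_w[i][j] @ x)
    return p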

  19. Non-linear Logistic Regression Model • It may be too rough to assume that all data points can be fitted by a single linear model • But it is usually appropriate to assume a local linear model • KNN can be viewed as a localized model without any parameters • Can we extend the KNN approach by introducing a localized linear model?

  20. Localized Logistic Regression Model • Similar to weighted KNN • Weight each training example by a weight function centered at the query point • Build a logistic regression model using the weighted examples
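
A sketch combining the weighted-kNN idea with the logistic trainer shown earlier: each training example's contribution to the log-likelihood is multiplied by a Gaussian weight centered at the query point. The weight function, learning rate, and label encoding are assumptions.

import numpy as np

def train_local_logistic(x0, X, y, sigma2=1.0, lr=0.1, iters=500):
    """Fit a logistic model around the query x0, weighting example i by
    exp(-||x_i - x0||^2 / (2 * sigma2)) in the log-likelihood (y in {0, 1})."""
    s = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2.0 * sigma2))   # per-example weights
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w += lr * X.T @ (s * (y - p)) / s.sum()
        b += lr * np.sum(s * (y - p)) / s.sum()
    return w, b   # classify x0 with sigmoid(w @ x0 + b)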

  22. Conditional Exponential Model • An extension of the logistic regression model to the multi-class case • A different set of weights wy and threshold by for each class y • Translation invariance: adding the same constant to every class score leaves p(y|x) unchanged
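
A sketch of the conditional exponential (softmax) model with one weight vector and one threshold per class; the gradient-ascent training loop and the hyperparameters are assumptions. The score shift inside the posterior illustrates the translation invariance noted on the slide.

import numpy as np

def softmax_posterior(x, W, b):
    """p(y|x) ∝ exp(w_y·x + b_y); adding a constant to every score leaves p(y|x) unchanged."""
    scores = W @ x + b
    scores -= scores.max()                  # harmless shift, by translation invariance
    e = np.exp(scores)
    return e / e.sum()

def train_softmax(X, y, n_classes, lr=0.1, iters=1000):
    """Maximum-likelihood training by gradient ascent; y holds integer class labels."""
    W, b = np.zeros((n_classes, X.shape[1])), np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                # one-hot targets
    for _ in range(iters):
        S = X @ W.T + b
        S -= S.max(axis=1, keepdims=True)
        P = np.exp(S)
        P /= P.sum(axis=1, keepdims=True)
        W += lr * (Y - P).T @ X / len(X)
        b += lr * (Y - P).mean(axis=0)
    return W, b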

  23. Maximum Entropy Model • Finding the simplest model that matches the data • Maximize entropy → prefer a uniform distribution • Constraints → enforce the model to be consistent with the observed data • Iterative scaling methods for optimization

  24. Support Vector Machine • Classification margin • Maximum margin principle: separate the data far away from the decision boundary • Two objectives: minimize the classification error over the training data, and maximize the classification margin • Support vectors: only the support vectors have an impact on the location of the decision boundary • (Figure: two classes of points, markers denoting +1 and −1, with the classification margin around the boundary)
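
As a small illustration of the margin and of which points act as support vectors, the sketch below takes a given linear boundary w·x + b and labels y in {−1, +1}, computes each point's geometric margin, and flags the points closest to the boundary; the data and the boundary are assumed to be supplied by the caller.

import numpy as np

def margins_and_support_vectors(X, y, w, b, tol=1e-6):
    """Geometric margins y_i * (w·x_i + b) / ||w|| and the indices attaining the minimum."""
    m = y * (X @ w + b) / np.linalg.norm(w)
    return m, np.where(np.isclose(m, m.min(), atol=tol))[0]   # the latter are the support vectors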

  26. Support Vector Machine • Separable case • Noisy case

  27. Support Vector Machine • Separable case • Noisy case • In either case, the optimization is a quadratic programming problem!

  28. Logistic Regression Model vs. Support Vector Machine • Logistic regression model • Support vector machine • The two objectives share identical regularization terms but use different loss functions for punishing mistakes
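
The contrast on this slide can be made concrete by evaluating both losses as functions of the same functional margin z = y(w·x + b) with y in {−1, +1}: logistic regression uses log(1 + e^(−z)) while the SVM uses the hinge max(0, 1 − z), and both objectives add the same ||w||² regularization term. The grid of z values below is only for illustration.

import numpy as np

def logistic_loss(z):
    """Logistic regression loss as a function of the margin z = y * (w·x + b)."""
    return np.log1p(np.exp(-z))

def hinge_loss(z):
    """Soft-margin SVM loss as a function of the same margin."""
    return np.maximum(0.0, 1.0 - z)

z = np.linspace(-2.0, 3.0, 6)
print(np.c_[z, logistic_loss(z), hinge_loss(z)])   # both losses punish small or negative margins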

  29. Logistic Regression Model vs. Support Vector Machine • Logistic regression differs from the support vector machine only in the loss function

  30. Kernel Tricks • Introducing nonlinearity into discriminative models • Diffusion kernel • A graph Laplacian L captures local similarity • The diffusion kernel propagates the local similarity information into a global one
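
A sketch of the diffusion kernel K = exp(−βL) built from the graph Laplacian L = D − A, computed here by eigendecomposition; the tiny chain graph and the value of β are illustrative assumptions.

import numpy as np

def diffusion_kernel(A, beta=1.0):
    """Matrix exponential exp(-beta * L) of the graph Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    vals, vecs = np.linalg.eigh(L)                       # L is symmetric
    return vecs @ np.diag(np.exp(-beta * vals)) @ vecs.T

# chain graph 0-1-2: nodes 0 and 2 are not adjacent, yet the diffusion kernel
# assigns them a nonzero similarity (local similarity propagated globally)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
print(diffusion_kernel(A, beta=0.5))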

  31. Fisher Kernel • Derive a kernel function from a generative model • Key idea: map a point x from the original input space into the model space • The similarity of two data points is measured in the model space • (Figure: original input space mapped into the model space, where similarity is measured)
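
A sketch of the Fisher-kernel idea for the simplest possible generative model, a single Gaussian N(μ, Σ) with μ as the parameter: map each point to its Fisher score ∇μ log p(x) = Σ⁻¹(x − μ) and measure similarity in that model space, normalized by the inverse Fisher information (which equals Σ for this model). The choice of generative model is an assumption made to keep the example short.

import numpy as np

def fisher_kernel_gaussian(x1, x2, mu, Sigma):
    """K(x1, x2) = score(x1)^T F^{-1} score(x2), with score(x) = Sigma^{-1}(x - mu) and F = Sigma^{-1}."""
    Sinv = np.linalg.inv(Sigma)
    s1, s2 = Sinv @ (x1 - mu), Sinv @ (x2 - mu)   # map both points into the model space
    return float(s1 @ Sigma @ s2)                 # F^{-1} = Sigma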

  32. Kernel Methods in Generative Models • Usually, kernels can be introduced into a generative model through a Gaussian process • Define a “kernelized” covariance matrix • It must be positive semi-definite, similar to Mercer’s condition
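
A small check of the positive semi-definiteness condition mentioned on this slide: a symmetric kernel or covariance matrix is PSD exactly when all of its eigenvalues are non-negative. The RBF construction of the example matrix is an assumption.

import numpy as np

def is_psd(K, tol=1e-10):
    """True if the (symmetrized) matrix has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh((K + K.T) / 2.0) >= -tol))

X = np.random.default_rng(0).normal(size=(5, 3))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # a "kernelized" covariance matrix
print(is_psd(K))   # True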

  33. Multi-class SVM • SVMs can only handle two-class outputs • One-against-all • Learn N SVMs • SVM 1 learns “Output==1” vs “Output != 1” • SVM 2 learns “Output==2” vs “Output != 2” • … • SVM N learns “Output==N” vs “Output != N”
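
A sketch of the one-against-all scheme using scikit-learn's LinearSVC as the binary learner; picking the class with the largest decision value at prediction time is the usual tie-breaking rule and is an assumption here rather than something stated on the slide.

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y):
    """Learn one binary SVM per class: 'Output == c' vs. 'Output != c'."""
    return {c: LinearSVC().fit(X, (y == c).astype(int)) for c in np.unique(y)}

def predict_one_vs_all(models, X):
    """Assign each row of X to the class whose SVM returns the largest decision value."""
    classes = sorted(models)
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]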

  34. Error-Correcting Output Code (ECOC) • Encode each class into a bit vector over code bits S1, S2, S3, S4 • (Figure: example code table for classes 1 and 2 and a test point x)
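
A sketch of ECOC decoding: each class is assigned a bit vector over the code bits S1–S4, one binary classifier is trained per bit, and a test point is assigned the class whose code is closest in Hamming distance to the predicted bits. The codebook below is purely illustrative.

import numpy as np

# illustrative codebook: one bit vector (over S1..S4) per class
CODES = {"class1": np.array([1, 1, 1, 0]),
         "class2": np.array([0, 1, 1, 1]),
         "class3": np.array([1, 0, 1, 1])}

def ecoc_decode(predicted_bits, codes=CODES):
    """Return the class whose code has the smallest Hamming distance to the predicted bits."""
    return min(codes, key=lambda c: int(np.sum(codes[c] != predicted_bits)))

print(ecoc_decode(np.array([1, 1, 0, 0])))   # one flipped bit still decodes to class1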

  35. Ordinal Regression • A special class of multi-class classification problems • There is a natural ordinal relationship between the classes • Maximum margin principle: the computation of the margin involves multiple classes • (Figure: classes ‘good’, ‘OK’, ‘bad’ ordered along the projection direction w)

  36. Ordinal Regression

  37. Decision Tree (from slides of Andrew Moore)

  38. Decision Tree • A greedy approach for generating a decision tree • Choose the most informative feature, using mutual information as the measure • Split the data set according to the values of the selected feature • Recurse until each data item is classified correctly • Attributes with real values: quantize the real values into discrete ones
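
A sketch of the greedy feature choice via mutual information (information gain) for discrete attributes; real-valued attributes would first be quantized, as the slide notes. The helper names are illustrative.

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature_col, labels):
    """Mutual information between one discrete feature and the class label."""
    gain = entropy(labels)
    for v in np.unique(feature_col):
        mask = feature_col == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

def best_feature(X_discrete, labels):
    """Greedy step of tree building: the most informative feature to split on."""
    return max(range(X_discrete.shape[1]),
               key=lambda j: information_gain(X_discrete[:, j], labels))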

  39. Decision Tree • The overfitting problem • Tree pruning • Reduced error pruning • Rule post-pruning

  41. Generalized Decision Tree • Each node is a linear classifier • (Figure: a decision tree with simple data partitions over Attribute 1 and Attribute 2, contrasted with a decision tree using classifiers for data partition)
