Announcements • Homework 4 is due on this Thursday (02/27/2004) • Project proposal is due on 03/02
Hierarchical Mixture Expert Model Rong Jin
Decision Trees • Pro: • Brings nonlinearity into the model • Con: • Each split is based on only a single attribute
Generalizing Decision Trees • Each node is a linear classifier [Figure: a decision tree with simple single-attribute partitions (Attribute 1, Attribute 2) next to a decision tree that uses linear classifiers for data partition]
Generalized Decision Trees • Each node is a linear classifier • Pro: • Usually results in shallow trees • Introduces nonlinearity into linear classifiers (e.g. logistic regression) • Overcomes overfitting through the regularization mechanism within each classifier • Better way to deal with real-valued attributes • Examples: • Neural network • Hierarchical Mixture Expert Model
Generalized Tree: Example [Figure: a kernel method and a generalized tree each separating the same data around x = 0]
Hierarchical Mixture Expert Model (HME) [Figure: input X feeds the gate r(x); Group Layer with Group 1 g1(x) and Group 2 g2(x); Expert Layer with m1,1(x), m1,2(x), m2,1(x), m2,2(x); output y] • Ask r(x): which group should be used for classifying input x? • If group 1 is chosen, which classifier m(x) should be used? • Classify input x using the chosen classifier m(x)
Hierarchical Mixture Expert Model (HME): Probabilistic Description [Figure: the same two-layer HME architecture] • Two hidden variables • The hidden variable for groups: g ∈ {1, 2} • The hidden variable for classifiers: m ∈ {11, 12, 21, 22}
Hierarchical Mixture Expert Model (HME): Example • r(+1|x) = ¾, r(−1|x) = ¼ • g1(+1|x) = ¼, g1(−1|x) = ¾ • g2(+1|x) = ½, g2(−1|x) = ½ • p(+1|x) = ?, p(−1|x) = ?
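One way to read the example (an assumption, since the slide leaves the answer open): the gate probabilities multiply along each root-to-expert path, giving the probability that each expert is selected, and the expert outputs then mix the class prediction:

```latex
\begin{align*}
P(m_{1,1}\mid x) &= r(+1\mid x)\,g_1(+1\mid x) = \tfrac{3}{4}\cdot\tfrac{1}{4} = \tfrac{3}{16}, &
P(m_{1,2}\mid x) &= \tfrac{3}{4}\cdot\tfrac{3}{4} = \tfrac{9}{16},\\
P(m_{2,1}\mid x) &= r(-1\mid x)\,g_2(+1\mid x) = \tfrac{1}{4}\cdot\tfrac{1}{2} = \tfrac{2}{16}, &
P(m_{2,2}\mid x) &= \tfrac{1}{4}\cdot\tfrac{1}{2} = \tfrac{2}{16},\\
p(y\mid x) &= \sum_{i,j} P(m_{i,j}\mid x)\, m_{i,j}(y\mid x).
\end{align*}
```

The four path probabilities sum to 1; finishing the computation still requires the expert outputs m_{i,j}(y|x), which the slide does not give.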
Training HME • The training examples {xi, yi} contain no information about r(x) and g(x) for each example • Random variables g and m are called hidden variables since they are not exposed in the training data • How do we train a model with hidden variables?
Start with a Random Guess • +: {1, 2, 3, 4, 5}, −: {6, 7, 8, 9} • Random assignment: randomly assign points to each group and expert • Learn classifiers r(x), g(x), m(x) using the randomly assigned points [Figure: r(x) routes {1,2}{6,7} to g1(x) and {3,4,5}{8,9} to g2(x); experts receive m1,1(x): {1}{6}, m1,2(x): {2}{7}, m2,1(x): {3}{9}, m2,2(x): {4,5}{8}]
Adjust Group Memberships • The key is to assign each data point to the group that classifies it correctly with the largest probability • How? [Figure: current assignments — groups {1,2}{6,7} | {3,4,5}{8,9}; experts {1}{6}, {2}{7}, {3}{9}, {5,4}{8}]
Adjust Group Memberships • The key is to assign each data point to the group that classifies it correctly with the largest confidence • Compute p(g=1|x, y) and p(g=2|x, y) [Figure: current assignments — groups {1,2}{6,7} | {3,4,5}{8,9}; experts {1}{6}, {2}{7}, {3}{9}, {5,4}{8}]
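By Bayes' rule, the group posterior weights the gate's preference for a group by how well that group's experts explain the observed label (the standard HME E-step expression, with notation taken from the figure):

```latex
p(g=1\mid x,y)=\frac{r(1\mid x)\sum_{j} g_1(j\mid x)\,m_{1,j}(y\mid x)}
{\sum_{i\in\{1,2\}} r(i\mid x)\sum_{j} g_i(j\mid x)\,m_{i,j}(y\mid x)},
```

and p(g=2|x, y) is the complementary term, so the two posteriors sum to 1.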
Adjust Memberships for Classifiers • The key is to assign each data point to the classifier that classifies it correctly with the largest confidence • Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y) [Figure: updated group assignments — {1,5}{6,7} | {2,3,4}{8,9}]
Adjust Memberships for Classifiers • Compute p(m=1,1|x, y), p(m=1,2|x, y), p(m=2,1|x, y), p(m=2,2|x, y) and assign each point to the classifier with the largest posterior [Figure: updated expert memberships — m1,1(x): {1}{6}, m1,2(x): {5}{7}, m2,1(x): {2,3}{9}, m2,2(x): {4}{8}]
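The expert posterior follows the same pattern as the group posterior: multiply the gate probabilities along the path to an expert by that expert's likelihood of the observed label, then normalize over all experts:

```latex
p(m=i,j\mid x,y)=\frac{r(i\mid x)\,g_i(j\mid x)\,m_{i,j}(y\mid x)}
{\sum_{i'}\sum_{j'} r(i'\mid x)\,g_{i'}(j'\mid x)\,m_{i',j'}(y\mid x)}.
```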
Retrain the Model • Retrain r(x), g(x), m(x) using the new memberships [Figure: r(x) routes {1,5}{6,7} to g1(x) and {2,3,4}{8,9} to g2(x); experts m1,1(x): {1}{6}, m1,2(x): {5}{7}, m2,1(x): {2,3}{9}, m2,2(x): {4}{8}]
Expectation Maximization • Two things need to be estimated: • Logistic regression models for r(x; θr), g(x; θg) and m(x; θm) • Unknown group and expert memberships: p(g=1,2|x), p(m=11,12|x, g=1), p(m=21,22|x, g=2) • E-step • Estimate p(g=1|x, y) and p(g=2|x, y) for all training examples, given the current guesses of r(x; θr), g(x; θg) and m(x; θm) • Estimate p(m=11,12|x, y) and p(m=21,22|x, y) for all training examples, given the same guesses • M-step • Train r(x; θr) using weighted examples: each x counts as a positive example with fraction p(g=1|x) and as a negative example with fraction p(g=2|x) • Train g1(x; θg) using weighted examples: each x counts as a positive example with fraction p(g=1|x) p(m=11|x, g=1) and as a negative example with fraction p(g=1|x) p(m=12|x, g=1); train g2(x; θg) similarly • Train each m(x; θm) with appropriately weighted examples
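The EM loop above can be sketched in code. This is a minimal illustration, not the lecture's implementation: it collapses the hierarchy to a one-level mixture of two logistic experts with one logistic gate (the two-level HME multiplies gate posteriors along each path, as on the earlier slides), and all function names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    # Clip to avoid overflow in exp for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def fit_weighted_logreg(X, t, w, iters=300, lr=0.5):
    """Gradient ascent on the weighted logistic log-likelihood.
    t may be a soft target in [0, 1]; w are per-example weights."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ theta)
        theta += lr * X.T @ (w * (t - p)) / len(t)
    return theta

def em_mixture_of_experts(X, y, n_iter=20, seed=0):
    """EM for a one-level mixture of two logistic experts with a logistic gate.
    E-step: responsibilities q_i = p(expert 1 | x_i, y_i).
    M-step: gate fit to the soft targets q; experts fit with weights q, 1-q."""
    rng = np.random.default_rng(seed)
    n = len(y)
    q = rng.uniform(0.3, 0.7, size=n)   # random initial responsibilities
    ones = np.ones(n)
    for _ in range(n_iter):
        # M-step: retrain gate and experts on the current soft memberships
        th_gate = fit_weighted_logreg(X, q, ones)
        th1 = fit_weighted_logreg(X, y, q)
        th2 = fit_weighted_logreg(X, y, 1.0 - q)
        # E-step: q proportional to gate prob * expert likelihood of the label
        g = sigmoid(X @ th_gate)
        lik1 = np.where(y == 1, sigmoid(X @ th1), 1 - sigmoid(X @ th1))
        lik2 = np.where(y == 1, sigmoid(X @ th2), 1 - sigmoid(X @ th2))
        q = g * lik1 / (g * lik1 + (1 - g) * lik2 + 1e-12)
    return th_gate, th1, th2

def predict_proba(X, th_gate, th1, th2):
    """Mixture prediction: gate-weighted average of the expert outputs."""
    g = sigmoid(X @ th_gate)
    return g * sigmoid(X @ th1) + (1 - g) * sigmoid(X @ th2)
```

Extending this to the two-level HME means keeping one responsibility per expert path, computed as the normalized product r(i|x) g_i(j|x) m_{i,j}(y|x), and reusing the same weighted-logistic-regression fit for every node.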
Comparison of Different Classification Models • The goal of all classifiers • Predicting the class label y for an input x • Estimate p(y|x) • Gaussian generative model • p(y|x) ~ p(x|y) p(y): posterior = likelihood × prior • p(x|y) • Describes the input patterns for each class y • Difficult to estimate if x is of high dimensionality • Naïve Bayes: p(x|y) ~ p(x1|y) p(x2|y) … p(xm|y) • Essentially a linear model • Linear discriminative model • Directly estimates p(y|x) • Focuses on finding the decision boundary
Comparison of Different Classification Models • Logistic regression model • A linear decision boundary: wx+b • A probabilistic model p(y|x) • Maximum likelihood approach for estimating weights w and threshold b
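In its standard form (with y ∈ {−1, +1}), the model and its maximum-likelihood training objective are:

```latex
p(y\mid x)=\frac{1}{1+\exp\!\big(-y\,(\mathbf{w}\cdot\mathbf{x}+b)\big)},
\qquad
\max_{\mathbf{w},\,b}\ \sum_{i=1}^{n}\log p(y_i\mid x_i).
```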
Comparison of Different Classification Models • Logistic regression model • Overfitting issue • Example: text classification • Every word is assigned a different weight • Words that appear in only one document can be driven to infinitely large weights • Solution: add a regularization term to the objective
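Regularization penalizes large weight magnitudes, so a word seen in a single document can no longer push its weight toward infinity. With an ℓ2 penalty the maximum-likelihood objective becomes:

```latex
\max_{\mathbf{w},\,b}\ \sum_{i=1}^{n}\log p(y_i\mid x_i)\;-\;\lambda\,\lVert\mathbf{w}\rVert_2^2,
```

where λ > 0 trades off training-data fit against weight shrinkage.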
Comparison of Different Classification Models • Conditional exponential model • An extension of the logistic regression model to the multi-class case • A different set of weights wy and threshold by for each class y • Maximum entropy model • Finds the simplest model that matches the data: maximize entropy (prefer a uniform distribution) subject to constraints that force the model to be consistent with the observed data
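The multi-class conditional exponential model gives each class its own weight vector and normalizes over all classes:

```latex
p(y\mid x)=\frac{\exp(\mathbf{w}_y\cdot\mathbf{x}+b_y)}{\sum_{y'}\exp(\mathbf{w}_{y'}\cdot\mathbf{x}+b_{y'})}.
```

With two classes this reduces to logistic regression, since only the difference of the two weight vectors matters.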
Comparison of Different Classification Models • Support vector machine • Classification margin • Maximum margin principle: separate data as far as possible from the decision boundary • Two objectives: minimize the classification error over training data; maximize the classification margin • Support vectors: only support vectors have an impact on the location of the decision boundary [Figure: decision boundary with classification margin and support vectors; filled points denote +1, open points denote −1]
Comparison of Different Classification Models • Separable case • Noisy case • Both lead to quadratic programming problems
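The two quadratic programs referenced on the slide, written in their standard primal form (the slide's own equations were lost in extraction; these are the textbook formulations):

```latex
\text{Separable:}\quad
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2
\quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1\ \ \forall i,
```

```latex
\text{Noisy:}\quad
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert\mathbf{w}\rVert^2 + C\sum_{i=1}^{n}\xi_i
\quad \text{s.t.}\quad y_i(\mathbf{w}\cdot\mathbf{x}_i+b)\ge 1-\xi_i,\ \ \xi_i\ge 0\ \ \forall i,
```

where the slack variables ξi absorb margin violations and C controls the error/margin trade-off.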
Comparison of Classification Models • Logistic regression model vs. support vector machine • The two objectives share identical regularization terms; the log-likelihood term can be viewed as a measurement of accuracy
Comparison of Different Classification Models Logistic regression differs from support vector machine only in the loss function
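Writing f(x) = w·x + b with y ∈ {−1, +1} makes the single point of difference explicit, since both models penalize ∥w∥² plus a per-example loss:

```latex
\ell_{\text{logistic}}\big(y,f(x)\big)=\log\!\big(1+e^{-y\,f(x)}\big),
\qquad
\ell_{\text{hinge}}\big(y,f(x)\big)=\max\!\big(0,\ 1-y\,f(x)\big).
```

The logistic loss is a smooth upper surrogate that never reaches zero, while the hinge loss is exactly zero for points beyond the margin, which is why only the support vectors affect the SVM boundary.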
Comparison of Different Classification Models • Generative models have trouble at the decision boundary [Figure: a classification boundary that achieves the least training error vs. a classification boundary that achieves a large margin]
Nonlinear Models • Kernel methods • Add additional dimensions to help separate data • Efficiently compute the dot product in a high-dimensional space [Figure: data on either side of x = 0 mapped into a space where it becomes linearly separable]
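A small numeric check of the kernel idea (a hypothetical 2-D example, not from the slides): the quadratic kernel k(x, z) = (x·z)² equals an ordinary dot product in the explicit feature space φ(x) = (x1², √2·x1x2, x2²), so the higher-dimensional dot product never has to be formed explicitly.

```python
import numpy as np

def quad_kernel(x, z):
    """Quadratic kernel: an implicit dot product in a 3-D feature space."""
    return float(np.dot(x, z)) ** 2

def phi(x):
    """Explicit quadratic feature map for a 2-D input."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# (x.z)^2 = (3 - 2)^2 = 1, and phi(x).phi(z) = 9 - 12 + 4 = 1: same value
assert np.isclose(quad_kernel(x, z), np.dot(phi(x), phi(z)))
```

For a degree-d kernel on m-dimensional inputs, the explicit feature space has O(m^d) coordinates, while the kernel evaluates in O(m); this is the efficiency the slide refers to.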
Nonlinear Models • Decision trees • Nonlinearly combine different features through a tree structure • Hierarchical Mixture Expert Model • Replace each tree node with a logistic regression model • Nonlinearly combine multiple linear models [Figure: the HME architecture with gate r(x), group layer g1(x), g2(x), and expert layer m1,1(x), m1,2(x), m2,1(x), m2,2(x)]