
Presentation Transcript


  1. Announcements • Project proposal is due on 03/11 • Three seminars this Friday (EB 3105) • Dealing with Indefinite Representations in Pattern Recognition (10:00 am - 11:00 am) • Computational Analysis of Drosophila Gene Expression Pattern Image (11:00 am - 12:00 pm) • 3D General Lesion Segmentation in CT (3:00 pm - 4:00 pm)

  2. Hierarchical Mixture Expert Model Rong Jin

  3. Good Things about Decision Trees • Decision trees introduce nonlinearity through the tree structure: a conjunction of tests A∧B∧C along a path acts like the product A·B·C • Compared to kernel methods: less ad hoc, and easier to understand

  4. Generalized Tree Example [Figure: a generalized tree whose nodes split at x = 0, separating + and - regions] • In general, mixture models are powerful for fitting complex decision boundaries, as in stacking, boosting, bagging, and kernel methods

  5. Generalize Decision Trees (from slides of Andrew Moore) • Each node of a decision tree depends on only a single feature. Is this the best idea?

  6. Partition Datasets • The goal of each node is to partition the dataset into disjoint subsets such that each subset is easier to classify [Figure: the original dataset partitioned by a single attribute, cylinders = 4 / 5 / 6 / 8]

  7. Partition Datasets (cont’d) • More complicated partitions, e.g. “cylinders < 6 and weight > 4 tons”, “cylinders ≥ 6 and weight < 3 tons”, and other cases [Figure: the original dataset partitioned by multiple attributes, using a classification model for each node] • How do we accomplish such a complicated partition? • Each partition → a class • Partitioning a dataset into disjoint subsets → classifying the dataset into multiple classes (see the sketch below)
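To make the last two bullets concrete, here is a minimal sketch (not from the slides; the function name and arguments are illustrative) of a node that partitions a dataset by the predicted class of an arbitrary classifier:

```python
# Minimal sketch (not from the slides): a node's classifier routes each example
# to one child subset, so "partition into disjoint subsets" is implemented as
# "classify into multiple classes".
from collections import defaultdict
import numpy as np

def partition_with_node_classifier(X, node_classifier):
    """Split the rows of X into disjoint subsets keyed by the classifier's
    predicted class; node_classifier is any fitted object with predict()."""
    child_ids = node_classifier.predict(X)        # one child id per example
    subsets = defaultdict(list)
    for row, child in zip(X, child_ids):
        subsets[child].append(row)
    return {child: np.array(rows) for child, rows in subsets.items()}
```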

  8. A More General Decision Tree • Each node is a linear classifier [Figure: a decision tree with simple single-attribute partitions vs. a decision tree using a classifier over Attribute 1 and Attribute 2 to partition the data into + and - regions]

  9. General Schemes for Decision Trees • Each node within the tree is a linear classifier • Pros: • Usually results in shallow trees • Introduces nonlinearity into linear classifiers (e.g. logistic regression) • Overcomes overfitting through the regularization mechanism within the classifier • Partitions datasets with soft memberships, a better way to deal with real-valued attributes • Examples: neural networks, the Hierarchical Mixture Expert Model

  10. Hierarchical Mixture Expert Model (HME) • Router r(x): decides which classifier x should be routed to • Classifiers: determine the class for input x [Diagram: input x → router r(x) → Group Layer (Group 1 g1(x), Group 2 g2(x)) → Expert Layer (m1,1(x), m1,2(x), m2,1(x), m2,2(x))]

  11. Hierarchical Mixture Expert Model (HME) • Which group should be used for classifying x? [Diagram: the router r(x) must choose between Group 1 g1(x) and Group 2 g2(x)]

  12. Hierarchical Mixture Expert Model (HME) • r(x) = +1, so x is routed to Group 1 [Diagram: x follows the branch from r(x) to Group 1 g1(x)]

  13. Hierarchical Mixture Expert Model (HME) • Which expert should be used for classifying x? [Diagram: within Group 1, g1(x) must choose between experts m1,1(x) and m1,2(x)]

  14. Hierarchical Mixture Expert Model (HME) • g1(x) = -1, so x is routed to expert m1,2(x) [Diagram: x follows the branch from g1(x) to m1,2(x)]

  15. Hierarchical Mixture Expert Model (HME) • m1,2(x) = +1, so the predicted class label for x is +1 [Diagram: the full routing path x → r(x) → g1(x) → m1,2(x)]
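Slides 12–15 trace a single hard routing: the router picks a group, the group's gate picks an expert, and the expert outputs the class. A minimal sketch of that routing, assuming each node is a callable returning +1 or -1 (all names here are illustrative, not from the deck):

```python
# Hard routing through a two-level HME, as walked through on slides 12-15.
# Assumption: r, the gates, and the experts are callables mapping x to +1 or -1.

def hme_hard_route(x, r, gates, experts):
    """gates   : {+1: g1, -1: g2}                router output -> group gate
       experts : {(+1, +1): m11, (+1, -1): m12,
                  (-1, +1): m21, (-1, -1): m22}  (router, gate) -> expert
       Returns the chosen expert's label for x."""
    group = r(x)                        # e.g. r(x) = +1 selects Group 1
    choice = gates[group](x)            # e.g. g1(x) = -1 selects expert m1,2
    return experts[(group, choice)](x)  # e.g. m1,2(x) = +1 is the prediction
```

With the values in the slides (r(x) = +1, g1(x) = -1, m1,2(x) = +1), the function returns +1.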

  16. Hierarchical Mixture Expert Model (HME): More Complicated Case • Which group should be used for classifying x? [Diagram: the same HME: router r(x), Group Layer (g1(x), g2(x)), Expert Layer (m1,1(x), m1,2(x), m2,1(x), m2,2(x))]

  17. Hierarchical Mixture Expert Model (HME): More Complicated Case • The router now gives soft outputs: r(+1|x) = ¾, r(-1|x) = ¼ [Diagram: same HME]

  18. Hierarchical Mixture Expert Model (HME): More Complicated Case • r(+1|x) = ¾, r(-1|x) = ¼ • Which expert should be used for classifying x? [Diagram: both groups are active, so all four experts are candidates]

  19. Hierarchical Mixture Expert Model (HME): More Complicated Case • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ • How do we compute the probabilities p(+1|x) and p(-1|x)? [Diagram: same HME]

  20. HME: Probabilistic Description • Random variable g ∈ {1, 2} for the group: r(+1|x) = p(g=1|x), r(-1|x) = p(g=2|x) • Random variable m ∈ {11, 12, 21, 22} for the expert: g1(+1|x) = p(m=11|x, g=1), g1(-1|x) = p(m=12|x, g=1), g2(+1|x) = p(m=21|x, g=2), g2(-1|x) = p(m=22|x, g=2) [Diagram: same HME]

  21. HME: Probabilistic Description • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ • Compute p(+1|x) and p(-1|x) [Diagram: same HME]

  22. HME: Probabilistic Description • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ [Diagram: same HME]

  23. HME: Probabilistic Description • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½
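The computation slides 21–23 carry out can be written compactly; a hedged reconstruction in the notation of slide 20 (the expert outputs m_ij(y|x) are not listed in the transcript, so only the general form and the partially substituted numbers are shown):

```latex
p(y \mid x) \;=\; \sum_{i=1}^{2} p(g{=}i \mid x) \sum_{j=1}^{2} p(m{=}ij \mid x, g{=}i)\, m_{ij}(y \mid x)
\;=\; \tfrac{3}{4}\!\left[\tfrac{1}{4}\, m_{11}(y \mid x) + \tfrac{3}{4}\, m_{12}(y \mid x)\right]
    + \tfrac{1}{4}\!\left[\tfrac{1}{2}\, m_{21}(y \mid x) + \tfrac{1}{2}\, m_{22}(y \mid x)\right]
```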

  24. Hierarchical Mixture Expert Model (HME) • Is HME more powerful than a simple majority-vote approach? [Diagram: the full HME from input x through r(x), the Group Layer, and the Expert Layer to the output y]

  25. Problem with Training HME • Use logistic regression to model r(x), g(x), and m(x) • There are no training examples for r(x) and g(x): for each training example (x, y), we don’t know its group ID or expert ID • So we can’t apply the logistic regression training procedure to r(x) and g(x) directly • The random variables g and m are called hidden variables since they are not exposed in the training data • How do we train a model with incomplete data?

  26. Start with a Random Guess … • Iteration 1: random guess • Randomly assign points to groups and experts • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} [Diagram: the HME with r(x), g1(x), g2(x), and experts m1,1(x) through m2,2(x)]

  27. Start with a Random Guess … • Iteration 1: random guess • Randomly assign points to groups and experts • Training data: +: {1, 2, 3, 4, 5}, -: {6, 7, 8, 9} • Group 1: + {1, 2}, - {6, 7}; Group 2: + {3, 4, 5}, - {8, 9} • Experts: m1,1: + {1}, - {6}; m1,2: + {2}, - {7}; m2,1: + {3}, - {9}; m2,2: + {5, 4}, - {8}

  28. Start with a Random Guess … • Iteration 1: random guess • Randomly assign points to groups and experts: Group 1: + {1, 2}, - {6, 7}; Group 2: + {3, 4, 5}, - {8, 9}; m1,1: + {1}, - {6}; m1,2: + {2}, - {7}; m2,1: + {3}, - {9}; m2,2: + {5, 4}, - {8} • Learn r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) from these assignments • Now, what should we do?

  29. Refine the HME Model • Iteration 2: regroup the data points • Reassign the group membership of each data point • Reassign the expert membership of each data point • New grouping: Group 1: + {1, 5}, - {6, 7}; Group 2: + {2, 3, 4}, - {8, 9} • But how?

  30. Determine Group Memberships • Consider an example (x, +1) • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ • Compute the posterior on your own sheet! [Diagram: same HME]

  31. Determine Group Memberships • Consider an example (x, +1) • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ [Diagram: same HME]

  32. Determine Expert Memberships • Consider an example (x, +1) • r(+1|x) = ¾, r(-1|x) = ¼ • g1(+1|x) = ¼, g1(-1|x) = ¾ • g2(+1|x) = ½, g2(-1|x) = ½ [Diagram: same HME]
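A hedged reconstruction of the posteriors slides 30–32 ask for, obtained by Bayes' rule in the slides' notation (numeric values would additionally require the expert outputs m_ij(+1|x), which are not in the transcript):

```latex
p(g{=}i \mid x, +1) = \frac{p(g{=}i \mid x)\, p(+1 \mid x, g{=}i)}
                           {\sum_{i'} p(g{=}i' \mid x)\, p(+1 \mid x, g{=}i')},
\qquad
p(+1 \mid x, g{=}i) = \sum_{j} p(m{=}ij \mid x, g{=}i)\, m_{ij}(+1 \mid x)

p(m{=}ij \mid x, +1, g{=}i) = \frac{p(m{=}ij \mid x, g{=}i)\, m_{ij}(+1 \mid x)}
                                   {\sum_{j'} p(m{=}ij' \mid x, g{=}i)\, m_{ij'}(+1 \mid x)}
```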

  33. Refine the HME Model • Iteration 2: regroup the data points • Reassign the group membership of each data point • Reassign the expert membership of each data point • Compute the posteriors p(g|x, y) and p(m|x, y, g) for each training example (x, y) • Retrain r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) using the estimated posteriors • New grouping: Group 1: + {1, 5}, - {6, 7}; Group 2: + {2, 3, 4}, - {8, 9} • But how?

  34. Logistic Regression: Soft Memberships • Example: train r(x) using soft memberships

  35. Logistic Regression: Soft Memberships • Example: train m11(x) using soft memberships
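The formulas on slides 34–35 are not in the transcript; a hedged reconstruction of the soft-membership training objectives, in which each example is weighted by its estimated posterior:

```latex
% Router r(x): each example contributes both group labels, weighted by the posteriors.
\max_{\theta_r} \sum_{n} \Big[\, p(g{=}1 \mid x_n, y_n) \log r(+1 \mid x_n)
                              + p(g{=}2 \mid x_n, y_n) \log r(-1 \mid x_n) \Big]

% Expert m_{11}(x): each example is weighted by its posterior of reaching that expert.
\max_{\theta_m} \sum_{n} p(g{=}1 \mid x_n, y_n)\, p(m{=}11 \mid x_n, y_n, g{=}1)\, \log m_{11}(y_n \mid x_n)
```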

  36. Refine the HME Model • Iteration 2: regroup the data points • Reassign the group membership of each data point • Reassign the expert membership of each data point • Compute the posteriors p(g|x, y) and p(m|x, y, g) for each training example (x, y) • Retrain r(x), g1(x), g2(x), m11(x), m12(x), m21(x), m22(x) • New assignments: Group 1: + {1, 5}, - {6, 7}; Group 2: + {2, 3, 4}, - {8, 9}; m1,1: + {1}, - {6}; m1,2: + {5}, - {7}; m2,1: + {2, 3}, - {9}; m2,2: + {4}, - {8} • Repeat the above procedure until it converges (it is guaranteed to converge to a local optimum) • This is the famous Expectation-Maximization (EM) algorithm!

  37. Formal EM Algorithm for HME • Unknown logistic regression models: r(x; θr), {gi(x; θg)}, and {mij(x; θm)} • Unknown group and expert memberships: p(g|x, y), p(m|x, y, g) • E-step: fix the logistic regression models and estimate the memberships • Estimate p(g=1|x, y) and p(g=2|x, y) for all training examples • Estimate p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2) for all training examples • M-step: fix the memberships and learn the logistic regression models • Train r(x; θr) using the soft memberships p(g=1|x, y) and p(g=2|x, y) • Train g1(x; θg) and g2(x; θg) using the soft memberships p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2) • Train m11(x; θm), m12(x; θm), m21(x; θm), and m22(x; θm) using the soft memberships p(m=11, 12|x, y, g=1) and p(m=21, 22|x, y, g=2)
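A compact, self-contained sketch of this EM loop for the 2-group, 2-expert HME, with every node a binary logistic regression. It is an illustration under the stated assumptions (labels y in {+1, -1}; gates and experts trained with the joint posteriors as weights), not code from the deck:

```python
# Hedged sketch of EM for the 2x2 HME: router, two gates, and four experts are
# all binary logistic regressions. Names are illustrative; y is in {+1, -1}.
import numpy as np

def _sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def _prob_pos(theta, X):
    """P(+1 | x) under a logistic model; the bias is folded into theta."""
    return _sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ theta)

def _fit_weighted(X, y, w, lr=0.5, steps=300):
    """Logistic regression trained with per-example soft-membership weights w."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    theta = np.zeros(Xb.shape[1])
    for _ in range(steps):
        s = _sigmoid(y * (Xb @ theta))
        theta += lr / len(X) * Xb.T @ (w * y * (1.0 - s))
    return theta

def em_hme(X, y, n_iters=20, seed=0):
    """EM for the two-level HME: E-step = posteriors, M-step = weighted fits."""
    rng = np.random.default_rng(seed)
    d, n = X.shape[1] + 1, len(X)
    r = rng.normal(size=d)                                            # router r(x)
    g = {1: rng.normal(size=d), 2: rng.normal(size=d)}                # gates g1, g2
    m = {(i, j): rng.normal(size=d) for i in (1, 2) for j in (1, 2)}  # experts
    for _ in range(n_iters):
        # E-step: joint posteriors p(g=i, m=ij | x, y) by Bayes' rule.
        p_g = {1: _prob_pos(r, X)}
        p_g[2] = 1.0 - p_g[1]
        p_m1 = {i: _prob_pos(g[i], X) for i in (1, 2)}     # p(m=i1 | x, g=i)
        joint = {}
        for i in (1, 2):
            for j in (1, 2):
                expert_pos = _prob_pos(m[(i, j)], X)       # m_ij(+1 | x)
                p_y = np.where(y > 0, expert_pos, 1.0 - expert_pos)
                p_m = p_m1[i] if j == 1 else 1.0 - p_m1[i]
                joint[(i, j)] = p_g[i] * p_m * p_y
        total = sum(joint.values())
        post = {k: v / total for k, v in joint.items()}              # p(g, m | x, y)
        post_g = {i: post[(i, 1)] + post[(i, 2)] for i in (1, 2)}    # p(g | x, y)
        # M-step: retrain every node, using the posteriors as soft memberships.
        X2 = np.vstack([X, X])
        pm_labels = np.concatenate([np.ones(n), -np.ones(n)])        # pseudo-labels
        r = _fit_weighted(X2, pm_labels,
                          np.concatenate([post_g[1], post_g[2]]))
        for i in (1, 2):
            g[i] = _fit_weighted(X2, pm_labels,
                                 np.concatenate([post[(i, 1)], post[(i, 2)]]))
            for j in (1, 2):
                m[(i, j)] = _fit_weighted(X, y, post[(i, j)])
    return r, g, m
```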


  39. What Are We Doing? • What is the objective of Expectation-Maximization? • It is still simply maximum likelihood! • The Expectation-Maximization algorithm tries to maximize the log-likelihood function • Most of the time it converges to a local maximum, not a global one • Improved version: annealing EM
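A hedged reconstruction of the objective referred to here, in the notation of slide 20: the log-likelihood of the observed data, with the hidden group and expert variables summed out.

```latex
\ell(\theta) = \sum_{n} \log p(y_n \mid x_n; \theta)
             = \sum_{n} \log \sum_{i=1}^{2} p(g{=}i \mid x_n)
               \sum_{j=1}^{2} p(m{=}ij \mid x_n, g{=}i)\; m_{ij}(y_n \mid x_n)
```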

  40. Annealing EM

  41. Improve HME • It is sensitive to the initial assignments • How can we reduce the risk from bad initial assignments? • Binary tree → K-way trees • Logistic regression → conditional exponential model • Tree structure: can we determine the optimal tree structure for a given dataset?

  42. Comparison of Classification Models • The goal of a classifier: predict the class label y for an input x, i.e. estimate p(y|x) • Gaussian generative model • p(y|x) ∝ p(x|y) p(y): posterior = likelihood × prior • Difficulty in estimating p(x|y) when x comprises multiple elements • Naïve Bayes: p(x|y) ≈ p(x1|y) p(x2|y) … p(xd|y) • Linear discriminative model • Estimates p(y|x) directly, focusing on finding the decision boundary

  43. Comparison of Classification Models • Logistic regression model • A linear decision boundary: wᵀx + b = 0 • A probabilistic model for p(y|x) • Maximum likelihood approach for estimating the weights w and the threshold b
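The slide's formulas are not in the transcript; a standard reconstruction of the logistic regression model and its maximum-likelihood training objective:

```latex
p(y \mid x) = \frac{1}{1 + \exp\!\big(-y\,(\mathbf{w}^{\top}\mathbf{x} + b)\big)},
\quad y \in \{+1, -1\},
\qquad
\max_{\mathbf{w},\, b} \; \sum_{n} \log p(y_n \mid x_n)
```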

  44. Comparison of Classification Models • Logistic regression model • Overfitting issue: in text classification, words that appear in only one document will be assigned infinitely large weights • Solution: regularization • Conditional exponential model • Maximum entropy model: a dual problem of the conditional exponential model
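A standard L2-regularized form of the objective, the usual way to implement the "regularization" fix mentioned above (a reconstruction, not copied from the slide):

```latex
\max_{\mathbf{w},\, b} \; \sum_{n} \log p(y_n \mid x_n) \;-\; \frac{\lambda}{2}\,\|\mathbf{w}\|^2
```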

  45. Comparison of Classification Models • Support vector machine • Classification margin • Maximum margin principle, two objectives: minimize the classification error over the training data, and maximize the classification margin • Support vectors: only the support vectors have an impact on the location of the decision boundary [Figure: a two-class dataset (+1 and -1) with its support vectors marked]

  46. Comparison of Classification Models • Support vector machine: separable case and noisy (non-separable) case • Both lead to quadratic programming!
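A standard reconstruction of the two quadratic programs the slide refers to (the slide's own formulas are not in the transcript):

```latex
% Separable case:
\min_{\mathbf{w},\, b} \; \tfrac{1}{2}\|\mathbf{w}\|^2
\quad \text{s.t.} \quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + b) \ge 1 \;\; \forall n

% Noisy (non-separable) case, with slack variables \xi_n:
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{n} \xi_n
\quad \text{s.t.} \quad y_n(\mathbf{w}^{\top}\mathbf{x}_n + b) \ge 1 - \xi_n, \;\; \xi_n \ge 0
```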

  47. Comparison of Classification Models • Similarity between the logistic regression model and the support vector machine • The logistic regression model is almost identical to the support vector machine except for a different expression for the classification errors; the remaining terms of the two objectives are identical • The log-likelihood can be viewed as a measure of accuracy
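The two "classification error" expressions being contrasted, in their standard forms (a reconstruction; the identical terms referred to above are presumably the shared margin/regularization terms):

```latex
\text{logistic regression:}\;\; \log\!\big(1 + e^{-y(\mathbf{w}^{\top}\mathbf{x} + b)}\big)
\qquad
\text{SVM (hinge loss):}\;\; \max\!\big(0,\; 1 - y(\mathbf{w}^{\top}\mathbf{x} + b)\big)
```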

  48. Comparison of Classification Models • Generative models have trouble at the decision boundary [Figure: a classification boundary that achieves the least training error vs. a classification boundary that achieves a large margin]
