
Machine Learning in Practice Lecture 6



  1. Machine Learning in Practice Lecture 6 Carolyn Penstein Rosé Language Technologies Institute / Human-Computer Interaction Institute

  2. Plan for the Day • Announcements • Clarification on Naïve Bayes with missing information • Feedback on Quiz and Assignment • Finish Naïve Bayes • Start Linear Models

  3. Clarification on Unknown Values • Not a problem for Naïve Bayes • Probabilities computed using only the specified values • Likelihood that play = yes when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true • 2/9 * 3/9 * 3/9 * 3/9 * 9/14 • If Outlook is unknown, 3/9 * 3/9 * 3/9 * 9/14 • Likelihoods will be higher when there are unknown values • Same effect on likelihood of all possible outcomes! • Factored out during normalization
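
A quick numeric check of this point (a sketch in Python, not from the lecture; the P(...|no) values are the ones from the standard weather data set, which the slide's P(...|yes) numbers also come from):

    # Sketch: weather-data likelihoods with and without the Outlook value.
    # Dropping an unknown attribute raises every class's raw likelihood,
    # but normalization still yields a valid probability distribution.
    cond = {
        "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "windy": 3/9},
        "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "windy": 3/5},
    }
    prior = {"yes": 9/14, "no": 5/14}

    def likelihood(label, observed_values):
        # Multiply the prior by the conditionals of the attributes we actually observed.
        score = prior[label]
        for value in observed_values:
            score *= cond[label][value]
        return score

    def normalize(scores):
        total = sum(scores.values())
        return {label: s / total for label, s in scores.items()}

    all_four = {c: likelihood(c, ["sunny", "cool", "high", "windy"]) for c in prior}
    no_outlook = {c: likelihood(c, ["cool", "high", "windy"]) for c in prior}  # Outlook unknown

    print(normalize(all_four))    # probabilities from all four attributes
    print(normalize(no_outlook))  # raw likelihoods are larger, probabilities still sum to 1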

  4. Quiz Notes • Most people did great! • Most frequent issue was the last question where we compared likelihoods and probabilities • Main difference is scaling • Sum of probabilities for all possible events should come out to 1 • That’s what gives statistical models their nice formal properties • Comment about technical versus common usages of terms like likelihood, concept, etc.

  5. Assignment 2 Notes • What is the output from the machine learning model? • Carolyn says: the model • Kishore says: the prediction • Some people had trouble identifying the impact of the noise • Different strategies for finding where the impact of the noise showed up • Error analysis • Some didn’t notice the effect on which information was taken into account

  6. Finishing Naïve Bayes

  7. Scenario (figure: fourteen math skills mapped to fourteen math story problems)

  8. Scenario (figure: math skills and math story problems). Each problem may be associated with more than one skill.

  9. Scenario (figure: math skills and math story problems). Each skill may be associated with more than one problem.

  10. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills

  11. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills • Ideally, we should be able to assign any subset of the full set of skills to any problem • But can we do that accurately?

  12. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills • Ideally, we should be able to assign any subset of the full set of skills to any problem • But can we do that accurately? • If we can’t do that, it may be good enough to assign the single most important skill

  13. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills • Ideally, we should be able to assign any subset of the full set of skills to any problem • But can we do that accurately? • If we can’t do that, it may be good enough to assign the single most important skill • In that case, we will not accomplish the whole task

  14. How to address the problem? • But if we can do that part of the task more accurately, then we might accomplish more overall than if we try to achieve the more ambitious goal

  15. Low resolution gives more information if the accuracy is higher. Remember this discussion from lecture 2?

  16. Which of these approaches is better? • You have a corpus of math problem texts and you are trying to learn models that assign skill labels. • Approach one: You have 91 binary prediction models, each of which makes an independent decision about each math text. • Approach two: You have one multi-class classifier that assigns one out of the same 91 skill labels.

  17. Approach 1 (figure: math skills and math story problems). Each skill corresponds to a separate binary predictor. Each of the 91 binary predictors is applied to each text, so 91 separate predictions are made for each text.

  18. Approach 2 (figure: math skills and math story problems). Each skill corresponds to a separate class value. A single multi-class predictor is applied to each text, so only one prediction is made for each text.

  19. Which of these approaches is better? • You have a corpus of math problem texts and you are trying to learn models that assign skill labels. • Approach one: You have 91 binary prediction models, each of which makes an independent decision about each math text. • Approach two: You have one multi-class classifier that assigns one out of the same 91 skill labels. More power, but more opportunity for error

  20. Which of these approaches is better? • You have a corpus of math problem texts and you are trying to learn models that assign skill labels. • Approach one: You have 91 binary prediction models, each of which makes an independent decision about each math text. • Approach two: You have one multi-class classifier that assigns one out of the same 91 skill labels. Less power, but fewer opportunities for error
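
The course assignments use Weka, but as a rough structural sketch of how the two setups differ, here is what they might look like in scikit-learn (the tiny corpus and skill labels below are placeholders, and MultinomialNB is just one possible base learner):

    # Sketch of the two setups (placeholder data; the course itself uses Weka).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = ["Ann has 3 apples and buys 2 more", "A train travels 60 miles in 2 hours"]
    X = CountVectorizer().fit_transform(texts)

    # Approach 1: one binary predictor per skill (91 in the lecture's scenario);
    # each text may receive any subset of the skill labels.
    skill_sets = [["skill5", "skill12"], ["skill7"]]
    Y = MultiLabelBinarizer().fit_transform(skill_sets)       # one 0/1 column per skill
    one_vs_all = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

    # Approach 2: a single multi-class predictor; each text gets exactly one skill.
    best_skill = ["skill5", "skill7"]
    multi_class = MultinomialNB().fit(X, best_skill)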

  21. Approach 1: One versus all • Assume you have 40 example texts, and 4 of them have skill5 associated with them • Assume you are using some form of smoothing – 0 counts become 1 • Let’s say WordX occurs with skill5 75% of the time and only once with any other class (it’s the best predictor for skill5) • After smoothing, P(WordX|Skill5) = 2/3 • P(WordX|majority) = 2/38

  22. Counts Without Smoothing • 40 math problem texts • 4 of them are skill5 • WordX occurs with skill5 75% of the time and occurs only once with any other class (it’s the best predictor for skill5) • Counts so far: WordX with Skill5 = 3, WordX with Majority Class = 1

  23. Counts With Smoothing • 40 math problem texts • 4 of them are skill5 • WordX occurs with skill5 75% of the time and occurs only once with any other class (it’s the best predictor for skill5) • Smoothed counts: WordX with Skill5 = 4, WordX with Majority Class = 2

  24. Approach 1 • Assume you have 40 example texts, and 4 of them have skill5 associated with them • Assume you are using some form of smoothing – 0 counts become 1 • Let’s say WordX occurs with skill5 75% of the time and only once with any other class (it’s the best predictor for skill5) • After smoothing, P(WordX|Skill5) = 2/3 • P(WordX|majority) = 2/38

  25. Approach 1 • Let’s say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • In reality, 6 counts of WordY with majority and 1 with Skill5 • With smoothing, we get 7 counts of WordY with majority and 2 with Skill5 • P(WordY|Skill5) = 1/3 • P(WordY|Majority) = 7/38 • Because you multiply the conditional probabilities and the prior probabilities together, it’s nearly impossible to predict the minority class when the data is this skewed • For “WordX WordY” you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority • What would you predict without smoothing?

  26. Counts Without Smoothing • 40 math problem texts • 4 of them are skill5 • WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • Counts: Skill5 – WordX = 3, WordY = 1; Majority Class – WordX = 1, WordY = 6

  27. Counts With Smoothing • 40 math problem texts • 4 of them are skill5 • WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • Smoothed counts: Skill5 – WordX = 4, WordY = 2; Majority Class – WordX = 2, WordY = 7

  28. Approach 1 • Let’s say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • In reality, 6 counts of WordY with majority and 1 with Skill5 • With smoothing, we get 7 counts of WordY with majority and 2 with Skill5 • P(WordY|Skill5) = 1/3 • P(WordY|Majority) = 7/38 • Because you multiply the conditional probabilities and the prior probabilities together, it’s nearly impossible to predict the minority class when the data is this skewed • For “WordX WordY” you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority • What would you predict without smoothing?

  29. Approach 1 • Let’s say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • In reality, 6 counts of WordY with majority and 1 with Skill5 • With smoothing, we get 7 counts of WordY with majority and 2 with Skill5 • P(WordY|Skill5) = 1/3 • P(WordY|Majority) = 7/38 • Because you multiply the conditional probabilities and the prior probabilities together, it’s nearly impossible to predict the minority class when the data is this skewed • For “WordX WordY” you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority • What would you predict without smoothing?
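
A back-of-the-envelope check of these numbers (a sketch, not lecture code; the counts are the ones in the tables above, and the priors are the ones used on the slide):

    # Sketch: recomputing the Approach 1 probabilities with simple add-one smoothing.
    n_skill5, n_majority = 4, 36           # 40 texts, 4 labeled skill5

    counts = {                             # raw word counts per class, from the tables above
        "skill5":   {"WordX": 3, "WordY": 1},
        "majority": {"WordX": 1, "WordY": 6},
    }

    def smoothed(word, label, n_class):
        # Add 1 to the count and 2 to the class total, matching the slide's 2/3 and 2/38.
        return (counts[label][word] + 1) / (n_class + 2)

    p_x_s5 = smoothed("WordX", "skill5", n_skill5)      # 4/6  ~ 0.66
    p_y_s5 = smoothed("WordY", "skill5", n_skill5)      # 2/6  ~ 0.33
    p_x_mj = smoothed("WordX", "majority", n_majority)  # 2/38 ~ 0.05
    p_y_mj = smoothed("WordY", "majority", n_majority)  # 7/38 ~ 0.18

    # Scoring the text "WordX WordY" with the priors used on the slide:
    print(round(p_x_s5 * p_y_s5 * 0.04, 3))   # skill5
    print(round(p_x_mj * p_y_mj * 0.96, 3))   # majority -- both land near 0.009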

  30. Linear Models

  31. Remember this: What do concepts look like?

  32. Remember this: What do concepts look like?

  33. Review: Concepts as Lines (figure)

  34. Review: Concepts as Lines (figure)

  35. Review: Concepts as Lines (figure)

  36. Review: Concepts as Lines (figure)

  37. Review: Concepts as Lines (figure, with a new point added) What will be the prediction for this new data point?

  38. What are we learning? • We’re learning to draw a line through a multidimensional space • Really a “hyperplane” • Each function we learn is like a single split in a decision tree • But it can take many features into account at one time rather than just one • F(X) = C0 + C1X1 + C2X2 + C3X3 + … + CnXn • X1–Xn are our attributes • C0–Cn are coefficients • We’re learning the coefficients, which are weights
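
As a minimal sketch of how such a function is used at prediction time (the coefficients and feature values below are made up, not learned from any real data):

    # Sketch: a learned linear function used to classify a new instance.
    def f(x, coefficients, intercept):
        # Intercept plus a weighted sum of the attribute values.
        return intercept + sum(c * xi for c, xi in zip(coefficients, x))

    weights = [0.8, -1.2, 0.3]   # C1..C3, normally learned from data
    bias = 0.5                   # C0
    x_new = [1.0, 0.0, 2.0]      # attribute values X1..X3 for a new instance

    score = f(x_new, weights, bias)
    print("class A" if score > 0 else "class B")   # the hyperplane f(x) = 0 is the split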

  39. Taking a Step Back • We started out with tree learning algorithms that learn symbolic rules with the goal of achieving the highest accuracy • 0R, 1R, Decision Trees (J48) • Then we talked about statistical models that make decisions based on probability • Naïve Bayes • Rules look different – we just store counts • No explicit focus on accuracy during learning • What are the implications of the contrast between an accuracy focus and a probability focus?

  40. Performing well with skewed class distributions • Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities • Remember our math problem case • Linear models can compensate for this • They don’t have any notion of prior probability per se • If they can find a good split on the data, they will find it wherever it is • Problem if there is not a good split
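
One way to probe this claim yourself (a sketch, not part of the lecture; the skewed toy data and the choice of GaussianNB versus Perceptron are illustrative assumptions, and the outcome will depend on how the data is generated):

    # Sketch: fit both learners on a skewed, linearly separable toy set and
    # check how much of the minority class each one recovers.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron

    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(190, 2)),   # majority class: 95% of the data
        rng.normal(loc=6.0, scale=0.5, size=(10, 2)),    # small, cleanly separated class
    ])
    y = np.array([0] * 190 + [1] * 10)

    for model in (GaussianNB(), Perceptron(max_iter=1000)):
        fitted = model.fit(X, y)
        minority_found = fitted.predict(X[y == 1]).mean()   # fraction of minority points recovered
        print(type(model).__name__, minority_found)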

  41. Skewed but clean separation

  42. Skewed but clean separation

  43. Skewed but no clean separation

  44. Skewed but no clean separation

  45. Taking a Step Back • The models we will look at now have rules composed of numbers • So they “look” more like Naïve Bayes than like Decision Trees • But the numbers are obtained through a focus on achieving accuracy • So the learning process is more like Decision Trees • Given these two properties, what can you say about the assumptions they make about the form of the solution and about the world?
