
Machine Learning in Practice Lecture 6



  1. Machine Learning in Practice Lecture 6 Carolyn Penstein Rosé Language Technologies Institute / Human-Computer Interaction Institute

  2. Plan for the Day • Announcements • Clarification on Naïve Bayes with missing information • Feedback on Quiz and Assignment • Finish Naïve Bayes • Start Linear Models

  3. Clarification on Unknown Values • Not a problem for Naïve Bayes • Probabilities computed using only the specified values • Likelihood that play = yes when Outlook = sunny, Temperature = cool, Humidity = high, Windy = true • 2/9 * 3/9 * 3/9 * 3/9 * 9/14 • If Outlook is unknown, 3/9 * 3/9 * 3/9 * 9/14 • Likelihoods will be higher when there are unknown values • Same effect on likelihood of all possible outcomes! • Factored out during normalization
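
A quick numeric check of this point (a sketch in Python, not from the lecture; the P(...|no) values are the ones from the standard weather data set, which the slide's P(...|yes) numbers also come from):

    # Sketch: weather-data likelihoods with and without the Outlook value.
    # Dropping an unknown attribute raises every class's raw likelihood,
    # but normalization still yields a valid probability distribution.
    cond = {
        "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "windy": 3/9},
        "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "windy": 3/5},
    }
    prior = {"yes": 9/14, "no": 5/14}

    def likelihood(label, observed_values):
        # Multiply the prior by the conditionals of the attributes we actually observed.
        score = prior[label]
        for value in observed_values:
            score *= cond[label][value]
        return score

    def normalize(scores):
        total = sum(scores.values())
        return {label: s / total for label, s in scores.items()}

    all_four = {c: likelihood(c, ["sunny", "cool", "high", "windy"]) for c in prior}
    no_outlook = {c: likelihood(c, ["cool", "high", "windy"]) for c in prior}  # Outlook unknown

    print(normalize(all_four))    # probabilities from all four attributes
    print(normalize(no_outlook))  # raw likelihoods are larger, probabilities still sum to 1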

  4. Quiz Notes • Most people did great! • Most frequent issue was the last question where we compared likelihoods and probabilities • Main difference is scaling • Sum of probabilities for all possible events should come out to 1 • That’s what gives statistical models their nice formal properties • Comment about technical versus common usages of terms like likelihood, concept, etc.

  5. Assignment 2 Notes • What is the output from the machine learning model? • Carolyn says: the model • Kishore says: the prediction • Some people had trouble identifying the impact of the noise • Different strategies for finding where the impact of the noise showed up • Error analysis • Some didn’t notice the effect on which information was taken into account

  6. Finishing Naïve Bayes

  7. Scenario (figure: fourteen math skills mapped to fourteen math story problems)

  8. Scenario (figure: math skills and math story problems). Each problem may be associated with more than one skill.

  9. Scenario (figure: math skills and math story problems). Each skill may be associated with more than one problem.

  10. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills

  11. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills • Ideally, we should be able to assign any subset of the full set of skills to any problem • But can we do that accurately?

  12. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills • Ideally, we should be able to assign any subset of the full set of skills to any problem • But can we do that accurately? • If we can’t do that, it may be good enough to assign the single most important skill

  13. How to address the problem? • In reality there is a many-to-many mapping between math problems and skills • Ideally, we should be able to assign any subset of the full set of skills to any problem • But can we do that accurately? • If we can’t do that, it may be good enough to assign the single most important skill • In that case, we will not accomplish the whole task

  14. How to address the problem? • But if we can do that part of the task more accurately, then we might accomplish more overall than if we try to achieve the more ambitious goal

  15. Low resolution gives more information if the accuracy is higher. Remember this discussion from lecture 2?

  16. Which of these approaches is better? • You have a corpus of math problem texts and you are trying to learn models that assign skill labels. • Approach one: You have 91 binary prediction models, each of which makes an independent decision about each math text. • Approach two: You have one multi-class classifier that assigns one out of the same 91 skill labels.

  17. Approach 1 (figure: math skills and math story problems). Each skill corresponds to a separate binary predictor. Each of the 91 binary predictors is applied to each text, so 91 separate predictions are made for each text.

  18. Approach 2 (figure: math skills and math story problems). Each skill corresponds to a separate class value. A single multi-class predictor is applied to each text, so only one prediction is made for each text.

  19. Which of these approaches is better? • You have a corpus of math problem texts and you are trying to learn models that assign skill labels. • Approach one: You have 91 binary prediction models, each of which makes an independent decision about each math text. • Approach two: You have one multi-class classifier that assigns one out of the same 91 skill labels. More power, but more opportunity for error

  20. Which of these approaches is better? • You have a corpus of math problem texts and you are trying to learn models that assign skill labels. • Approach one: You have 91 binary prediction models, each of which makes an independent decision about each math text. • Approach two: You have one multi-class classifier that assigns one out of the same 91 skill labels. Less power, but fewer opportunities for error
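
The course assignments use Weka, but as a rough structural sketch of how the two setups differ, here is what they might look like in scikit-learn (the tiny corpus and skill labels below are placeholders, and MultinomialNB is just one possible base learner):

    # Sketch of the two setups (placeholder data; the course itself uses Weka).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.preprocessing import MultiLabelBinarizer

    texts = ["Ann has 3 apples and buys 2 more", "A train travels 60 miles in 2 hours"]
    X = CountVectorizer().fit_transform(texts)

    # Approach 1: one binary predictor per skill (91 in the lecture's scenario);
    # each text may receive any subset of the skill labels.
    skill_sets = [["skill5", "skill12"], ["skill7"]]
    Y = MultiLabelBinarizer().fit_transform(skill_sets)       # one 0/1 column per skill
    one_vs_all = OneVsRestClassifier(MultinomialNB()).fit(X, Y)

    # Approach 2: a single multi-class predictor; each text gets exactly one skill.
    best_skill = ["skill5", "skill7"]
    multi_class = MultinomialNB().fit(X, best_skill)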

  21. Approach 1: One versus all • Assume you have 40 example texts, and 4 of them have skill5 associated with them • Assume you are using some form of smoothing – 0 counts become 1 • Let’s say WordX occurs with skill5 75% of the time and only once with any other class (it’s the best predictor for skill5) • After smoothing, P(WordX|Skill5) = 2/3 • P(WordX|majority) = 2/38

  22. Counts Without Smoothing • 40 math problem texts • 4 of them are skill5 • WordX occurs with skill5 75% of the time and occurs only once with any other class (it’s the best predictor for skill5) • Counts so far: WordX with Skill5 = 3, WordX with Majority Class = 1

  23. Counts With Smoothing • 40 math problem texts • 4 of them are skill5 • WordX occurs with skill5 75% of the time and occurs only once with any other class (it’s the best predictor for skill5) • Smoothed counts: WordX with Skill5 = 4, WordX with Majority Class = 2

  24. Approach 1 • Assume you have 40 example texts, and 4 of them have skill5 associated with them • Assume you are using some form of smoothing – 0 counts become 1 • Let’s say WordX occurs with skill5 75% of the time and only once with any other class (it’s the best predictor for skill5) • After smoothing, P(WordX|Skill5) = 2/3 • P(WordX|majority) = 2/38

  25. Approach 1 • Let’s say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • In reality, 6 counts of WordY with majority and 1 with Skill5 • With smoothing, we get 7 counts of WordY with majority and 2 with Skill5 • P(WordY|Skill5) = 1/3 • P(WordY|Majority) = 7/38 • Because you multiply the conditional probabilities and the prior probabilities together, it’s nearly impossible to predict the minority class when the data is this skewed • For “WordX WordY” you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority • What would you predict without smoothing?

  26. Counts Without Smoothing • 40 math problem texts • 4 of them are skill5 • WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • Counts: Skill5 – WordX = 3, WordY = 1; Majority Class – WordX = 1, WordY = 6

  27. Counts With Smoothing • 40 math problem texts • 4 of them are skill5 • WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • Smoothed counts: Skill5 – WordX = 4, WordY = 2; Majority Class – WordX = 2, WordY = 7

  28. Approach 1 • Let’s say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • In reality, 6 counts of WordY with majority and 1 with Skill5 • With smoothing, we get 7 counts of WordY with majority and 2 with Skill5 • P(WordY|Skill5) = 1/3 • P(WordY|Majority) = 7/38 • Because you multiply the conditional probabilities and the prior probabilities together, it’s nearly impossible to predict the minority class when the data is this skewed • For “WordX WordY” you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority • What would you predict without smoothing?

  29. Approach 1 • Let’s say WordY occurs 17% of the time with the majority class and 25% of the time with skill5 (it’s a moderately good predictor) • In reality, 6 counts of WordY with majority and 1 with Skill5 • With smoothing, we get 7 counts of WordY with majority and 2 with Skill5 • P(WordY|Skill5) = 1/3 • P(WordY|Majority) = 7/38 • Because you multiply the conditional probabilities and the prior probabilities together, it’s nearly impossible to predict the minority class when the data is this skewed • For “WordX WordY” you would get .66 * .33 * .04 = .009 for skill5 and .05 * .18 * .96 = .009 for majority • What would you predict without smoothing?
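
A back-of-the-envelope check of these numbers (a sketch, not lecture code; the counts are the ones in the tables above, and the priors are the ones used on the slide):

    # Sketch: recomputing the Approach 1 probabilities with simple add-one smoothing.
    n_skill5, n_majority = 4, 36           # 40 texts, 4 labeled skill5

    counts = {                             # raw word counts per class, from the tables above
        "skill5":   {"WordX": 3, "WordY": 1},
        "majority": {"WordX": 1, "WordY": 6},
    }

    def smoothed(word, label, n_class):
        # Add 1 to the count and 2 to the class total, matching the slide's 2/3 and 2/38.
        return (counts[label][word] + 1) / (n_class + 2)

    p_x_s5 = smoothed("WordX", "skill5", n_skill5)      # 4/6  ~ 0.66
    p_y_s5 = smoothed("WordY", "skill5", n_skill5)      # 2/6  ~ 0.33
    p_x_mj = smoothed("WordX", "majority", n_majority)  # 2/38 ~ 0.05
    p_y_mj = smoothed("WordY", "majority", n_majority)  # 7/38 ~ 0.18

    # Scoring the text "WordX WordY" with the priors used on the slide:
    print(round(p_x_s5 * p_y_s5 * 0.04, 3))   # skill5
    print(round(p_x_mj * p_y_mj * 0.96, 3))   # majority -- both land near 0.009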

  30. Linear Models

  31. Remember this: What do concepts look like?

  32. Remember this: What do concepts look like?

  33. Review: Concepts as Lines (figure)

  34. Review: Concepts as Lines (figure)

  35. Review: Concepts as Lines (figure)

  36. Review: Concepts as Lines (figure)

  37. Review: Concepts as Lines (figure, with a new point added) What will be the prediction for this new data point?

  38. What are we learning? • We’re learning to draw a line through a multidimensional space • Really a “hyperplane” • Each function we learn is like a single split in a decision tree • But it can take many features into account at one time rather than just one • F(X) = C0 + C1X1 + C2X2 + C3X3 + … + CnXn • X1–Xn are our attributes • C0–Cn are coefficients • We’re learning the coefficients, which are weights
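
As a minimal sketch of how such a function is used at prediction time (the coefficients and feature values below are made up, not learned from any real data):

    # Sketch: a learned linear function used to classify a new instance.
    def f(x, coefficients, intercept):
        # Intercept plus a weighted sum of the attribute values.
        return intercept + sum(c * xi for c, xi in zip(coefficients, x))

    weights = [0.8, -1.2, 0.3]   # C1..C3, normally learned from data
    bias = 0.5                   # C0
    x_new = [1.0, 0.0, 2.0]      # attribute values X1..X3 for a new instance

    score = f(x_new, weights, bias)
    print("class A" if score > 0 else "class B")   # the hyperplane f(x) = 0 is the split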

  39. Taking a Step Back • We started out with tree learning algorithms that learn symbolic rules with the goal of achieving the highest accuracy • 0R, 1R, Decision Trees (J48) • Then we talked about statistical models that make decisions based on probability • Naïve Bayes • Rules look different – we just store counts • No explicit focus on accuracy during learning • What are the implications of the contrast between an accuracy focus and a probability focus?

  40. Performing well with skewed class distributions • Naïve Bayes has trouble with skewed class distributions because of the contribution of prior probabilities • Remember our math problem case • Linear models can compensate for this • They don’t have any notion of prior probability per se • If they can find a good split on the data, they will find it wherever it is • Problem if there is not a good split
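
One way to probe this claim yourself (a sketch, not part of the lecture; the skewed toy data and the choice of GaussianNB versus Perceptron are illustrative assumptions, and the outcome will depend on how the data is generated):

    # Sketch: fit both learners on a skewed, linearly separable toy set and
    # check how much of the minority class each one recovers.
    import numpy as np
    from sklearn.naive_bayes import GaussianNB
    from sklearn.linear_model import Perceptron

    rng = np.random.default_rng(0)
    X = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(190, 2)),   # majority class: 95% of the data
        rng.normal(loc=6.0, scale=0.5, size=(10, 2)),    # small, cleanly separated class
    ])
    y = np.array([0] * 190 + [1] * 10)

    for model in (GaussianNB(), Perceptron(max_iter=1000)):
        fitted = model.fit(X, y)
        minority_found = fitted.predict(X[y == 1]).mean()   # fraction of minority points recovered
        print(type(model).__name__, minority_found)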

  41. Skewed but clean separation

  42. Skewed but clean separation

  43. Skewed but no clean separation

  44. Skewed but no clean separation

  45. Taking a Step Back • The models we will look at now have rules composed of numbers • So they “look” more like Naïve Bayes than like Decision Trees • But the numbers are obtained through a focus on achieving accuracy • So the learning process is more like Decision Trees • Given these two properties, what can you say about the assumptions they make about the form of the solution and about the world?
