
Intro to Machine Learning




Presentation Transcript


  1. Intro to Machine Learning Parameter Estimation for Bayes Nets Naïve Bayes

  2. Recall: Key Components of Intelligent Agents Representation Language: Graph, Bayes Nets Inference Mechanism: A*, variable elimination, Gibbs sampling Learning Mechanism: For today! ------------------------------------- Evaluation Metric

  3. Machine Learning Determining a model for how something works, based on examples of it working (data). This is a very general problem with many applications, and many big companies today are getting rich by doing this well.

  4. Quiz: Companies doing ML For each company below, think of a type of data that the company can use to learn something useful.

  5. Answer: Companies doing ML For each company below, think of a type of data that the company can use to learn something useful.

  6. ML in Research Systems • DARPA Grand Challenge • Some of my own research: learning to understand English sentences. “Secretary of Energy Steven Chu announced Friday that he was resigning pending the confirmation of a successor.” The system can predict: • announced is an action • Secretary of Energy Steven Chu is a person who is doing the action • Friday is a date/time describing when the action happened • he was resigning pending the confirmation of a successor is the thing being announced. Very few AI systems today have no learning component.

  7. Example: Parameter Estimation in BNs (diagram: Sunny? and Raise? are parents of Happy?) Recall this BN from before. Let’s pretend now that none of the parameters were given to you.

  8. Example: Parameter Estimation in BNs (diagram: Sunny? and Raise? are parents of Happy?) How can we figure out what these parameters should be? The ML answer: 1. Collect some data 2. Find parameters that explain this data

  9. Example: Parameter Estimation in BNs (diagram: Sunny? and Raise? are parents of Happy?) Example Data 1. +s, -r, +h 2. +s, +r, +h 3. +s, -r, +h 4. -s, -r, -h 5. +s, -r, +h 6. -s, -r, -h

  10. Quiz: Parameter Estimation in BNs (diagram: Sunny? and Raise? are parents of Happy?) Example Data 1. +s, -r, +h 2. +s, +r, +h 3. +s, -r, +h 4. -s, -r, -h 5. +s, -r, +h 6. -s, -r, -h Given the data above, what would you estimate for P(+s) = P(+r) = P(+h | +s, -r) =

  11. Answer: Parameter Estimation in BNs (diagram: Sunny? and Raise? are parents of Happy?) Example Data 1. +s, -r, +h 2. +s, +r, +h 3. +s, -r, +h 4. -s, -r, -h 5. +s, -r, +h 6. -s, -r, -h Given the data above, what would you estimate for P(+s) = 4 / 6 = 0.67 P(+r) = 1 / 6 = 0.167 P(+h | +s, -r) = 3 / 3 = 1.0
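The counting in the answer above can be sketched in a few lines of Python. The data is exactly the six examples listed on the slide; True stands for + and False for -:

```python
# Each sample is a (sunny, raise, happy) triple from the slide's data.
data = [
    (True,  False, True),   # 1. +s, -r, +h
    (True,  True,  True),   # 2. +s, +r, +h
    (True,  False, True),   # 3. +s, -r, +h
    (False, False, False),  # 4. -s, -r, -h
    (True,  False, True),   # 5. +s, -r, +h
    (False, False, False),  # 6. -s, -r, -h
]

n = len(data)
p_s = sum(s for s, r, h in data) / n      # P(+s) = 4/6
p_r = sum(r for s, r, h in data) / n      # P(+r) = 1/6

# Condition on +s, -r: keep only matching rows, then count +h among them.
matching = [h for s, r, h in data if s and not r]
p_h_given = sum(matching) / len(matching)  # P(+h | +s, -r)

print(p_s, p_r, p_h_given)
```

Note that conditioning is just counting within the subset of examples that match the conditioning variables, which is exactly the Cjoint / Cmarginal ratio defined on the next slide.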

  12. Maximum Likelihood Parameter Estimation To estimate a parameter P(X1=a1, …, XN=aN | Y1=b1, …, YM=bM) Maximum Likelihood Estimation (MLE) Algorithm: • Cjoint = Count how many times (X1=a1, …, XN=aN, Y1=b1, …, YM=bM) appears in the dataset. • Cmarginal = Count how many times (Y1=b1, …, YM=bM) appears in the dataset • Set the parameter = Cjoint / Cmarginal

  13. Quiz: MLE What’s the difference between MLE and rejection sampling?

  14. Answer: MLE What’s the difference between MLE and rejection sampling? The parameter estimation procedure is the same, but rejection sampling gets its samples by generating them from the Bayes Net. This requires knowing the parameters of the BN. MLE gets its samples from some external source.

  15. Where does data come from? This is a fundamental practical consideration for machine learning. The answer: wherever you can get it the easiest. Some examples: • For medical diagnosis ML systems, the system needs examples of X-ray images that are labeled with a diagnosis, e.g. “bone broken” or “bone not broken”. Typically, this data must be obtained from a “human expert”, in this case a doctor trained in radiology. These people’s time is EXPENSIVE, so there’s usually not a lot of this data available. • For speech recognition ML systems, the system needs examples of speech recordings (audio files), labeled with the corresponding English words. You can pay users of Amazon’s Mechanical Turk a couple of pennies per example to label these audio files. • “Language models” are systems that are really important for processing human language. These systems try to predict the next word in a sequence of words, and they need examples of English sentences. There are billions of such sentences available on the Web and elsewhere; to get them, you just need to write some software to crawl the Web and grab the sentences.

  16. Likelihood Likelihood refers to the following probability: P(D | M), where D is your data and M is your model (in our case, a Bayes Net). When the data consists of multiple examples, ML most often (but not always) assumes that these examples are independent. This means we can re-write the likelihood like this: P(d1, …, dk | M) = P(d1 | M) * P(d2 | M) * … * P(dk | M)

  17. Quiz: Likelihood (diagram: Sunny? and Raise? are parents of Happy?) Example Data 1. +s, -r, +h 2. +s, +r, +h What is the likelihood of this data, given this BN?

  18. Answer: Likelihood (diagram: Sunny? and Raise? are parents of Happy?) Example Data 1. +s, -r, +h 2. +s, +r, +h What is the likelihood of this data, given this BN? Likelihood is P(D | BN) = P(d1 | BN) * P(d2 | BN) = P(+s, -r, +h) * P(+s, +r, +h) = P(+s)P(-r)P(+h | +s, -r) * P(+s)P(+r)P(+h | +s, +r) = .7 * .99 * .7 * .7 * .01 * 1 = .0034
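As a sanity check, the product above can be computed numerically. The parameter values below are the ones plugged in on the slide: P(+s) = 0.7, P(+r) = 0.01, P(+h | +s, -r) = 0.7, P(+h | +s, +r) = 1.0:

```python
p_s, p_r = 0.7, 0.01      # P(+s), P(+r)
p_h_sp_rn = 0.7           # P(+h | +s, -r)
p_h_sp_rp = 1.0           # P(+h | +s, +r)

d1 = p_s * (1 - p_r) * p_h_sp_rn   # P(+s, -r, +h) = .7 * .99 * .7
d2 = p_s * p_r * p_h_sp_rp         # P(+s, +r, +h) = .7 * .01 * 1
likelihood = d1 * d2
print(round(likelihood, 4))        # 0.0034
```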

  19. Maximum Likelihood “Maximum Likelihood Estimation” is called that because the parameters it finds for dataset D are the parameters that make P(D | BN) biggest. Let m be the maximum likelihood estimate for P(+s). P(D | BN) = m * P(-r) * P(+h | +s, -r) * m * P(+r) * P(+h | +s, +r) * m * P(-r) * P(+h | +s, -r) * (1-m) * P(-r) * P(-h | -s, -r) * m * P(-r) * P(+h | +s, -r) * (1-m) * P(-r) * P(-h | -s, -r) Data 1. +s, -r, +h 2. +s, +r, +h 3. +s, -r, +h 4. -s, -r, -h 5. +s, -r, +h 6. -s, -r, -h

  20. Maximum Likelihood Mathematical trick: finding the biggest point of f(m) is equivalent to finding the biggest point of log f(m). P(D | BN) = m * P(-r) * P(+h | +s, -r) * m * P(+r) * P(+h | +s, +r) * m * P(-r) * P(+h | +s, -r) * (1-m) * P(-r) * P(-h | -s, -r) * m * P(-r) * P(+h | +s, -r) * (1-m) * P(-r) * P(-h | -s, -r)

  21. Maximum Likelihood To find the largest point of P(D | BN), we’ll take the derivative:

  22. Maximum Likelihood To find the largest point of P(D | BN), we’ll set the derivative equal to zero: Notice: This is the same value that you got by doing the MLE algorithm!
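The derivative step (shown as an image on the original slides) can be reconstructed from the product above: m appears in four factors (one per +s example) and (1-m) in two, while the other terms are constants in m, so:

```latex
\log P(D \mid BN) \;=\; 4\log m \;+\; 2\log(1-m) \;+\; \text{const}

\frac{d}{dm}\,\log P(D \mid BN) \;=\; \frac{4}{m} \;-\; \frac{2}{1-m} \;=\; 0
\quad\Longrightarrow\quad 4(1-m) = 2m
\quad\Longrightarrow\quad m = \tfrac{4}{6} \approx 0.67
```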

  23. More typical ML example: Spam Detection Dear Anita,   We love our customers! To show our appreciation, Pita Delite is happy to announce the $5.99 Meal Deal. For only $5.99, get a sandwich (hot or cold), one side and a drink at Pita Delite. Offer valid in the Greensboro and High Point locations until February 28, 2013. We hope to see you soon! The Pita Delite Team PS Like us on Facebook and follow us on Twitter for special offers. [SPAM] ...and here's a gift from us to you. Enter the discount code LOVE2LOVE during the checkout and get 10% off your purchase. [SPAM] I'm sorry for how late this email is. I was planning on getting you information sooner but I have been very busy. Here is my resume; it should be enough information about me that would be useful in a letter of recommendation. Thank you for doing this. [HAM!]

  24. Email users supply labeled data (diagram: each email X is mapped by an unknown function f to a label Y)

  25. Building a classifier, Step 1 Text is complicated, and systems aren’t yet good enough to understand it. Step 1 is to simplify the X variable to something that is computationally easy to handle.

  26. “Bag of Words” Representation • Construct a dictionary, which contains the set of distinct words in all of your examples. Dear Anita,   We love our customers! To show our appreciation, Pita Delite is happy to announce the $5.99 Meal Deal.  … Dictionary dear anita we love our customers to show appreciation pita delite is happy announce the $5.99 meal deal and here’s a gift from us you enter discount code love2love during checkout get 10% off your purchase i’m sorry for how late this email i was planning on getting information sooner but have been busy ...and here's a gift from us to you. Enter the discount code LOVE2LOVE during the checkout and get 10% off your purchase. I'm sorry for how late this email is. I was planning on getting you information sooner but I have been busy. …

  27. “Bag of Words” Representation 2. For each email, for each word w in the dictionary, count how many times w appears in the email. Dear Anita,   We love our customers! To show our appreciation, Pita Delite is happy to announce the $5.99 Meal Deal.  …

  28. “Bag of Words” Representation 2. For each email, for each word w in the dictionary, count how many times w appears in the email. ...and here's a gift from us to you. Enter the discount code LOVE2LOVE during the checkout and get 10% off your purchase.

  29. “Bag of Words” Representation 2. For each email, for each word w in the dictionary, count how many times w appears in the email. I'm sorry for how late this email is. I was planning on getting you information sooner but I have been busy. …

  30. “Bag of Words” Representation X now consists of a number of numerical “features” or “attributes”, X1 up to XN. You can think of each of these features as an observable random variable. We’ll use these features to construct the classifier.
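The two steps above (construct a dictionary, then count each dictionary word per email) can be sketched as follows; the shortened emails here are stand-ins for the full ones on the slides:

```python
import re
from collections import Counter

emails = [
    "we love our customers to show our appreciation",
    "enter the discount code love2love during the checkout",
    "sorry for how late this email is",
]

def tokenize(text):
    # Lowercase and split into word-like tokens.
    return re.findall(r"[a-z0-9$%']+", text.lower())

# Step 1: the dictionary is the set of distinct words across all examples.
dictionary = sorted({w for e in emails for w in tokenize(e)})

# Step 2: one numeric feature per dictionary word -- its count in the email.
def bag_of_words(email):
    counts = Counter(tokenize(email))
    return [counts[w] for w in dictionary]

features = [bag_of_words(e) for e in emails]
print(features[0][dictionary.index("our")])   # "our" appears twice in email 0
```

Every email is thus turned into a fixed-length vector of counts, regardless of its original length or word order; that loss of order is what the name “bag of words” refers to.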

  31. Quiz: Bag of words Below are three (contrived) email messages. Construct a bag of words representation for each.

  32. Quiz: Bag of words (answer) Below are three (contrived) email messages. Construct a bag of words representation for each.

  33. Quiz: Bayesian spam classifier If you had to come up with a Bayes Net to predict which of these messages was Spam, what would it look like?

  34. Answer: Bayesian spam classifier If you had to come up with a Bayes Net to predict which of these messages was Spam, what would it look like? There are lots of possible answers; next, I’ll show a common kind of BN used for this.

  35. Building a classifier, Step 2 Once you’ve got a set of features for your examples, it’s time to decide on what type of classifier you’d like to use. Technically, this is called choosing a hypothesis space – a set (or “space”) of possible classifiers (or “hypotheses”). Bayes Nets can make fine classifiers. However, the space of ALL Bayes Nets is too big for building a good spam detector. We’re going to restrict our attention to a special class of Bayes Nets called Naïve Bayes models.

  36. Naïve Bayes Classifier Naïve Bayes is a simple and widely-used model in ML for many different problems. It is a Bayes Net with one parent node Y and N children X1, X2, …, XN. The children are typically observable, and the parent is typically unobservable. Notice the conditional independence assumption: each Xi is conditionally independent of every Xj, given Y.
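Written out, this conditional independence assumption means the joint distribution of the model factors as:

```latex
P(Y, X_1, \dots, X_N) \;=\; P(Y)\,\prod_{i=1}^{N} P(X_i \mid Y)
```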

  37. Learning a Naïve Bayes Classifier Parameter estimation for NBCs is the same as for other BNs. To simplify our problem, we’ll assume all Xi variables are boolean (1 or 0).
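A minimal training sketch, assuming boolean features as stated above. The tiny dataset (three emails, two made-up word features) is purely illustrative, not the one from the slides:

```python
def train_nb(X, y):
    """MLE for a Naive Bayes classifier. X: boolean feature vectors,
    y: labels (True = +spam). Returns P(+spam) and the per-feature
    conditionals P(+x_i | +spam) and P(+x_i | +ham)."""
    spam = [x for x, label in zip(X, y) if label]
    ham = [x for x, label in zip(X, y) if not label]
    p_spam = len(spam) / len(y)
    n_feats = len(X[0])
    p_x_spam = [sum(x[i] for x in spam) / len(spam) for i in range(n_feats)]
    p_x_ham = [sum(x[i] for x in ham) / len(ham) for i in range(n_feats)]
    return p_spam, p_x_spam, p_x_ham

# Hypothetical data: features are ("love" present, "resume" present).
X = [[1, 0], [1, 0], [0, 1]]
y = [True, True, False]
p_spam, p_x_spam, p_x_ham = train_nb(X, y)
print(p_spam, p_x_spam, p_x_ham)
```

Each parameter is just a count ratio, exactly as in the MLE algorithm from slide 12; the only structure the NBC adds is that every feature is conditioned on the class alone.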

  38. Quiz: Learning a Naïve Bayes Classifier How many parameters do we need to learn for our NBC for spam detection? If we use MLE, what parameter would be learned for P(+spam)? How about for P(+energy | +spam)? How about for P(+will | +ham)?

  39. Answer: Learning a Naïve Bayes Classifier How many parameters do we need to learn for our NBC for spam detection? 19 = 1 (+spam) + 9 (+word | +spam) + 9 (+word | +ham) If we use MLE, what parameter would be learned for P(+spam)? 2/3 How about for P(+energy | +spam)? 2/2 = 1.0 How about for P(+will | +ham)? 1/2 = 0.5

  40. Quiz: Prediction with an NBC (diagram: Spam node with word features …, I, love, this, …) What is P(Spam | “sports fans love this”)?

  41. Answer: Prediction with an NBC (diagram: Spam node with word features …, I, love, this, …) What is P(+spam | “sports fans love this”)? = P(+spam, “sports fans love this”) / P(“sports fans love this”) = P(+spam) * P(sports | +spam) * P(fans | +spam) * P(love | +spam) * P(this | +spam) / [ P(+spam) * P(sports | +spam) * P(fans | +spam) * P(love | +spam) * P(this | +spam) + P(+ham) * P(sports | +ham) * P(fans | +ham) * P(love | +ham) * P(this | +ham) ]

  42. Answer: Prediction with an NBC (diagram: Spam node with word features …, I, love, this, …) What is P(+spam | “sports fans love this”)? = P(+spam) * P(+sports | +spam) * P(+fans | +spam) * P(+love | +spam) * P(+this | +spam) / [ P(+spam) * P(+sports | +spam) * P(+fans | +spam) * P(+love | +spam) * P(+this | +spam) + P(+ham) * P(+sports | +ham) * P(+fans | +ham) * P(+love | +ham) * P(+this | +ham) ] = .67 * .5 * .5 * 1 * .5 / [ .67 * .5 * .5 * 1 * .5 + .33 * 1 * 0 * 1 * 0 ] = 1.0 Note: the true answer would also include terms for P(-i | +spam), P(-energy | +spam), P(-drink | +spam), etc. I left them out for brevity.
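Plugging the slide's numbers into the formula above: P(+spam) = .67, the four word probabilities given +spam are .5, .5, 1, .5, and given +ham are 1, 0, 1, 0:

```python
from math import prod

p_spam, p_ham = 0.67, 0.33
w_spam = [0.5, 0.5, 1.0, 0.5]  # P(sports|+spam), P(fans|+spam), P(love|+spam), P(this|+spam)
w_ham = [1.0, 0.0, 1.0, 0.0]   # the same four words, given +ham

num = p_spam * prod(w_spam)
den = num + p_ham * prod(w_ham)
posterior = num / den
print(posterior)   # 1.0 -- P(fans | +ham) = 0 zeroes out the entire ham branch
```

That single zero parameter forcing the posterior to exactly 1.0 is the overfitting problem discussed on the next slides.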

  43. Overfitting Overfitting occurs when a statistical model (aka, a “classifier” in ML) describes random error or noise instead of the underlying relationship. Overfitting generally occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model which has been overfit will generally have poor predictive performance, as it can exaggerate minor fluctuations in the data.

  44. Overfitting our NBC (diagram: Spam node with word features …, I, love, this, …) Our model has overfit. For instance, it believes there is ZERO CHANCE of seeing “I” in a spam message. This is true in the 3 training messages, but it’s too strong a conclusion to draw from just 3 training examples. It leads to poor predictions on new examples, such as P(+spam | “I love energy drink”).

  45. Laplace Smoothing For a binary variable X, the MLE from N examples is P(+x) = Count(+x) / N. The Laplace-smoothed estimate is P(+x) = (Count(+x) + K) / (N + 2K): pretend we start with 2K (fake) examples, half with +x and half with -x.

  46. Quiz: Laplace smoothing Let K=1. Assume our training data contains 1 example, of which 1 is +spam. P(+spam)=? 10 examples, 4 of which are +spam. P(+spam)=? 100 examples, 40 of which are +spam. P(+spam)=? 1000 examples, 400 of which are +spam. P(+spam)=?

  47. Answers: Laplace smoothing Let K=1. Assume our training data contains 1 example, of which 1 is +spam. P(+spam) = (Count(+spam)+1) / (N+2) = (1+1) / (1+2) = 2/3 10 examples, 4 of which are +spam. P(+spam) = (4+1) / (10+2) = 5/12 ≈ 0.417 100 examples, 40 of which are +spam. P(+spam) = (40+1) / (100+2) = 41/102 ≈ 0.402 1000 examples, 400 of which are +spam. P(+spam) = (400+1) / (1000+2) = 401/1002 ≈ 0.400 As the number of training examples increases, Laplace smoothing has a smaller and smaller effect. It’s only when there’s not much training data that it has a big effect.
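The formula above, P(+spam) = (Count(+spam) + K) / (N + 2K), reproduces all four answers:

```python
def laplace(count_pos, n, k=1):
    # K fake positive and K fake negative examples added to the real counts.
    return (count_pos + k) / (n + 2 * k)

for count, n in [(1, 1), (4, 10), (40, 100), (400, 1000)]:
    print(count, n, laplace(count, n))
```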

  48. Quiz: Laplace Smoothing (diagram: Spam node with word features …, I, love, this, …) Fill in the parameters using Laplace smoothing, with K=1.

  49. Answers: Laplace Smoothing (diagram: Spam node with word features …, I, love, this, …) Fill in the parameters using Laplace smoothing, with K=1.

  50. Quiz: Laplace Smoothing (diagram: Spam node with word features …, I, love, this, …) What is P(+spam | “sports fans love this”)?
