Empirical Research Methods in Computer Science Lecture 7 November 30, 2005 Noah Smith
Using Data [Diagram: Data → Model → Action. The Data-to-Model arrow is labeled "estimation; regression; learning; training"; the Model-to-Action arrow is labeled "classification; decision". This pipeline goes by many names: pattern classification, machine learning, statistical inference, ...]
Probabilistic Models • Let X and Y be random variables. (continuous, discrete, structured, ...) • Goal: predict Y from X. • A model defines P(Y = y | X = x). • Where do models come from? • If we have a model, how do we use it?
Using a Model • We want to classify a message, x, as spam or mail: y ∈ {spam, mail}. [Diagram: x enters the Model, which outputs P(spam | x) and P(mail | x).]
Bayes’ Rule • P(y | x) = P(x | y) P(y) / P(x), where P(x) = Σ_y′ P(x | y′) P(y′) • P(x | y) is the likelihood: one distribution over complex observations per y • P(y) is the prior • P(y | x) is what we said the model must define • the denominator P(x) normalizes into a distribution
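A minimal sketch of Bayes' rule as code; the prior values are the ones from the multinomial example later in the lecture, and the likelihood values are invented for illustration:

```python
# Posterior via Bayes' rule: P(y | x) = P(x | y) P(y) / P(x).
priors = {"spam": 0.455, "mail": 0.545}      # P(y)
likelihoods = {"spam": 3e-4, "mail": 1e-5}   # P(x | y) for one message x (invented)

unnormalized = {y: likelihoods[y] * priors[y] for y in priors}
z = sum(unnormalized.values())               # P(x), the normalizer
posterior = {y: p / z for y, p in unnormalized.items()}
print(posterior)                             # sums to 1: a distribution over y
```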
Naive Bayes Models • Suppose X = (X1, X2, X3, ..., Xm). • Let P(x | y) = P(x1 | y) P(x2 | y) ... P(xm | y); that is, assume the Xj are conditionally independent given Y.
Naive Bayes: Graphical Model [Diagram: a single class node Y with an arrow to each of X1, X2, X3, ..., Xm; every feature depends only on Y.]
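A sketch of how the factored model is used to score and pick a class, assuming the conditional-independence factorization above; the function names and the log-space trick are my choices, not from the slides:

```python
import math

def nb_log_joint(x, y, prior, cond):
    # log P(y) + sum_j log P(x_j | y): the Naive Bayes factorization,
    # computed in log space to avoid underflow on long feature vectors
    return math.log(prior[y]) + sum(math.log(cond[y][xj]) for xj in x)

def classify(x, prior, cond):
    # argmax_y P(y | x); the normalizer P(x) is the same for every y,
    # so comparing log joints is enough
    return max(prior, key=lambda y: nb_log_joint(x, y, prior, cond))
```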
Part II Where do the model parameters come from?
Using Data [Diagram repeated: Data → Model → Action, with the Data-to-Model arrow labeled "estimation; regression; learning; training".]
Warning • This is a HUGE topic. • We will barely scratch the surface.
Forms of Models • Recall that a model defines P(x | y) and P(y). • These can have a simple multinomial form, like P(mail) = 0.545, P(spam) = 0.455 • Or they can take on some other form, like a binomial, Gaussian, etc.
Example: Gaussian • Suppose y ∈ {male, female}, and one observed variable is H, height. • P(H | male) ~ N(μ_m, σ_m²) • P(H | female) ~ N(μ_f, σ_f²) • How do we estimate μ_m, σ_m², μ_f, σ_f²?
Maximum Likelihood • Pick the model that makes the data as likely as possible: max_model P(data | model)
Maximum Likelihood (Gaussian) • Estimating the parameters μ_m, σ_m², μ_f, σ_f² can be seen as • fitting the data • estimating an underlying statistic (a point estimate)
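A sketch of the closed-form maximum-likelihood estimates for the two Gaussians; the height samples are invented:

```python
heights_male = [178.0, 183.5, 171.2, 190.1, 176.8]    # hypothetical sample (cm)
heights_female = [162.3, 158.9, 170.4, 165.0, 161.7]  # hypothetical sample (cm)

def gaussian_mle(sample):
    # MLE for a Gaussian: the sample mean and the (biased,
    # divide-by-n) sample variance maximize P(data | mu, var)
    n = len(sample)
    mu = sum(sample) / n
    var = sum((h - mu) ** 2 for h in sample) / n
    return mu, var

mu_m, var_m = gaussian_mle(heights_male)
mu_f, var_f = gaussian_mle(heights_female)
```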
Example: Regression • Suppose y is actual runtime, and x is input length. • Regression tries to predict some continuous variables from others.
Regression • Linear: assume a linear relationship and fit a line. • We can turn this into a model!
Linear Model • Given x, predict y: y = β1x + β0 + ε, where ε ~ N(0, σ²) is a random deviation around the true regression line y = β1x + β0.
Principle of Least Squares • Minimize the sum of squared vertical deviations from the line. • Unique, closed-form solution!
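A sketch of the closed-form least-squares solution for the simple linear case; the data points are invented:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. input lengths (invented)
ys = [2.1, 3.9, 6.2, 8.1, 9.8]   # e.g. observed runtimes (invented)

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
# Least squares: beta1 = sum (x - x_bar)(y - y_bar) / sum (x - x_bar)^2,
#                beta0 = y_bar - beta1 * x_bar
beta1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
beta0 = y_bar - beta1 * x_bar
```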
Other kinds of regression • transform one or both variables (e.g., take a log) • polynomial regression • (least squares → linear system) • multivariate regression • logistic regression
Example: text categorization • Bag-of-words model: • x is a histogram of counts for all words • y is a topic
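A sketch of the bag-of-words representation; the example message is invented:

```python
from collections import Counter

def bag_of_words(text):
    # histogram of word counts: order and grammar are thrown away
    return Counter(text.lower().split())

x = bag_of_words("buy now buy cheap meds now")
# Counter({'buy': 2, 'now': 2, 'cheap': 1, 'meds': 1})
```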
MLE for Multinomials • “Count and Normalize”: the MLE of P(w | y) is the count of word w in training documents labeled y, divided by the total word count of documents labeled y.
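A sketch of count-and-normalize for one class; the tiny corpus is invented:

```python
from collections import Counter

def multinomial_mle(docs):
    # docs: token lists that all share the same label y;
    # the MLE of P(w | y) is just relative frequency
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

p_word_given_spam = multinomial_mle([["buy", "now", "buy"], ["cheap", "meds"]])
# {'buy': 0.4, 'now': 0.2, 'cheap': 0.2, 'meds': 0.2}
```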
The Truth about MLE • You will never see all the words. • For many models, MLE isn’t safe. • To understand why, consider a typical evaluation scenario.
Evaluation • Train your model on some data. • How good is the model? • Test on different data that the system never saw before. • Why?
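A sketch of held-out evaluation; the helper names and the 80/20 split are my own illustrative choices:

```python
import random

def train_test_split(examples, test_fraction=0.2, seed=0):
    # hold out data the model never sees during training
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classifier, test_set):
    # fraction of held-out (x, y) pairs labeled correctly
    return sum(classifier(x) == y for x, y in test_set) / len(test_set)
```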
Tradeoff [Diagram: a spectrum of model complexity. At one end, low variance but low accuracy even on the training data; at the other, a model that overfits the training data and doesn't generalize.]
Text categorization again • Suppose ‘v1@gra’ never appeared in any document in training, ever. • What is P(x | y) for a new document containing ‘v1@gra’ at test time? • Zero: the MLE gives the unseen word probability 0, so the whole product collapses to 0 for every class.
Solutions • Regularization • Prefer less extreme parameters • Smoothing • “Flatten out” the distribution • Bayesian Estimation • Construct a prior over model parameters, then train to maximize P(data | model) × P(model)
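A sketch of one simple smoothing method, add-one (Laplace) smoothing; using a fixed vocabulary here is a simplifying assumption:

```python
from collections import Counter

def multinomial_add_one(docs, vocab):
    # pretend every vocabulary word was seen once more than it was,
    # so no word gets probability zero
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

vocab = {"buy", "now", "cheap", "meds", "v1@gra"}
p = multinomial_add_one([["buy", "now", "buy"], ["cheap", "meds"]], vocab)
# 'v1@gra' now gets probability 1/10 instead of zero
```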
One More Point • Building models is not the only way to be empirical. • Neural networks, SVMs, instance-based learning • MLE and smoothed/Bayesian estimation are not the only ways to estimate. • Minimize error, for example (“discriminative” estimation)
Assignment 3 • Spam detection • We provide a few thousand examples • Perform EDA and pick features • Estimate probabilities • Build a Naive-Bayes classifier