LING / C SC 439/539 Statistical Natural Language Processing Lecture 6 1/30/2013
Recommended reading • Warning: this lecture is very hand-wavy. Read a book for mathematical details. • Hastie Chapters 1 and 2 • http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Nilsson Chapter 1 • http://ai.stanford.edu/~nilsson/mlbook.html • Or “chapter 1” of any machine learning textbook
Outline • Feature space • Classification models • Model selection • Plotting points and discriminant
Review of classification • Formulate the classification problem • Instances, labels, features • Split annotated corpus into train/test sets • Using the training set: • Compute features of each instance • Train the classifier on these instances • Output a classifier • Apply to new data: compute features, then apply the classifier • Use the test set to evaluate performance of the classifier
Represent training data numerically • X: instance-feature matrix, size n x m (n rows, m columns) • n training instances Xi • m feature functions fj • Xi,j = fj(Xi), i.e., the value of feature function fj for Xi • Y: vector of labels, length n
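A minimal sketch of this representation in Python with NumPy; the documents, labels, and feature functions below are made up for illustration:

```python
import numpy as np

# Hypothetical toy instances (documents) and their labels
docs = ["the cat sat", "dogs bark loudly", "the dog sat"]
Y = np.array([1, 2, 2])                # Y: vector of labels, length n

# Hypothetical feature functions f_j: each maps an instance to a number
feature_functions = [
    lambda d: d.count("the"),          # f1: count of "the"
    lambda d: len(d.split()),          # f2: number of tokens
    lambda d: 1 if "dog" in d else 0,  # f3: binary feature "mentions dog"
]

# X[i, j] = f_j(X_i): the n x m instance-feature matrix
X = np.array([[f(d) for f in feature_functions] for d in docs])

print(X.shape)  # (3, 3): n = 3 instances, m = 3 features
print(X)
print(Y)
```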
How do we produce a classifier from the instance-feature matrix and label vector? • You’ll see two broad approaches in this class • 1. Geometric view (starting today) • View instances as points in a multidimensional feature space • Algorithms • Perceptron • Support Vector Machine • Decision Tree • 2. Probability theory • Formulate probabilistic model of the data • Later in the course
Geometric interpretation of training data • Instance-feature matrix X • Size n x m (n instances, m features) • Interpret as n data points in an m-dimensional vector space, called the feature space • Label vector Y • Indicates the label/class of each data point • In this lecture I’ll assume there are 2 classes of data (i.e., binary-valued)
Example: 1-dimensional feature space • Generate random points on a number line, from two different classes
2-dimensional space: generate 50 random points from two classes, with x- and y- values in [-10,10)
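One way to generate and plot points like these (a sketch using NumPy and Matplotlib; here 25 points per class, and the class assignment is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# 25 points per class (50 total), coordinates uniform in [-10, 10)
class1 = rng.uniform(-10, 10, size=(25, 2))
class2 = rng.uniform(-10, 10, size=(25, 2))

plt.scatter(class1[:, 0], class1[:, 1], marker="o", label="class 1")
plt.scatter(class2[:, 0], class2[:, 1], marker="x", label="class 2")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.show()
```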
Feature values • Real values • Counts { 0, 1, 2, … } • Binary: { 0, 1 } • Categorical: can represent as a set of binary features • Example: F1 ∈ { val1, val2, val3 } • Replace with: F1_val1 ∈ { 0, 1 }, F1_val2 ∈ { 0, 1 }, F1_val3 ∈ { 0, 1 }
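A small sketch of this categorical-to-binary ("one-hot") expansion; the function name and values are just for illustration:

```python
def one_hot(value, categories):
    """Replace a categorical value with a set of binary features."""
    return [1 if value == c else 0 for c in categories]

print(one_hot("val2", ["val1", "val2", "val3"]))  # [0, 1, 0]
```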
Visualize binary-valued dimensions • 3 dimensions, binary values • All 8 possible data points are shown: ( 1, 1, 1 ), ( 0, 1, 1 ), ( 1, 1, 0 ), ( 0, 1, 0 ), ( 0, 0, 1 ), ( 1, 0, 1 ), ( 1, 0, 0 ), ( 0, 0, 0 )
Combination of real and binary values • Here the 3rd dimension is real-valued, while the first two dimensions remain binary (figure: the same cube as above, with the 3rd axis now a real-valued range)
Feature space sizes • Examples so far: 1, 2 or 3 dimensions • Corresponds to 1, 2, or 3 different features • Useful for teaching purposes but not realistic • Natural language feature space sizes are huge • Vocabulary size in the 10,000s • Position-specific features • Conjunctions of features • Combinatorial explosion of N-grams
Outline • Feature space • Classification models • Model selection • Plotting points and discriminant
Classification under a geometric interpretation • The goal of a classifier is to partition the feature space into regions corresponding to the different classes of data • Different types of models: • Polynomial (linear, quadratic, etc.) • Memory-based • Logical combinations of simpler models • Tree
Examples of classification models • Input to training is a set of labeled points (Xi, Yi) • Each Xi is an m-dimensional feature vector: Xi = (x1, x2, …, xm) • Yi is the label of Xi • Linear model: f(Xi) = w0 + w1*x1 + … + wm*xm = Yi • Quadratic model: f(Xi) = w0 + w1*x1 + … + wm*xm + wm+1*x1^2 + … + w2m*xm^2 = Yi • Memory-based, "pointwise": f(Xi) = Yi where (Xi, Yi) is in the training data • Logical combinations: f(Xi) = combination of the above models
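A sketch of how the quadratic model can be computed by expanding the feature vector and applying a linear model to it; the point and weights below are made up:

```python
import numpy as np

def quadratic_features(x):
    """Expand (x1, ..., xm) to (x1, ..., xm, x1^2, ..., xm^2)."""
    x = np.asarray(x)
    return np.concatenate([x, x ** 2])

# Hypothetical 2-D point and weights (w0 first, then one weight per expanded feature)
x = np.array([2.0, -1.0])
w0, w = 0.5, np.array([1.0, 3.0, -0.5, 2.0])

f = w0 + np.dot(w, quadratic_features(x))  # quadratic model f(X)
print(f)  # 0.5 + 1*2 + 3*(-1) + (-0.5)*4 + 2*1 = -0.5
```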
Equations of lines in 2 dimensions • Let X = (x1 , x2 ). X is a vector of two values x1 and x2, which are coordinate values in a two-dimensional feature space • Suppose f(X) = 3*x1 - 2*x2 – 4 = 0 • More familiar: rename x1 and x2 to x and y, and rewrite equation: 3x – 2y – 4 = 0 -2y = -3x + 4 2y = 3x – 4 y = (3x – 4) / 2 y = 1.5x – 2
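A quick numerical check of this rearrangement (sketch in Python):

```python
import numpy as np

def f(x1, x2):
    return 3 * x1 - 2 * x2 - 4

# Every point on y = 1.5x - 2 should satisfy f(x, y) = 0
for x in np.linspace(-10, 10, 5):
    y = 1.5 * x - 2
    print(x, y, f(x, y))  # the last value is 0 (up to rounding) each time
```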
Linear model • Here are three different lines that separate the points into two classes (f(X) = 0, f(X) = 3, f(X) = -1) • Right of a line: class 1; left of a line: class 2
“Pointwise” model separates data into two classes • f(X) = Y where (X, Y) ∈ training data
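A sketch of such a memory-based model in Python; the training points are made up, and the lookup is by exact match:

```python
# A "pointwise" (memory-based) model: memorize every (X_i, Y_i) training pair.
# Tuples are used so the feature vectors can serve as dict keys.
training_data = {(1.0, 2.0): 1, (3.5, -1.0): 2, (-2.0, 0.5): 1}

def pointwise_classify(x):
    # Defined only for points seen in training; returns None otherwise
    return training_data.get(tuple(x))

print(pointwise_classify((3.5, -1.0)))  # 2
print(pointwise_classify((0.0, 0.0)))   # None: unseen point, no class assigned
```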
Different linear models for 2-dimensional data • Below a line: class 1, above a line: class 2
Pointwise model separates into two classes • Green circle = class 1, orange circle = class 2
Discriminants • A discriminant is a function g(X) that takes a point X and returns a class • Used for classification of new data points • Examples: • Linear discriminant (the Perceptron is an example): • g(X) = w1*x1 + … + wm*xm + w0 = w^T x + w0 • If g(X) > 0, assign class 1 • If g(X) <= 0, assign class 2 • Conjunction of linear discriminants
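A sketch of a linear discriminant used as a classifier, with the weights from the running example (not a trained model):

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant g(x) = w^T x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    return 1 if g(x, w, w0) > 0 else 2

# Weights chosen to match the line used in these slides: g(X) = 3*x1 - 2*x2 - 4
w, w0 = np.array([3.0, -2.0]), -4.0
print(classify(np.array([5.0, 1.0]), w, w0))    # g = 9   > 0  -> class 1
print(classify(np.array([-7.0, -5.0]), w, w0))  # g = -15 <= 0 -> class 2
```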
Constant discriminant g(X) = x1 - 3 divides the number line into two regions • g(X) <= 0: assign class 2 • g(X) > 0: assign class 1
Constant discriminant g(X) = x1 - 3 categorizes new data points • g(X) <= 0: assign class 2 • g(X) > 0: assign class 1
Classify new data with the linear discriminant g(X) = 3*x1 - 2*x2 - 4 • Region where g(X) = 3*x1 - 2*x2 - 4 <= 0 (class 2) • Example: g(-7, -5) = 3*(-7) - 2*(-5) - 4 = -15 <= 0 • Region where g(X) = 3*x1 - 2*x2 - 4 > 0 (class 1)
Conjunction of linear models • g(X) = (x1 <= 3) and (x2 <= .25) (outputs T/F)
Conjunction of linear models • g(X) = (-14 <= x1) and (x1 <= 3) and (-.75 <= x2) and (x2 <= .25)
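A sketch of this conjunctive discriminant, assuming the four tests bound x1 and x2 as a box:

```python
def g(x1, x2):
    """Conjunction of four linear tests: True iff the point falls inside the box."""
    return (-14 <= x1 <= 3) and (-0.75 <= x2 <= 0.25)

print(g(0.0, 0.0))  # True: inside the box
print(g(5.0, 0.0))  # False: x1 > 3
```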
Outline • Feature space • Classification models • Model selection • Plotting points and discriminant
Mostly equivalent terminology • Model • Discriminant • Hypothesis • Decision boundary • Classifier
Issues in model selection • What class of model to choose in the first place • Linear discriminant • Quadratic • Tree • etc. • Choose a specific parameterization • Many possible "hypotheses" for a specific class of model • For example, a linear discriminant can vary in the number of dimensions and in the values of the weights: g(X) = w1*x1 + … + wm*xm + w0 • How well the model performs
How well the model performs • Multiple issues involved: • Separate points of different classes in training data • Generalize to new data (and also accurately classify this new data) • Balance simplicity of model and fit to data • Noisy data • Separability (is the model complex enough?) • Maximum margin
1. Want to separate points in training data • Example: linear discriminant • NOT GOOD: doesn't separate the two classes • GOOD: separates the two classes of training data
Pointwise discriminant: g(X) = Y where (X, Y) ∈ training data • GOOD: separates the two classes of training data
g(X) = a very complicated function GOOD: separates the two classes of training data
Define a loss function • A loss function quantifies the error made by a model on the training set • Could be as simple as the number of misclassified points • In learning algorithms, the parameters of a model are adjusted to minimize the loss function
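A minimal sketch of the simplest such loss, the number of misclassified training points; the data and the threshold model below are made up:

```python
def zero_one_loss(model, data):
    """Number of training points the model misclassifies."""
    return sum(1 for x, y in data if model(x) != y)

# Hypothetical 1-D data and a threshold model g(x) = x - 3
data = [(1.0, 2), (2.5, 2), (4.0, 1), (5.0, 1), (2.0, 1)]  # last point is "noisy"
model = lambda x: 1 if x - 3 > 0 else 2
print(zero_one_loss(model, data))  # 1: only the noisy point is misclassified
```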
2. Model should generalize to new data • We have some problem that we want to model, and sample data from it • The training set is (i.e., should be) a representative random sample of that data • The testing set is also a sample; since it is a sample, it won't be identical to the training set • A good model must be able to generalize: perform well on data it has not seen before • Quantify performance with a test set • Generalization favors simpler models
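A minimal sketch of measuring generalization with a held-out test set; the data here are synthetic, and a fixed discriminant stands in for a trained classifier:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: 100 points in 2-D with labels in {1, 2}
X = rng.uniform(-10, 10, size=(100, 2))
Y = np.where(3 * X[:, 0] - 2 * X[:, 1] - 4 > 0, 1, 2)

# Random train/test split: 80% train, 20% test
perm = rng.permutation(len(X))
train_idx, test_idx = perm[:80], perm[80:]

def classify(x):
    # The linear discriminant from the earlier slides
    return 1 if 3 * x[0] - 2 * x[1] - 4 > 0 else 2

predictions = np.array([classify(x) for x in X[test_idx]])
accuracy = np.mean(predictions == Y[test_idx])
print(accuracy)  # 1.0 here, only because the labels were generated from the same line
```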
Pointwise discriminant does not generalize at all • g(X) = Y where (X, Y) ∈ training data • This function does not assign a class to data it has not seen before! • Figure: new data points from class 1 and class 2 that the model cannot classify
3. Balance simplicity of model vs. fit to data • Want to accurately classify the training data, so that we gain the ability to classify new data • Beware of overfitting: if a model is too complex or too tailored to the training data, generalization will suffer • Occam's Razor: when two models perform equally well, choose the simpler one
Simplicity favors the linear discriminant; the squiggly discriminant is too complex
One-dimensional discriminant (a line parallel to one axis) is too simple to accurately fit the data • g2(X) = x1 - 0.75 • g1(X) = 3*x1 - 2*x2 - 4
Extreme case of overfitting: a model that simply memorizes the training data Fit the training data: very good Simplicity of the model: NO, very complex, must memorize all the training data
4. Noisy data • Data is often noisy • Reasons for noise: • Measurement error • Mislabelled data • Includes data from a different source • If we modify our model to fit to noise, we may overfit, and perform poorly on new data
Noisy data: suppose the training data contains this point in the “wrong” region
Overfitting: selecting a model that overly conforms to the training set
Another example of overfitting • Green squiggle: overfits • A good model (black curve) could actually misclassify data from the training set