
LING / C SC 439/539 Statistical Natural Language Processing

LING / C SC 439/539 Statistical Natural Language Processing. Lecture 6 1/30/2013. Recommended reading. Warning: this lecture is very hand-wavy. Read a book for mathematical details. Hastie Chapters 1 and 2 http://www-stat.stanford.edu/~tibs/ElemStatLearn/ Nilsson Chapter 1

kesler

Presentation Transcript


  1. LING / C SC 439/539 Statistical Natural Language Processing • Lecture 6 • 1/30/2013

  2. Recommended reading • Warning: this lecture is very hand-wavy. Read a book for mathematical details. • Hastie Chapters 1 and 2 • http://www-stat.stanford.edu/~tibs/ElemStatLearn/ • Nilsson Chapter 1 • http://ai.stanford.edu/~nilsson/mlbook.html • Or “chapter 1” of any machine learning textbook

  3. Outline • Feature space • Classification models • Model selection • Plotting points and discriminant

  4. Review of classification • Formulate classification problem • Instances, labels, features • Split annotated corpus into train/test sets • Using the training set: compute features of each instance, train the classifier on these instances, and output a classifier • Apply to new data: compute features, then apply classifier • Use test set to evaluate performance of classifier
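
The train/test split step might be sketched as follows. This is a minimal illustration, not the course's own code: the function name `train_test_split`, the 80/20 ratio, the fixed seed, and the toy corpus are all assumptions made for the example.

```python
import random

def train_test_split(instances, labels, test_fraction=0.2, seed=0):
    """Shuffle (instance, label) pairs and split them into train and test sets."""
    pairs = list(zip(instances, labels))
    random.Random(seed).shuffle(pairs)          # seeded so the split is reproducible
    n_test = int(len(pairs) * test_fraction)
    return pairs[n_test:], pairs[:n_test]       # (training set, testing set)

# Hypothetical annotated corpus: 10 instances with made-up labels 1 and 2.
corpus = [f"instance_{i}" for i in range(10)]
gold = [1 if i % 2 == 0 else 2 for i in range(10)]

train, test = train_test_split(corpus, gold)
print(len(train), len(test))  # 8 2
```

The classifier is trained only on `train`; `test` is held back and used only for evaluation.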

  5. Represent training data numerically • X: instance-feature matrix, size n x m (n rows, m columns) • n training instances Xi • m feature functions fj • Xi,j = fj(Xi), i.e., the value of feature function fj for instance Xi • Y: vector of labels, length n [Figure: matrix X with training instances as rows and feature functions F as columns, beside the label vector Y]
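
As a rough sketch of how X and Y could be built, assume word instances and three feature functions. The feature functions, the words, and the labels below are invented for illustration and are not from the lecture.

```python
# Hypothetical feature functions f_j over word instances (illustrative only).
def f_length(word):
    return len(word)

def f_capitalized(word):
    return 1 if word[:1].isupper() else 0

def f_ends_in_s(word):
    return 1 if word.endswith("s") else 0

feature_functions = [f_length, f_capitalized, f_ends_in_s]  # m = 3 features
instances = ["Tucson", "cats", "runs", "Arizona"]           # n = 4 instances
Y = [1, 2, 2, 1]                                            # made-up labels, length n

# X[i][j] = f_j(X_i): row i is an instance, column j is a feature.
X = [[f(inst) for f in feature_functions] for inst in instances]

for row, label in zip(X, Y):
    print(row, label)
# [6, 1, 0] 1
# [4, 0, 1] 2
# [4, 0, 1] 2
# [7, 1, 0] 1
```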

  6. How do we produce a classifier from the instance-feature matrix and label vector? • You’ll see two broad approaches in this class • 1. Geometric view (starting today) • View instances as points in a multidimensional feature space • Algorithms • Perceptron • Support Vector Machine • Decision Tree • 2. Probability theory • Formulate probabilistic model of the data • Later in the course

  7. Geometric interpretation of training data • Instance-feature matrix X • Size n x m (n instances, m features) • Interpret as n data points in an m-dimensional vector space, called the feature space • Label vector Y • Indicates the label/class of each data point • In this lecture I’ll assume there are 2 classes of data (i.e., binary-valued)

  8. Example: 1-dimensional feature space • Generate random points on a number line, from two different classes

  9. 2-dimensional space: generate 50 random points from two classes, with x- and y- values in [-10,10)
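
One way such a data set could be generated is sketched below. The slide does not say how the two classes were assigned, so this sketch assumes each point is labeled by which side of the line 3*x1 - 2*x2 - 4 = 0 it falls on. That labeling rule and the fixed seed are assumptions.

```python
import random

rng = random.Random(0)  # fixed seed so the sample is reproducible

def random_point():
    # rng.random() lies in [0, 1), so each coordinate lies in [-10, 10).
    return (-10 + 20 * rng.random(), -10 + 20 * rng.random())

points = [random_point() for _ in range(50)]

# Assumed labeling rule (not stated on the slide): side of a fixed line.
labels = [1 if 3 * x1 - 2 * x2 - 4 > 0 else 2 for x1, x2 in points]

print(len(points), labels.count(1), labels.count(2))
```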

  10. Feature values • Real values • Counts { 0, 1, 2, … } • Binary: { 0, 1 } • Categorical: can represent as a set of binary features • Example: F1 є { val1, val2, val3 } • Replace with: F1_val1 є { 0, 1 } F1_val2 є { 0, 1 } F1_val3 є { 0, 1 }
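
The replacement of a categorical feature by binary features can be sketched like this. The helper name `one_hot` is an assumption; the feature and value names are the ones on the slide.

```python
def one_hot(feature_name, value, possible_values):
    """Replace one categorical feature with one binary feature per possible value."""
    return {f"{feature_name}_{v}": int(v == value) for v in possible_values}

encoded = one_hot("F1", "val2", ["val1", "val2", "val3"])
print(encoded)  # {'F1_val1': 0, 'F1_val2': 1, 'F1_val3': 0}
```

Exactly one of the new binary features is 1 for any instance.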

  11. Visualize binary-valued dimensions • 3 dimensions, binary values • All 8 possible data points are shown (the corners of a cube): ( 1, 1, 1 ) ( 0, 1, 1 ) ( 1, 1, 0 ) ( 0, 1, 0 ) ( 0, 0, 1 ) ( 1, 0, 1 ) ( 1, 0, 0 ) ( 0, 0, 0 )
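
The points in a binary feature space can be enumerated directly. With three binary features there are 2**3 = 8 combinations, which a short sketch confirms:

```python
from itertools import product

# Every combination of three binary feature values: the corners of the unit cube.
corners = list(product([0, 1], repeat=3))

print(len(corners))              # 8
print(corners[0], corners[-1])   # (0, 0, 0) (1, 1, 1)
```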

  12. Combination of real and binary values • Here the 3rd dimension is real-valued ( 1, 1, 1 ) ( 0, 1, 1 ) ( 1, 1, 0 ) ( 0, 1, 0 ) ( 0, 0, 1 ) ( 1, 0, 1 ) ( 1, 0, 0 ) ( 0, 0, 0 )

  13. Feature space sizes • Examples so far: 1, 2 or 3 dimensions • Corresponds to 1, 2, or 3 different features • Useful for teaching purposes but not realistic • Natural language feature space sizes are huge • Vocabulary size in the 10,000s • Position-specific features • Conjunctions of features • Combinatorial explosion of N-grams
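
To get a feel for the combinatorial explosion, assume a vocabulary of exactly 10,000 word types (a round figure chosen for illustration) and one binary feature per possible N-gram:

```python
vocabulary_size = 10_000  # assumed round figure, "in the 10,000s"

# Number of distinct N-grams that could in principle occur: V ** N.
ngram_counts = {n: vocabulary_size ** n for n in (1, 2, 3)}

for n, count in ngram_counts.items():
    print(f"{n}-grams: {count:,} possible features")
# 1-grams: 10,000 possible features
# 2-grams: 100,000,000 possible features
# 3-grams: 1,000,000,000,000 possible features
```

Most of these N-grams never occur in any corpus, but the dimensionality of the feature space still grows far beyond what can be drawn in 2 or 3 dimensions.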

  14. Outline • Feature space • Classification models • Model selection • Plotting points and discriminant

  15. Classification under a geometric interpretation • The goal of a classifier is to partition the feature space into regions corresponding to the different classes of data • Different types of models: • Polynomial (linear, quadratic, etc.) • Memory-based • Logical combinations of simpler models • Tree

  16. Examples of classification models • Input to training is a set of labeled points (Xi, Yi) • Each Xi is an n-dimensional vector: Xi = (x1, x2, …, xn), where n here is the number of features • Yi is the label of Xi • Linear model: f(Xi) = w0*1 + w1*x1 + … + wn*xn = Yi • Quadratic model: f(Xi) = w0 + w1*x1 + … + wn*xn + w(n+1)*x1^2 + … + w(2n)*xn^2 = Yi • Memory-based, “pointwise”: f(Xi) = Yi where (Xi, Yi) is in the training data • Logical combinations: f(Xi) = combination of above models
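
The linear and quadratic models can be written as small functions. This is a sketch under one assumption about the weight layout: the weights are stored in a single list, with w0 first, then the n linear weights, then n weights for the squared terms (no cross terms such as x1*x2). The example weights are made up.

```python
def linear_model(w, x):
    """f(X) = w0 + w1*x1 + ... + wn*xn, with w = [w0, w1, ..., wn]."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def quadratic_model(w, x):
    """Linear terms plus one squared term per dimension (no cross terms).

    Assumed layout: w = [w0, w1..wn, then n weights for x1^2..xn^2].
    """
    n = len(x)
    squared_terms = sum(wi * xi ** 2 for wi, xi in zip(w[n + 1:], x))
    return linear_model(w[:n + 1], x) + squared_terms

print(linear_model([-4, 3, -2], [2, 1]))            # -4 + 3*2 - 2*1 = 0
print(quadratic_model([1, 2, 3, 4, 5], [1, 2]))     # 1 + 2 + 6 + 4*1 + 5*4 = 33
```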

  17. Equations of lines in 2 dimensions • Let X = (x1, x2). X is a vector of two values x1 and x2, which are coordinate values in a two-dimensional feature space • Suppose f(X) = 3*x1 - 2*x2 - 4 = 0 • More familiar: rename x1 and x2 to x and y, and rewrite the equation: 3x - 2y - 4 = 0 ⇒ -2y = -3x + 4 ⇒ 2y = 3x - 4 ⇒ y = (3x - 4) / 2 ⇒ y = 1.5x - 2
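
A quick numeric check of the rewriting: every point on y = 1.5x - 2 should make f(X) equal to zero. The sample x values are arbitrary.

```python
def f(x1, x2):
    return 3 * x1 - 2 * x2 - 4

def y_on_line(x):
    return 1.5 * x - 2

for x in (-2, 0, 2, 4):
    y = y_on_line(x)
    print(x, y, f(x, y))  # the last value is 0.0 for every x
```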

  18. Linear model • Here are 3 different lines that separate the points into two classes • Right of a line: class 1, left of a line: class 2 [Figure labels: f(X) = 0, f(X) = 3, f(X) = -1]

  19. “Pointwise” model separates data into two classes • f(X) = Y where (X, Y) ∈ training data [Figure labels: class 1 data points, class 2 data points]

  20. Different linear models for 2-dimensional data • Below a line: class 1, above a line: class 2

  21. Higher-order polynomials also separate the points

  22. Pointwise model separates into two classes • Green circle = class 1, orange circle = class 2

  23. Discriminants • A discriminant is a function g(X) that takes a point X and returns a class • Used for classification of new data points • Examples: • Linear discriminant (the Perceptron is an example): • g(X) = w1*x1 + … + wn*xn + w0 = w^T x + w0 • If g(X) > 0, assign class 1 • If g(X) <= 0, assign class 2 • Conjunction of linear discriminants

  24. Constant discriminant g(X) = x1 - 3 divides number line into two regions • g(X) <= 0: assign class 2 • g(X) > 0: assign class 1

  25. Constant discriminant g(X) = x1 - 3 divides number line into two regions • g(X) <= 0: assign class 2 • g(X) > 0: assign class 1

  26. Constant discriminant g(X) = x1 - 3 categorizes new data points • g(X) <= 0: assign class 2 • g(X) > 0: assign class 1
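
The one-dimensional discriminant g(X) = x1 - 3 puts the class boundary at x1 = 3. A minimal sketch of classifying new points with it, where the test points are arbitrary:

```python
def g(x1):
    return x1 - 3

def classify(x1):
    # g(X) > 0: class 1; g(X) <= 0: class 2 (a point exactly at 3 gets class 2).
    return 1 if g(x1) > 0 else 2

for x1 in (-2, 3, 3.5, 10):
    print(x1, classify(x1))
# -2 2
# 3 2
# 3.5 1
# 10 1
```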

  27. Linear discriminant: g(X) = 3*x1 - 2*x2 - 4

  28. Classify new data with the linear discriminant g(X) = 3*x1 - 2*x2 - 4 • Region where g(X) = 3*x1 - 2*x2 - 4 <= 0: assign class 2 • Example: g(-7, -5) = 3*(-7) - 2*(-5) - 4 = -15 <= 0 • Region where g(X) = 3*x1 - 2*x2 - 4 > 0: assign class 1
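
The same computation in code, written for any number of dimensions. The helper names `linear_discriminant` and `classify` are assumptions, and the second test point (5, 0) is made up. The weights and the point (-7, -5) are the ones on the slide.

```python
def linear_discriminant(weights, bias):
    """Return g(X) = w1*x1 + ... + wn*xn + w0 for fixed weights and bias w0."""
    def g(x):
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return g

def classify(g, x):
    return 1 if g(x) > 0 else 2

g = linear_discriminant(weights=[3, -2], bias=-4)

print(g((-7, -5)), classify(g, (-7, -5)))  # -15 2
print(g((5, 0)), classify(g, (5, 0)))      # 11 1
```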

  29. Conjunction of linear models • g(X) = x1 <= 3 and x2 <= .25 (outputs T/F)

  30. Conjunction of linear models • g(X) = x1 <= 3 and -14 <= x2 and x2 <= .25 and -.75 <= x2
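
A conjunction of linear tests can be sketched as a function that returns True only when every test holds. The helper `conjunction` is an assumption made for illustration. The two conditions are the ones from the first conjunction example, and the test points are made up.

```python
def conjunction(*conditions):
    """Combine boolean tests on a point; the result is True only if all of them hold."""
    def g(x1, x2):
        return all(cond(x1, x2) for cond in conditions)
    return g

g = conjunction(
    lambda x1, x2: x1 <= 3,
    lambda x1, x2: x2 <= 0.25,
)

print(g(2, 0))   # True:  both conditions hold
print(g(2, 1))   # False: x2 > .25
print(g(5, 0))   # False: x1 > 3
```

Adding more conditions carves out a smaller region of the feature space, such as a box.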

  31. Outline • Feature space • Classification models • Model selection • Plotting points and discriminant

  32. Mostly equivalent terminology • Model • Discriminant • Hypothesis • Decision boundary • Classifier

  33. Issues in model selection • What class of model to choose in the first place • Linear discriminant • Quadratic • Tree • etc. • Choose a specific parameterization • Many possible “hypotheses” for a specific class of model • For example, a linear discriminant can vary in the number of dimensions, and values for the weights g(X) = w1*x1 + … + wn*xn + w0 • How well the model performs

  34. How well the model performs • Multiple issues involved: • Separate points of different classes in training data • Generalize to new data (and also accurately classify this new data) • Balance simplicity of model and fit to data • Noisy data • Separability (is the model complex enough?) • Maximum margin

  35. 1. Want to separate points in training data • Example: linear discriminant • NOT GOOD: doesn’t separate the two classes • GOOD: separates the two classes of training data

  36. Pointwise discriminant: g(X) = Y where (X, Y) ∈ training data • GOOD: separates the two classes of training data

  37. g(X) = a very complicated function GOOD: separates the two classes of training data

  38. Define a loss function • A loss function quantifies the error made by a model on the training set • Could be as simple as the number of misclassified points • In learning algorithms, parameters of a model are adjusted to minimize the loss function
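
The simplest loss mentioned above, the number of misclassified training points, might look like this. The discriminant is the lecture's g(X) = 3*x1 - 2*x2 - 4. The four training points and their labels are made up, with one label deliberately on the wrong side of the line.

```python
def g(x):
    x1, x2 = x
    return 3 * x1 - 2 * x2 - 4

def predict(x):
    return 1 if g(x) > 0 else 2

def misclassification_count(points, labels):
    """Loss = number of training points whose predicted class differs from the label."""
    return sum(1 for x, y in zip(points, labels) if predict(x) != y)

train_points = [(4, 0), (0, 3), (-7, -5), (2, -1)]
train_labels = [1, 2, 1, 1]   # (-7, -5) is labeled 1 but lies on the class-2 side

print(misclassification_count(train_points, train_labels))  # 1
```

A learning algorithm would adjust the weights 3, -2 and -4 to drive this count down.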

  39. 2. Model should generalize to new data • We have some problem that we want to model, and we sample data from it: • The training set is (i.e., should be) a representative random sample of data. • The testing set is also a sample of data. Since it’s a sample, it won’t be identical to the training set. • A good model must be able to generalize: perform well on data it has not seen before • Quantify performance: use a test set • Generalization favors simpler models

  40. What classes should new data points be assigned to?

  41. Pointwise discriminant does not generalize at all • g(X) = Y where (X, Y) ∈ training data • This function does not assign a class to data it has not seen before! [Figure labels: new data, class 1; new data, class 2]
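
A pointwise discriminant amounts to a lookup table over the training data. In this sketch the stored points are made up, and returning None for an unseen point is one possible way to represent "no class assigned":

```python
training_data = {(4, 0): 1, (0, 3): 2, (-7, -5): 2}  # made-up (X, Y) pairs

def pointwise(x):
    """Return the stored label, or None when x never appeared in training."""
    return training_data.get(tuple(x))

print(pointwise((4, 0)))   # 1: seen in training
print(pointwise((4, 1)))   # None: a nearby but unseen point gets no class
```

The table fits the training set perfectly, yet it has nothing to say about any point outside it.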

  42. 3. Balance simplicity of model vs. fit to data • Want to accurately classify training data, such that we gain the ability to classify new data • Beware of overfitting: if a model is too complex or too tailored to the training data, generalization will suffer • Occam’s Razor: when two models perform equally well, choose the simpler one

  43. Simplicity favors linear discriminant; squiggly discriminant is too complex

  44. One-dimensional discriminant (line parallel to one axis) is too simple to accurately fit the data • g2(X) = x1 - 0.75 • g1(X) = 3*x1 - 2*x2 - 4

  45. Extreme case of overfitting: a model that simply memorizes the training data • Fit to the training data: very good • Simplicity of the model: NO, very complex, must memorize all the training data

  46. 4. Noisy data • Data is often noisy • Reasons for noise: • Measurement error • Mislabelled data • Includes data from a different source • If we modify our model to fit to noise, we may overfit, and perform poorly on new data

  47. Noisy data: suppose the training data contains this point in the “wrong” region

  48. Overfitting: selecting a model that overly conforms to the training set

  49. Overfitting leads to misclassification of new data

  50. Another example of overfitting • Green squiggle: overfits • A good model (black curve) could actually misclassify data from the training set
