Dan Roth University of Illinois, Urbana-Champaign danr@cs.uiuc L2R.cs.uiuc/~danr

Tutorial on Machine Learning in Natural Language Processing and Information Extraction 3.5 Summary and Introduction • What we did • Issues in NLP & Information Access • Classification: Practice • Algorithms • Features • Outline • Probabilistic; Semi-Supervised; Structures Dan Roth University of Illinois, Urbana-Champaign danr@cs.uiuc.edu http://L2R.cs.uiuc.edu/~danr

What we did • Presented an introduction to problems in Processing Natural Language • Motivated a Machine Learning based approach • Variability and Ambiguity • Discussed a general Paradigm for Learning Classifiers

Tools • A collection of tools that are essential for any intelligent use of text… • Robust text analysis tools • Segmentation; Tokenization; POS tagging; Shallow parsing • Named Entity Classifiers • people; locations; organizations; transportation; materials… • Information Extraction • functional phrases (e.g., job descriptions; acquisitions) • Relations/Event recognizers • born_in(A,B); capital_of(C,D); killed(A,B) • Categorization tools: Topics, Sentiments, … • On-the-fly categorization; email; documents about “American Politics” • Information Integration • Text and Databases

How do we get there?

Classification: Ambiguity Resolution • Illinois’ bored of education [board] • Nissan car and truck plant; plant and animal kingdom • (This Art) (can N) (will MD) (rust V); V, N, N • The dog bit the kid. He was taken to a veterinarian; a hospital • Tiger was in Washington for the PGA Tour → Finance; Banking; World News; Sports • Important or not important; love or hate

Classification • The goal is to learn a function f: X → Y that maps observations in a domain to one of several categories. • Task: Decide which of {board, bored} is more likely in the given context: • X: some representation of: The Illinois’ _______ of education met yesterday… • Y: {board, bored} • Typical learning protocol: • Observe a collection of labeled examples (x, y) ∈ X × Y • Use it to learn a function f: X → Y that is consistent with the observed examples, and (hopefully) performs well on new, previously unobserved examples.

Classification is Well Understood • Theoretically: generalization bounds • How many examples does one need to see in order to guarantee good behavior on previously unobserved examples? • Algorithmically: good learning algorithms for linear representations. • Can deal with very high dimensionality (10^6 features) • Very efficient in terms of computation and # of examples. On-line. • Key issues remaining: • Learning protocols: how to minimize interaction (supervision); how to map domain/task information to supervision; semi-supervised learning; active learning; ranking. • What are the features? No good theoretical understanding here.

Disambiguation Problems • Middle Eastern ____ are known for their sweetness • Task: Decide which of {deserts, desserts} is more likely in the given context. • Ambiguity: modeled as confusion sets (class labels C) • C = {deserts, desserts} • C = {Noun, Adj., Verb, …} • C = {topic=Finance, topic=Computing} • C = {NE=Person, NE=Location} • C = {opinion=Like, opinion=Hate}

Disambiguation Problems • Archetypical disambiguation problem • Data is available (?) • In principle, a solved problem: Golding & Roth (96, 99); Carlson, Roth, Rosen (00); Mangu & Brill (01), … • But many issues are involved in making an “in principle” solution a realistic one

Learning to Disambiguate • Given: a confusion set C = {deserts, desserts} and a sentence (s): Middle Eastern ____ are known for their sweetness • Map into a feature-based representation • Learn a function F_C that determines which of C = {deserts, desserts} is more likely in a given context. • Evaluate the function on future C sentences • How to learn?
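The four steps on this slide (confusion set, feature map, learned function, evaluation) can be sketched end to end in a few lines. This is a minimal illustration only: the bag-of-context-words features, the plain perceptron learner, and the three extra training sentences are assumptions made for the example, not the setup used in the cited work.

```python
from collections import defaultdict

CONFUSION_SET = ("deserts", "desserts")  # mapped to labels -1 / +1

def features(sentence):
    """Feature map (an assumption): lower-cased context words, blank skipped."""
    return {"w=" + tok.lower() for tok in sentence.split() if tok != "____"}

def train_perceptron(examples, epochs=10):
    """Mistake-driven perceptron over sparse binary features."""
    w = defaultdict(float)
    for _ in range(epochs):
        for sentence, label in examples:
            y = 1 if label == CONFUSION_SET[1] else -1
            active = features(sentence)
            score = sum(w[f] for f in active)
            if y * score <= 0:          # wrong (or undecided): update
                for f in active:
                    w[f] += y
    return w

def predict(w, sentence):
    """F_C: pick the confusion-set element favored by the learned weights."""
    score = sum(w.get(f, 0.0) for f in features(sentence))
    return CONFUSION_SET[1] if score > 0 else CONFUSION_SET[0]

# Hypothetical training sentences; only the first comes from the slide.
train = [
    ("Middle Eastern ____ are known for their sweetness", "desserts"),
    ("The chef served chocolate ____ after dinner", "desserts"),
    ("Camels cross the sandy ____ of Arabia", "deserts"),
    ("The hot dry ____ receive little rain", "deserts"),
]
w = train_perceptron(train)
print(predict(w, "Sandy ____ of Arabia"))
print(predict(w, "chocolate ____ for dinner"))
```

With so few sentences the learned weights mostly memorize the training vocabulary; the point is only the shape of the pipeline.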

You've Got Many Choices • Learning Protocols • Supervised, Unsupervised, Semi-Supervised • Probabilistic Classifiers; Discriminative Classifiers • Supervised Classification • Weka • Nearest Neighbor • Support Vector Machines • Decision Trees • ... • SVM-light; Lib-SVM • Support Vector Machines • SNoW • Naïve Bayes • Regularized averaged Perceptron/Winnow • Learning Based Java (LBJ): supports system building with learning components • Which one should you choose?

Issues • Accuracy: • They are not that different • Choosing features is important • Some offer tools and more flexibility • Tuning parameters is very important • Training/Testing speed is important • Training: mostly for developmental/experimental efficiency • There could be 1-2 orders of magnitude difference

Accuracy • Models • Naïve Bayes Classifier • Generative classifier: P(y,x) • Perceptron • Winnow • Logistic Regression • Support Vector Machine • All linear classifiers! • Multiple ways to obtain the separating hyperplane • Not that different • Why? • Speed, Theory, Different requirements, ...
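One way to see why naïve Bayes belongs in a list of linear classifiers is to write out its log-odds. The derivation below is a sketch for binary features x_i ∈ {0,1} and two classes; the symbols π_c = P(y=c) and p_{ic} = P(x_i = 1 | y = c) are notation introduced here, not taken from the slides.

```latex
\begin{align*}
\log\frac{P(y=1\mid x)}{P(y=0\mid x)}
 &= \log\frac{\pi_1}{\pi_0}
  + \sum_{i=1}^{n}\left[x_i\log\frac{p_{i1}}{p_{i0}}
  + (1-x_i)\log\frac{1-p_{i1}}{1-p_{i0}}\right] \\
 &= \underbrace{\log\frac{\pi_1}{\pi_0}
  + \sum_{i=1}^{n}\log\frac{1-p_{i1}}{1-p_{i0}}}_{b}
  \;+\; \sum_{i=1}^{n} x_i\,
    \underbrace{\log\frac{p_{i1}\,(1-p_{i0})}{p_{i0}\,(1-p_{i1})}}_{w_i}
\end{align*}
```

Predicting y = 1 when this quantity is positive is the rule w · x + b > 0, so under these assumptions naïve Bayes picks a separating hyperplane just as Perceptron, Winnow, logistic regression, and SVM do; the methods differ in how the weights are obtained.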

Accuracy (cont.) • Discriminative vs. Generative Classifiers [Roth98,99, Ng and Jordan 01] • When the number of labeled instances is small • Generative Model (naïve Bayes) is better • When the number of labeled instances is large • Discriminative Models are better (note: output can be interpreted probabilistically) • Support Vector Machines • Perceptron • Winnow • Logistic regression • Usually no big difference between discriminative models • Usually < 1-3% depending on tuning • Usually you want to use discriminative models

Features • Feature Vectors: Your object (sentence; document) representation • Determined by what you think drives the abstraction • This is VERY important • More important than the algorithms you use • For example, • Named entity classification: First character is capitalized or not • Document classification: Bag of words approach is usually good • NLP applications • Need to develop good features • Can use existing tools for feature extraction • Use LBJ

Tuning Parameters • Very important • Key to achieving the “optimal” performance • Different learning methods perform similarly • Only if good parameters are used for each classifier • Held-out dataset • Train on one dataset, test on another dataset • (averaged; regularized) Perceptron • Not too many parameters to tune • Thick separator (regularization term) (-S): 0, 0.5, 1, 1.5, 2 • Number of iterations (-t): 10, 20, 30, ... • Learning rate: 0.1, 0.01, 0.001, 0.0001 • +V (averaged perceptron): always use it!
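The held-out protocol on this slide can be sketched as a small grid search. The learner below is a plain-Python averaged perceptron with a thick separator, written for illustration; it is not SNoW, and the parameter names `rate`, `rounds`, and `thickness` are hypothetical stand-ins for the learning rate, the iteration count, and the separator thickness. The toy data are invented.

```python
from itertools import product

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_avg_perceptron(data, rate, rounds, thickness):
    """Averaged perceptron; updates whenever the margin y*(w.x) <= thickness."""
    dim = len(data[0][0])
    w = [0.0] * dim
    total = [0.0] * dim            # running sum of weight vectors
    steps = 0
    for _ in range(rounds):
        for x, y in data:
            if y * dot(w, x) <= thickness:
                w = [wi + rate * y * xi for wi, xi in zip(w, x)]
            total = [ti + wi for ti, wi in zip(total, w)]
            steps += 1
    return [ti / steps for ti in total]   # averaged weights

def accuracy(w, data):
    hits = sum(1 for x, y in data if (1 if dot(w, x) > 0 else -1) == y)
    return hits / len(data)

# Invented toy data; the last component is a constant bias feature.
train = [([2.0, 1.0, 1.0], 1), ([1.0, 3.0, 1.0], 1),
         ([-1.0, -2.0, 1.0], -1), ([-2.0, -1.0, 1.0], -1)]
held_out = [([1.5, 1.5, 1.0], 1), ([-1.5, -0.5, 1.0], -1)]

# Train on one set, score every parameter setting on the other.
grid = list(product([0.1, 0.01], [10, 20], [0.0, 0.5, 1.0]))
best = max(grid, key=lambda p: accuracy(train_avg_perceptron(train, *p), held_out))
best_accuracy = accuracy(train_avg_perceptron(train, *best), held_out)
print(best, best_accuracy)
```

The final reported number should come from a third, untouched test set, since the held-out set has been used to choose the parameters.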

Speed • Training speed is very important! • You do not want to wait 4 weeks to finish training your model • You need to tune parameters • Trying many parameters with slow training algorithms can be a nightmare • Testing speed is very important • Evaluation time is typically not a problem • The key issue here is the time to extract features • This is sometimes not taken into account in experiments and is NEVER mentioned in research papers. • Why?

Conclusion • SNoW is a good candidate • Supports naïve Bayes and (averaged, regularized) Perceptron • Very efficient • Usually much faster than using SVM • Feature Extraction support (Fex) • Voted Perceptron or Winnow are usually the ones you want • -V option • If you use Java, try LBJ • Other packages • There are other good packages. Make sure they are fast enough. • Make sure they support sparse representations well • Selecting good features and tuning parameters is key.

More on Representation S= I don’t know whether to laugh or cry [x x x x] Consider words, pos tags, relative location in window Two issues: - Technical issue of how to represent - Deeper issue of expressivity

Whether/Weather • Functions Can Be Made Linear • x1 x2 x4 ∨ x2 x4 x5 ∨ x1 x3 x7 becomes y3 ∨ y4 ∨ y7 • New discriminator is functionally simpler • Space: X = x1, x2, …, xn • Input transformation • New space: Y = {y1, y2, …} = {xi, xi xj, xi xj xk}
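The transformation on this slide can be checked mechanically: once every conjunction of up to three variables becomes its own feature, the three-term DNF is a simple threshold over three of the new features. The sketch below assumes n = 7 Boolean inputs and 0-based indices; the function names are made up for the example.

```python
from itertools import combinations, product

def expand(x, max_size=3):
    """Map Boolean vector x to all monomials of up to max_size variables.

    Returns a dict keyed by the tuple of variable indices in the monomial.
    """
    return {idx: int(all(x[i] for i in idx))
            for k in range(1, max_size + 1)
            for idx in combinations(range(len(x)), k)}

def target(x):
    """x1 x2 x4 OR x2 x4 x5 OR x1 x3 x7 (slide indices are 1-based)."""
    return bool((x[0] and x[1] and x[3]) or
                (x[1] and x[3] and x[4]) or
                (x[0] and x[2] and x[6]))

# The three monomials that play the role of y3, y4, y7 on the slide.
TERMS = [(0, 1, 3), (1, 3, 4), (0, 2, 6)]

def linear_in_new_space(x):
    """Weight 1 on each of the three monomial features, threshold 1."""
    y = expand(x)
    return sum(y[t] for t in TERMS) >= 1

agree = all(target(x) == linear_in_new_space(x)
            for x in product((0, 1), repeat=7))
n_new_features = len(expand((0,) * 7))
print(agree, n_new_features)
```

The price of the simpler discriminator is the size of the new space: 7 inputs already give 63 monomial features, and the count grows as O(n^3).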

Feature Space • Data are not separable in one dimension • Not separable if you insist on using a specific class of functions • [Figure: data points along the x axis]

Blown Up Feature Space • Data are separable in <x, x^2> space • Key issue: what features to use. • Computationally, can be done implicitly (kernels) • But there are warnings. • [Figure: the same data plotted with axes x and x^2]
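A small numeric illustration of the blown-up feature space, using invented points: the positive class lies on both sides of the negative class, so no single threshold on x works, but the map x → (x, x^2) makes a horizontal line a perfect separator.

```python
# Invented 1-D data: (x, label). Positives sit on both sides of the negatives.
points = [(-2.0, 1), (-1.5, 1), (-0.5, -1), (0.0, -1),
          (0.5, -1), (1.5, 1), (2.0, 1)]

def separable_1d(points):
    """True if one threshold on x (either orientation) splits the classes."""
    pos = [x for x, y in points if y == 1]
    neg = [x for x, y in points if y == -1]
    return max(pos) < min(neg) or max(neg) < min(pos)

def phi(x):
    """Explicit feature map into <x, x^2> space."""
    return (x, x * x)

# In the new space the line x^2 = 1 separates: weights (0, 1), bias -1.
w, b = (0.0, 1.0), -1.0
preds = [1 if w[0] * phi(x)[0] + w[1] * phi(x)[1] + b > 0 else -1
         for x, _ in points]
labels = [y for _, y in points]
print(separable_1d(points), preds == labels)
```

Here the map is applied explicitly; a kernel method would obtain the same separator without ever materializing phi(x).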

Learning Approach: Representation • S = I don’t know whether to laugh or cry [x x x x] • Consider words, POS tags, relative location in window • Generate binary features representing presence of: • a word/POS within window around target word: don’t within +/-3; know within +/-3; Verb at -1; to within +/-3; laugh within +/-3; to at +1 • conjunctions of size 2, within window of size 3: • words: know__to; ___to laugh • POS+words: Verb__to; ____to Verb
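A rough sketch of how such window features might be generated. The feature-name strings, the coarse POS tags assigned by hand, and the choice to pair each context item with its nearest neighbor (skipping over the target) are all assumptions made for this example; they are one reading of the slide, not the Fex implementation.

```python
def window_features(tokens, tags, target, k=3):
    """Binary (present/absent) features around position `target`."""
    feats = set()
    context = [i for i in range(max(0, target - k),
                                min(len(tokens), target + k + 1))
               if i != target]
    # Single-item features: word in window, word at offset, POS at offset.
    for i in context:
        off = i - target
        feats.add(f"word={tokens[i]} within +/-{k}")
        feats.add(f"word={tokens[i]} at {off:+d}")
        feats.add(f"pos={tags[i]} at {off:+d}")
    # Conjunctions of size 2 over neighboring context items.
    for a, b in zip(context, context[1:]):
        where = f"at {a - target:+d},{b - target:+d}"
        feats.add(f"words={tokens[a]}&{tokens[b]} {where}")
        feats.add(f"pos+word={tags[a]}&{tokens[b]} {where}")
        feats.add(f"word+pos={tokens[a]}&{tags[b]} {where}")
    return feats

tokens = ["I", "don't", "know", "whether", "to", "laugh", "or", "cry"]
# Hand-assigned coarse tags (hypothetical; no tagger is run here).
tags = ["Pron", "Aux", "Verb", "Comp", "To", "Verb", "Conj", "Verb"]
feats = window_features(tokens, tags, target=tokens.index("whether"))
for f in sorted(feats):
    print(f)
```

Each string is one active binary feature; every feature not in the set is implicitly 0 for this example.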

Learning Approach: Representation • S = I don’t know whether to laugh or cry • Is represented as a set of its active features: S = (don’t at -2, know within +/-3, …, ____to Verb, ...) • Label = the confusion set element that occurs in the text • Hope: S = I don’t care whether to laugh or cry has almost the same representation • When you define features, you actually define TYPES of features (e.g., I care about “words” in my documents). • Only when data is observed do these TYPES become real features that are used in the representation of your objects
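The two points on this slide, labels that come for free from the text and near-identical representations for similar sentences, can be illustrated with a short sketch. The single feature type used here ("word at offset") and the helper names are assumptions for the example.

```python
def make_example(sentence, confusion_set):
    """Blank out the confusion-set word; that word itself is the label."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in confusion_set:
            return tokens[:i] + ["____"] + tokens[i + 1:], tok.lower()
    return None  # sentence contains no member of the confusion set

def active_features(tokens):
    """One feature TYPE ('word at offset'); the data instantiate it."""
    t = tokens.index("____")
    return {f"{tok.lower()} at {i - t:+d}"
            for i, tok in enumerate(tokens) if i != t}

C = {"whether", "weather"}
ctx1, y1 = make_example("I don't know whether to laugh or cry", C)
ctx2, y2 = make_example("I don't care whether to laugh or cry", C)
f1, f2 = active_features(ctx1), active_features(ctx2)
shared = f1 & f2
print(y1, y2, len(shared), sorted(f1 - f2))
```

Both sentences receive the label "whether", and their feature sets differ only in the feature instantiated by the word at offset -1, which is the sense in which the two representations are almost the same.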

Notes on Representation • There is a huge number of potential features (~10^5). • Out of these, only a small number is actually active in each example. • The representation can be significantly smaller if we list only the features that are active in each example. • Some algorithms can take this into account. Some cannot. (Later).
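A sketch of the size difference, with made-up indices and weights: the dense vector stores one slot per potential feature, while the sparse form lists only the active ones, and a linear score needs to touch only those.

```python
def to_sparse(dense):
    """Keep only the indices of the active (non-zero) features."""
    return [i for i, v in enumerate(dense) if v]

def sparse_dot(weights, active):
    """Score an example by visiting only its active features."""
    return sum(weights.get(i, 0.0) for i in active)

n_features = 100_000                 # order of magnitude from the slide
active = [17, 4203, 99_998]          # hypothetical active feature indices
dense = [0] * n_features
for i in active:
    dense[i] = 1

weights = {17: 0.5, 4203: -0.2, 512: 1.0}   # hypothetical sparse weights
dense_score = sum(weights.get(i, 0.0) * v for i, v in enumerate(dense))
sparse_score = sparse_dot(weights, to_sparse(dense))
print(len(dense), len(to_sparse(dense)), sparse_score)
```

The two scores agree, but the sparse computation does 3 lookups instead of 100,000 multiplications, which is why sparse-representation support matters when choosing a package.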

Notes on Representation (2) • Formally: A feature = a characteristic function over sentences • When the number of features is fixed, the collection of all examples is • When we do not want to fix the number of features (very large number, on-line algorithms, …) can work in the infinite attribute domain • Several learning algorithms (e.g., SNoW) are designed to support variable size examples.
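A minimal sketch of learning with variable-size examples, in the spirit of the infinite attribute domain: the learner never fixes the number of features and allocates a weight only when a feature first takes part in an update. The class below is a simple Winnow-style learner written for illustration; it is not SNoW's implementation, and the values of `alpha`, `threshold`, and the default weight of 1.0 are assumptions.

```python
class SparseWinnow:
    """Winnow-style on-line learner over an open-ended feature set."""

    def __init__(self, alpha=2.0, threshold=2.0):
        self.alpha = alpha
        self.threshold = threshold
        self.w = {}                      # grows as features are updated

    def score(self, active):
        # Features never updated so far carry the default weight 1.0.
        return sum(self.w.get(f, 1.0) for f in active)

    def predict(self, active):
        return self.score(active) >= self.threshold

    def update(self, active, label):
        """Multiplicative, mistake-driven update on active features only."""
        if self.predict(active) == label:
            return
        factor = self.alpha if label else 1.0 / self.alpha
        for f in active:
            self.w[f] = self.w.get(f, 1.0) * factor

# Examples are sets of active feature names and may differ in size.
stream = [({"sweet", "chocolate"}, True), ({"sand"}, False),
          ({"sand", "dry", "hot"}, False), ({"sweet"}, True)]
model = SparseWinnow()
for active, label in stream:
    model.update(active, label)
print(sorted(model.w.items()))
```

Memory and update cost depend on the features actually seen in mistakes, not on the size of the potential feature space.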

Dealing with Structures • Example sentence: Mohammed Atta met with an Iraqi intelligence agent in Prague in April 2001. • [Figure: graph over the sentence with attributes (node labels) and roles (edge labels); labels shown include meeting, participant, person, name(“Mohammed Atta”), gender(male), location, time, nationality, affiliation, country, organization, name(“Czech Republic”), city, date, month(April), year(2001), name(Iraq), name(Prague), begin, end, before, after, word(an) tag(DT), word(Iraqi) tag(JJ), word(intelligence) tag(NN)] • Learn this structure (many dependent classifiers; finding the best coherent structure → INFERENCE) • Extract features from this structure → INFERENCE

Output Data • Example sentence: Mohammed Atta met with an Iraqi intelligence agent in Prague in April 2001. • [Figure: the same graph of attributes (node labels) and roles (edge labels) over the sentence, shown as the output data]

Structured Input • S = John will join the board as a director • [Figure: graph over the words John, will, join, the, board, as, a, director, with node attributes Word=, POS=, IS-A=, …] • [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?) • afternoon, Dr. Ab C … in Ms. De. F class.. • Knowledge Representation

Learning From Structured Input • We want to extract features from structured domain elements • their internal (hierarchical) structure should be encoded. • A feature is a mapping from the instance space to {0,1} or [0,1] • With an appropriate representation language it is possible to represent expressive features that constitute an infinite dimensional space [FEX] • Learning can be done in the infinite attribute domain. • What does it mean to extract features? • Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements) • Computationally: Some kind of graph matching process • Challenge: • Provide the expressivity necessary to deal with large scale and highly structured domains • Meet the strong tractability requirements for these tasks.

Example • D = (AND word (before tag)) • Explicit features • Only those descriptions that are ACTIVE in the input are listed • Michael Collins developed kernels over parse trees. • Cumby/Roth developed parameterized kernels over structures. • When is it better to use a kernel vs. using the primal representation?
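To make "only the ACTIVE descriptions are listed" concrete, here is a toy instantiation of the description D = (AND word (before tag)) over a short token sequence. The reading of `before` assumed here (a token's word conjoined with the tag of the token it immediately precedes) and the output string format are guesses made for illustration; the actual feature description language may define them differently.

```python
def active_descriptions(words, tags):
    """List active instantiations of (AND word (before tag)).

    Assumed reading: pair each word with the tag of the next token.
    """
    return [f"(AND word({w}) (before tag({t})))"
            for w, t in zip(words, tags[1:])]

# Tokens and tags as they appear in the structure figure earlier in the deck.
words = ["an", "Iraqi", "intelligence"]
tags = ["DT", "JJ", "NN"]
explicit = active_descriptions(words, tags)
for d in explicit:
    print(d)
```

The one quantified description yields two explicit features for this input; a different sentence would instantiate the same description with different words and tags.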

Semantic Parse (Semantic Role Labeling) • Screenshot from a CCG demo: http://L2R.cs.uiuc.edu/~cogcomp • Semantic parsing reveals several relations in the sentence along with their arguments. • This approach produces a very good semantic parser: F1 ~90% • Easy and fast: ~7 sentences/sec (using Xpress-MP) • Top-ranked system in the CoNLL’05 shared task • Key difference is the inference

This Short Course • There are many topics to cover. • A very active field of research, many ideas are floating around, most of which will not stay. • Rather than covering problems - we will cover some of the main ideas and techniques. • Attempt to abstract away from specific works and understand the main paradigms. • Move towards: Beyond Classification • Knowledge Representation and Inference • Let me know if you have specific interests

This Course • Problems • Language Models • Classification • Context Sensitive Spelling • Inferring Sequential Structure • (Semantic) Parsing • Inference • Summarization? • Entailment? • Representation-less Approaches • Statistics • Paradigms • Generative and Discriminative • Understanding why things work • Classification: Learning Algorithms • Generative and Discriminative algorithms • The ubiquitous Linear Representation • Features and Kernels • Semi-supervised Learning and EM • Structured Prediction and Inference • Generative models; Conditional Models • Inference with Classifiers • Constraint satisfaction • Structural Mappings (translation) • Applications

Questions?

More Detailed Plan (I) • Introduction to Natural Language Learning • Why is it difficult? • Statistics vs. Learning • When do we need learning? • Examples of problems • Statistics and Information Theory • Corpus-based work: data and tasks. • Learning Paradigms • PAC Learning • Bayesian Learning • Examples

More Detailed Plan (II) • Learning Algorithms • Examples • General Paradigm: feature-based representation • Linear functions • On-line algorithms: additive/multiplicative update • Decision Lists; TBL • Memory Based • Bayesian Methods • Naïve Bayes • HMMs (Predictions and model learning) • Max Entropy • LSQ: why do probabilistic algorithms work?

More Detailed Plan (III) • Relaxing Supervision • EM • Semi-supervised learning: co-learning vs. selection • Inference: Sequential Models • HMMs (Predictions and model learning) • HMMs (with Classifiers), PMMs • Constraint Satisfaction / ad hoc methods • Inference: Complex Models • Inference as constrained optimization • Parsing • Structural Mapping • Generative vs. Discriminative • Projects

The Course Plan • Introduction • Why Learning; Learning vs. Statistics • Discriminatory Algorithm • Classification & Inference • Linear Learning Algorithms • Representation-Less Approaches • Statistics & Information Theory • Verb Classifications? MultiWords? • Features • Feature extraction languages • Kernels (over structures) • Learning Paradigms • Theory of Generalization • Generative Models • LSQ (Probabilistic Approaches Work) • Learning Structured Representations • Inference: putting things together • Representation & Inference • Sequential and General structures • Structure Mapping • Power of Generative Models • Modeling • HMM, K-Means, EM; Semi-Sup; ME • Parsing; NE • Sequential Structures • Story Comprehension/Entailment
