Explore the intersection of Machine Learning and Natural Language Processing (NLP) in this seminar. Learn about the classification problem, several ML algorithms, and applications of ML to NLP, including the resolution of lexical, semantic, and structural ambiguity.
Seminar: Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez
TALP Research Center, Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP
ML4NLP Machine Learning • There are many general-purpose definitions of Machine Learning (or artificial learning): making a computer automatically acquire some kind of knowledge from a concrete data domain • Learners are computers: we study learning algorithms • Resources are scarce: time, memory, data, etc. • It has (almost) nothing to do with: cognitive science, neuroscience, theory of scientific discovery and research, etc. • Biological plausibility is welcome but not the main goal
ML4NLP Machine Learning • Learning... but what for? • To perform some particular task • To react to environmental inputs • Concept learning from data: modelling concepts underlying data, predicting unseen observations, compacting the knowledge representation, knowledge discovery for expert systems • We will concentrate on: supervised inductive learning for classification = discriminative learning
ML4NLP Machine Learning • A more precise definition: obtaining a description of the concept in some representation language that explains the observations and helps predict new instances of the same distribution • What to read? Machine Learning (Mitchell, 1997)
ML4NLP Empirical NLP • 90's: application of Machine Learning (ML) techniques to NLP problems • Classification problems: lexical and structural ambiguity problems such as word selection (SR, MT), part-of-speech tagging, semantic ambiguity (polysemy), prepositional phrase attachment, reference ambiguity (anaphora), etc. • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)
ML4NLP NLP “classification” problems • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification • He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus)
ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity: Part-of-Speech Tagging • He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) • Several words in the sentence (e.g., shot, hand, back) admit more than one part-of-speech tag (NN, VB, JJ, ...); the task is to select the correct tag for each word in context
ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity: Word Sense Disambiguation • He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) • hand is ambiguous between its body-part and clock-part senses; the task is to select the correct sense in context
ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity: PP-attachment disambiguation • He was shot in the hand as he (chased (the robbers)NP (in the back street)PP) (The Wall Street Journal Corpus) • The prepositional phrase in the back street can attach either to the verb chased or to the noun phrase the robbers; the task is to select the correct attachment
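To make "ambiguity resolution = classification" concrete, here is a minimal sketch (not from the original slides) of how a PP-attachment decision can be encoded as a feature vector with a class label; the feature names, the toy predictor, and the second example sentence are invented purely for illustration.

# Hypothetical illustration: casting PP-attachment disambiguation as classification.
# Each decision (verb, object noun, preposition, PP noun) becomes a feature vector x,
# and the class y is the attachment site: "V" (verb) or "N" (noun phrase).
from collections import namedtuple

Example = namedtuple("Example", ["features", "label"])

training_data = [
    # ("chased", "robbers", "in", "street") -> the PP modifies the verb
    Example({"verb": "chased", "noun1": "robbers", "prep": "in", "noun2": "street"}, "V"),
    # ("ate", "pizza", "with", "anchovies") -> the PP modifies the noun (invented example)
    Example({"verb": "ate", "noun1": "pizza", "prep": "with", "noun2": "anchovies"}, "N"),
]

def predict(features):
    """A trivial hand-written hypothesis h: X -> Y standing in for a learned classifier."""
    # A real system would learn this mapping from many labelled examples.
    if features["prep"] == "in":
        return "V"
    return "N"

for ex in training_data:
    print(ex.features, "->", predict(ex.features), "(gold:", ex.label + ")")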
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms in detail • Applications to NLP
Classification Feature Vector Classification (AI perspective) • An instance is a vector x = <x1, ..., xn> whose components, called features (or attributes), are discrete or real-valued. Let X be the space of all possible instances. • Let Y = {y1, ..., ym} be the set of categories (or classes). • The goal is to learn an unknown target function f : X → Y • A training example is an instance x belonging to X, labelled with the correct value of f(x), i.e., a pair <x, f(x)> • Let D be the set of all training examples.
Classification Feature Vector Classification • The hypotheses space, H, is the set of functions h : X → Y that the learner can consider as possible definitions • The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x)
Classification An Example • Decision tree: the root tests COLOR; blue ⇒ negative; red ⇒ test SHAPE: circle ⇒ positive, triangle ⇒ negative • Equivalent rules: (COLOR=red) ∧ (SHAPE=circle) ⇒ positive; otherwise ⇒ negative
Classification An Example • Decision tree: the root tests SIZE; small ⇒ test SHAPE: circle ⇒ positive, triangle ⇒ negative; big ⇒ test COLOR: red ⇒ positive, blue ⇒ negative • Equivalent rules: (SIZE=small) ∧ (SHAPE=circle) ⇒ positive; (SIZE=big) ∧ (COLOR=red) ⇒ positive; otherwise ⇒ negative
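A minimal sketch (not part of the original slides) of this second example tree as code, showing how the same tree can be read either as nested tests or as an ordered rule list; the attribute names and values follow the slide, everything else is illustrative.

# Hypothetical sketch: the SIZE/SHAPE/COLOR decision tree from the example,
# written both as nested tests and as the equivalent rule list.

def tree_classify(x):
    """Classify an instance x (a dict of feature values) by walking the tree."""
    if x["SIZE"] == "small":
        return "positive" if x["SHAPE"] == "circle" else "negative"
    else:  # SIZE == "big"
        return "positive" if x["COLOR"] == "red" else "negative"

# The same tree as an ordered rule list: the first matching rule fires.
RULES = [
    (lambda x: x["SIZE"] == "small" and x["SHAPE"] == "circle", "positive"),
    (lambda x: x["SIZE"] == "big" and x["COLOR"] == "red", "positive"),
    (lambda x: True, "negative"),  # default rule: otherwise => negative
]

def rules_classify(x):
    for condition, label in RULES:
        if condition(x):
            return label

example = {"SIZE": "big", "COLOR": "red", "SHAPE": "triangle"}
assert tree_classify(example) == rules_classify(example) == "positive"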
Classification Some important concepts • Inductive Bias: “Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99) • Language bias / search bias (illustrated on the slide with the COLOR/SHAPE decision tree from the earlier example)
Classification Some important concepts • Inductive bias • Training error and generalization error • Generalization ability and overfitting • Batch learning vs. on-line learning • Symbolic vs. statistical learning • Propositional vs. first-order learning
Classification Propositional vs. Relational Learning • Propositional learning: color(red) ∧ shape(circle) ⇒ classA • Relational learning = ILP (induction of logic programs): course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y); research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z)
Classification The Classification Setting (Class, Point, Example, Data Set, ...) CoLT/SLT perspective • Input Space: X ⊆ Rn • (binary) Output Space: Y = {+1, -1} • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn) • Example: (x, y) with x ∈ X, y ∈ Y • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)m
Classification The Classification Setting (Learning, Error, ...) • The hypotheses space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form shown below. • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM)
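The formula itself was a figure on the original slide; a standard form of an SVM hypothesis, added here as a reading aid rather than taken from the slide, is the linear threshold function

h(\mathbf{x}) = \operatorname{sign}\bigl( \langle \mathbf{w}, \mathbf{x} \rangle + b \bigr)

where w is the weight vector and b the bias; with kernels, x is replaced by a feature mapping φ(x).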
Classification The Classification Setting (Learning, Error, ...) • Expected error (risk): the probability of misclassifying a new example drawn from P(x,y) (see the formulas below) • Problem: P itself is unknown. Known are the training examples ⇒ an induction principle is needed • Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) is minimal
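The risk formulas on the original slide were figures; a standard formulation of the two quantities for the 0-1 loss with y ∈ {+1, -1}, added here for completeness, is

R(h) = \int \tfrac{1}{2}\,\bigl| h(\mathbf{x}) - y \bigr| \, dP(\mathbf{x}, y)
\qquad
R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \tfrac{1}{2}\,\bigl| h(\mathbf{x}_i) - y_i \bigr|

and ERM selects \hat{h} = \arg\min_{h \in H} R_{\mathrm{emp}}(h).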
Classification The Classification Setting (Error, Over(under)fitting, ...) • Low training error ⇒ low true error? • The overfitting dilemma (Müller et al., 2001): an overly complex model can fit the training data perfectly yet generalize poorly (overfitting), while an overly simple one fits neither (underfitting) • Trade-off between training error and complexity • Different learning biases can be used
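This trade-off is often stated as a generalization bound; one standard form from statistical learning theory (not on the original slide) bounds the true risk by the empirical risk plus a capacity term that grows with the complexity of H:

R(h) \;\le\; R_{\mathrm{emp}}(h) \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}

which holds with probability at least 1 − δ when H has VC dimension d and the training set contains m i.i.d. examples.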
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Decision Trees • AdaBoost • Support Vector Machines • Applications to NLP
Algorithms Learning Paradigms • Statistical learning: • HMM, Bayesian Networks, ME, CRF, etc. • Traditional methods from Artificial Intelligence (ML, AI) • Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc. • Methods from Computational Learning Theory (CoLT/SLT) • Winnow, AdaBoost, SVM’s, etc.
Algorithms Learning Paradigms • Classifier combination: • Bagging, Boosting, Randomization, ECOC, Stacking, etc. • Semi-supervised learning: learning from labelled and unlabelled examples • Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc. • etc.
Algorithms Decision Trees • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes
Algorithms Decision Trees • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95) etc.
Algorithms An Example • [Diagram: a generic decision tree whose internal nodes test attributes A1, A2, A3, A5 along branches labelled v1, ..., v7 and whose leaves are classes C1, C2, C3, shown next to the concrete example tree: SIZE small/SHAPE circle ⇒ pos, small/triangle ⇒ neg, big/COLOR red ⇒ pos, big/blue ⇒ neg]
Algorithms Learning Decision Trees • [Diagram: training phase: Training Set + TDIDT ⇒ DT; test phase: DT + Example ⇒ Class]
Algorithms General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree; X': set-of-examples; A': set-of-features end-var
  if (stopping_criterion(X)) then
    tree1 := create_leaf_tree(X)
  else
    amax := feature_selection(X, A);
    tree1 := create_tree(X, amax);
    for-all val in values(amax) do
      X' := select_examples(X, amax, val);
      A' := A - {amax};
      tree2 := TDIDT(X', A');
      tree1 := add_branch(tree1, tree2, val)
    end-for
  end-if
  return (tree1)
end-function
Algorithms Feature Selection Criteria • Functions derived from Information Theory: Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures: Gini Diversity Index (Breiman et al. 84), RLM (López de Mántaras 91) • Statistically-based: Chi-square test (Sestito & Dillon 94), Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94)
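As a concrete illustration of the first criterion (not taken from the slides), here is a minimal sketch of Information Gain, the quantity that the feature_selection step in the TDIDT pseudocode above would maximize; the tiny data set is invented.

# Hypothetical sketch: Information Gain as a feature-selection criterion for TDIDT.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting 'examples' on 'feature'.
    Each example is a pair (feature_dict, label)."""
    labels = [y for _, y in examples]
    base = entropy(labels)
    remainder = 0.0
    values = {x[feature] for x, _ in examples}
    for v in values:
        subset = [y for x, y in examples if x[feature] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

data = [
    ({"COLOR": "red", "SHAPE": "circle"}, "pos"),
    ({"COLOR": "red", "SHAPE": "triangle"}, "pos"),
    ({"COLOR": "blue", "SHAPE": "circle"}, "neg"),
    ({"COLOR": "blue", "SHAPE": "triangle"}, "neg"),
]
best = max(["COLOR", "SHAPE"], key=lambda f: information_gain(data, f))
print("best splitting feature:", best)  # COLOR separates the classes perfectly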
Algorithms Extensions of DTs (Murthy 95) • Pruning (pre/post) • Minimizing the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • Incremental learning (on-line) • etc.
Algorithms Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98)
Algorithms Decision Trees and NLP • Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97)
Algorithms Decision Trees: pros & cons • Advantages • Acquire symbolic knowledge in an understandable way • Very well studied ML algorithms and variants • Can be easily translated into rules • Availability of off-the-shelf software: C4.5, C5.0, etc. • Can be easily integrated into an ensemble
Algorithms Decision Trees: pros & cons • Drawbacks • Computationally expensive when scaling to large natural language domains: many training examples, features, etc. • Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation • DTs are a high-variance (unstable) model • Tendency to overfit the training data: pruning is necessary • Require quite a lot of effort in tuning the model
Algorithms Boosting algorithms • Idea: “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier” • AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically • Many other variants and extensions (1997-2003): http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
Algorithms AdaBoost: general scheme • [Diagram: TRAINING: a probability distribution D1, ..., DT over the training set is updated from round to round; at each round t a weak learner is run on the weighted sample TSt and outputs a weak hypothesis ht with weight αt. TEST: the final classifier F(h1, h2, ..., hT) is a linear combination of the weak hypotheses]
Algorithms AdaBoost: algorithm (Freund & Schapire 97)
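The algorithm itself was a figure on the original slide; the following minimal Python sketch of binary, discrete AdaBoost is offered as a stand-in, following the standard Freund & Schapire formulation, with axis-parallel decision stumps as the weak learner (matching the vertical/horizontal hyperplanes used in the example on the next slides). All names and the toy data are illustrative.

# Minimal sketch of binary AdaBoost with decision stumps as weak hypotheses.
# Standard formulation; not copied from the original slide.
import math

def stump_factory(X):
    """Enumerate axis-parallel stumps (feature, threshold, sign) for the points in X."""
    stumps = []
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (+1, -1):
                stumps.append((f, thr, sign))
    return stumps

def stump_predict(stump, x):
    f, thr, sign = stump
    return sign if x[f] <= thr else -sign

def adaboost(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                      # initial uniform distribution over examples
    ensemble = []                          # list of (alpha_t, h_t)
    for _ in range(T):
        # 1. pick the weak hypothesis with the smallest weighted error under D
        best, best_err = None, float("inf")
        for h in stump_factory(X):
            err = sum(D[i] for i in range(m) if stump_predict(h, X[i]) != y[i])
            if err < best_err:
                best, best_err = h, err
        if best_err >= 0.5:                # weak learner no better than chance: stop
            break
        eps = max(best_err, 1e-12)
        alpha = 0.5 * math.log((1 - eps) / eps)   # weight of the weak hypothesis
        ensemble.append((alpha, best))
        # 2. re-weight the examples: misclassified ones gain weight after normalization
        D = [D[i] * math.exp(-alpha * y[i] * stump_predict(best, X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy usage with invented 2-D points and labels
X = [(1, 1), (2, 4), (4, 2), (5, 5)]
y = [1, 1, -1, -1]
model = adaboost(X, y, T=5)
print([predict(model, x) for x in X])      # should reproduce the training labels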
Algorithms AdaBoost: example Weak hypotheses = vertical/horizontal hyperplanes
Algorithms AdaBoost: round 1
Algorithms AdaBoost: round 2
Algorithms AdaBoost: round 3