Explore the intersection of Machine Learning and Natural Language Processing (NLP) in this seminar. Learn about the classification problem, several ML algorithms, and applications of ML to NLP, including the resolution of lexical, semantic, and structural ambiguity.
Seminar: Statistical NLP
Machine Learning for Natural Language Processing
Lluís Màrquez
TALP Research Center, Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
Girona, June 2003
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP
ML4NLP Machine Learning • There are many general-purpose definitions of Machine Learning (or artificial learning): making a computer automatically acquire some kind of knowledge from a concrete data domain • Learners are computers: we study learning algorithms • Resources are scarce: time, memory, data, etc. • It has (almost) nothing to do with: cognitive science, neuroscience, theory of scientific discovery and research, etc. • Biological plausibility is welcome but not the main goal
ML4NLP Machine Learning • Learning... but what for? • To perform some particular task • To react to environmental inputs • Concept learning from data: modelling concepts underlying data, predicting unseen observations, compacting the knowledge representation, knowledge discovery for expert systems • We will concentrate on: supervised inductive learning for classification = discriminative learning
ML4NLP Machine Learning • A more precise definition: obtaining a description of the concept in some representation language that explains the observations and helps predict new instances of the same distribution • What to read? Machine Learning (Mitchell, 1997)
ML4NLP Empirical NLP • 90's: application of Machine Learning (ML) techniques to NLP problems • Classification problems: lexical and structural ambiguity problems such as word selection (SR, MT), part-of-speech tagging, semantic ambiguity (polysemy), prepositional phrase attachment, reference ambiguity (anaphora), etc. • What to read? Foundations of Statistical Natural Language Processing (Manning & Schütze, 1999)
ML4NLP NLP “classification” problems • Ambiguity is a crucial problem for natural language understanding/processing. Ambiguity Resolution = Classification • He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus)
ML4NLP NLP “classification” problems • Morpho-syntactic ambiguity: Part-of-Speech Tagging • He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) • Several words in the sentence (e.g., shot, hand, back) admit more than one part-of-speech tag (NN, VB, JJ, ...); the task is to select the correct tag for each word in context
ML4NLP NLP “classification” problems • Semantic (lexical) ambiguity: Word Sense Disambiguation • He was shot in the hand as he chased the robbers in the back street (The Wall Street Journal Corpus) • hand is ambiguous between its body-part and clock-part senses; the task is to select the correct sense in context
ML4NLP NLP “classification” problems • Structural (syntactic) ambiguity: PP-attachment disambiguation • He was shot in the hand as he (chased (the robbers)NP (in the back street)PP) (The Wall Street Journal Corpus) • The prepositional phrase in the back street can attach either to the verb chased or to the noun phrase the robbers; the task is to select the correct attachment
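To make "ambiguity resolution = classification" concrete, here is a minimal sketch (not from the original slides) of how a PP-attachment decision can be encoded as a feature vector with a class label; the feature names, the toy predictor, and the second example sentence are invented purely for illustration.

# Hypothetical illustration: casting PP-attachment disambiguation as classification.
# Each decision (verb, object noun, preposition, PP noun) becomes a feature vector x,
# and the class y is the attachment site: "V" (verb) or "N" (noun phrase).
from collections import namedtuple

Example = namedtuple("Example", ["features", "label"])

training_data = [
    # ("chased", "robbers", "in", "street") -> the PP modifies the verb
    Example({"verb": "chased", "noun1": "robbers", "prep": "in", "noun2": "street"}, "V"),
    # ("ate", "pizza", "with", "anchovies") -> the PP modifies the noun (invented example)
    Example({"verb": "ate", "noun1": "pizza", "prep": "with", "noun2": "anchovies"}, "N"),
]

def predict(features):
    """A trivial hand-written hypothesis h: X -> Y standing in for a learned classifier."""
    # A real system would learn this mapping from many labelled examples.
    if features["prep"] == "in":
        return "V"
    return "N"

for ex in training_data:
    print(ex.features, "->", predict(ex.features), "(gold:", ex.label + ")")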
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms in detail • Applications to NLP
Classification Feature Vector Classification (AI perspective) • An instance is a vector x = <x1, ..., xn> whose components, called features (or attributes), are discrete or real-valued. Let X be the space of all possible instances. • Let Y = {y1, ..., ym} be the set of categories (or classes). • The goal is to learn an unknown target function f : X → Y • A training example is an instance x belonging to X, labelled with the correct value of f(x), i.e., a pair <x, f(x)> • Let D be the set of all training examples.
Classification Feature Vector Classification • The hypotheses space, H, is the set of functions h : X → Y that the learner can consider as possible definitions • The goal is to find a function h belonging to H such that for every pair <x, f(x)> belonging to D, h(x) = f(x)
Classification An Example • Decision tree: the root tests COLOR; blue ⇒ negative; red ⇒ test SHAPE: circle ⇒ positive, triangle ⇒ negative • Equivalent rules: (COLOR=red) ∧ (SHAPE=circle) ⇒ positive; otherwise ⇒ negative
Classification An Example • Decision tree: the root tests SIZE; small ⇒ test SHAPE: circle ⇒ positive, triangle ⇒ negative; big ⇒ test COLOR: red ⇒ positive, blue ⇒ negative • Equivalent rules: (SIZE=small) ∧ (SHAPE=circle) ⇒ positive; (SIZE=big) ∧ (COLOR=red) ⇒ positive; otherwise ⇒ negative
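A minimal sketch (not part of the original slides) of this second example tree as code, showing how the same tree can be read either as nested tests or as an ordered rule list; the attribute names and values follow the slide, everything else is illustrative.

# Hypothetical sketch: the SIZE/SHAPE/COLOR decision tree from the example,
# written both as nested tests and as the equivalent rule list.

def tree_classify(x):
    """Classify an instance x (a dict of feature values) by walking the tree."""
    if x["SIZE"] == "small":
        return "positive" if x["SHAPE"] == "circle" else "negative"
    else:  # SIZE == "big"
        return "positive" if x["COLOR"] == "red" else "negative"

# The same tree as an ordered rule list: the first matching rule fires.
RULES = [
    (lambda x: x["SIZE"] == "small" and x["SHAPE"] == "circle", "positive"),
    (lambda x: x["SIZE"] == "big" and x["COLOR"] == "red", "positive"),
    (lambda x: True, "negative"),  # default rule: otherwise => negative
]

def rules_classify(x):
    for condition, label in RULES:
        if condition(x):
            return label

example = {"SIZE": "big", "COLOR": "red", "SHAPE": "triangle"}
assert tree_classify(example) == rules_classify(example) == "positive"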
Classification Some important concepts • Inductive Bias: “Any means that a classification learning system uses to choose between two functions that are both consistent with the training data is called inductive bias” (Mooney & Cardie, 99) • Language bias / search bias (illustrated on the slide with the COLOR/SHAPE decision tree from the earlier example)
Classification Some important concepts • Inductive bias • Training error and generalization error • Generalization ability and overfitting • Batch learning vs. on-line learning • Symbolic vs. statistical learning • Propositional vs. first-order learning
Classification Propositional vs. Relational Learning • Propositional learning: color(red) ∧ shape(circle) ⇒ classA • Relational learning = ILP (induction of logic programs): course(X) ∧ person(Y) ∧ link_to(Y,X) ⇒ instructor_of(X,Y); research_project(X) ∧ person(Z) ∧ link_to(L1,X,Y) ∧ link_to(L2,Y,Z) ∧ neighbour_word_people(L1) ⇒ member_proj(X,Z)
Classification The Classification Setting (Class, Point, Example, Data Set, ...) CoLT/SLT perspective • Input Space: X ⊆ Rn • (binary) Output Space: Y = {+1, -1} • A point, pattern or instance: x ∈ X, x = (x1, x2, …, xn) • Example: (x, y) with x ∈ X, y ∈ Y • Training Set: a set of m examples generated i.i.d. according to an unknown distribution P(x,y): S = {(x1, y1), …, (xm, ym)} ∈ (X × Y)m
Classification The Classification Setting (Learning, Error, ...) • The hypotheses space, H, is the set of functions h : X → Y that the learner can consider as possible definitions. In SVMs they are of the form shown below. • The goal is to find a function h belonging to H such that the expected misclassification error on new examples, also drawn from P(x,y), is minimal (Risk Minimization, RM)
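The formula itself was a figure on the original slide; a standard form of an SVM hypothesis, added here as a reading aid rather than taken from the slide, is the linear threshold function

h(\mathbf{x}) = \operatorname{sign}\bigl( \langle \mathbf{w}, \mathbf{x} \rangle + b \bigr)

where w is the weight vector and b the bias; with kernels, x is replaced by a feature mapping φ(x).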
Classification The Classification Setting (Learning, Error, ...) • Expected error (risk): the probability of misclassifying a new example drawn from P(x,y) (see the formulas below) • Problem: P itself is unknown. Known are the training examples ⇒ an induction principle is needed • Empirical Risk Minimization (ERM): find the function h belonging to H for which the training error (empirical risk) is minimal
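The risk formulas on the original slide were figures; a standard formulation of the two quantities for the 0-1 loss with y ∈ {+1, -1}, added here for completeness, is

R(h) = \int \tfrac{1}{2}\,\bigl| h(\mathbf{x}) - y \bigr| \, dP(\mathbf{x}, y)
\qquad
R_{\mathrm{emp}}(h) = \frac{1}{m} \sum_{i=1}^{m} \tfrac{1}{2}\,\bigl| h(\mathbf{x}_i) - y_i \bigr|

and ERM selects \hat{h} = \arg\min_{h \in H} R_{\mathrm{emp}}(h).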
Classification The Classification Setting (Error, Over(under)fitting, ...) • Low training error ⇒ low true error? • The overfitting dilemma (Müller et al., 2001): an overly complex model can fit the training data perfectly yet generalize poorly (overfitting), while an overly simple one fits neither (underfitting) • Trade-off between training error and complexity • Different learning biases can be used
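This trade-off is often stated as a generalization bound; one standard form from statistical learning theory (not on the original slide) bounds the true risk by the empirical risk plus a capacity term that grows with the complexity of H:

R(h) \;\le\; R_{\mathrm{emp}}(h) \;+\; \sqrt{\frac{d\left(\ln\frac{2m}{d} + 1\right) + \ln\frac{4}{\delta}}{m}}

which holds with probability at least 1 − δ when H has VC dimension d and the training set contains m i.i.d. examples.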
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Applications to NLP
Outline • Machine Learning for NLP • The Classification Problem • Three ML Algorithms • Decision Trees • AdaBoost • Support Vector Machines • Applications to NLP
Algorithms Learning Paradigms • Statistical learning: • HMM, Bayesian Networks, ME, CRF, etc. • Traditional methods from Artificial Intelligence (ML, AI) • Decision trees/lists, exemplar-based learning, rule induction, neural networks, etc. • Methods from Computational Learning Theory (CoLT/SLT) • Winnow, AdaBoost, SVM’s, etc.
Algorithms Learning Paradigms • Classifier combination: • Bagging, Boosting, Randomization, ECOC, Stacking, etc. • Semi-supervised learning: learning from labelled and unlabelled examples • Bootstrapping, EM, Transductive learning (SVM’s, AdaBoost), Co-Training, etc. • etc.
Algorithms Decision Trees • Decision trees are a way to represent rules underlying training data, with hierarchical structures that recursively partition the data. • They have been used by many research communities (Pattern Recognition, Statistics, ML, etc.) for data exploration with some of the following purposes: Description, Classification, and Generalization. • From a machine-learning perspective: Decision Trees are n-ary branching trees that represent classification rules for classifying the objects of a certain domain into a set of mutually exclusive classes
Algorithms Decision Trees • Acquisition: Top-Down Induction of Decision Trees (TDIDT) • Systems: CART (Breiman et al. 84), ID3, C4.5, C5.0 (Quinlan 86,93,98), ASSISTANT, ASSISTANT-R (Cestnik et al. 87) (Kononenko et al. 95) etc.
Algorithms An Example • [Diagram: a generic decision tree whose internal nodes test attributes A1, A2, A3, A5 along branches labelled v1, ..., v7 and whose leaves are classes C1, C2, C3, shown next to the concrete example tree: SIZE small/SHAPE circle ⇒ pos, small/triangle ⇒ neg, big/COLOR red ⇒ pos, big/blue ⇒ neg]
Algorithms Learning Decision Trees • [Diagram: training phase: Training Set + TDIDT ⇒ DT; test phase: DT + Example ⇒ Class]
Algorithms General Induction Algorithm

function TDIDT (X: set-of-examples; A: set-of-features)
  var tree1, tree2: decision-tree; X': set-of-examples; A': set-of-features end-var
  if (stopping_criterion(X)) then
    tree1 := create_leaf_tree(X)
  else
    amax := feature_selection(X, A);
    tree1 := create_tree(X, amax);
    for-all val in values(amax) do
      X' := select_examples(X, amax, val);
      A' := A - {amax};
      tree2 := TDIDT(X', A');
      tree1 := add_branch(tree1, tree2, val)
    end-for
  end-if
  return (tree1)
end-function
Algorithms Feature Selection Criteria • Functions derived from Information Theory: Information Gain, Gain Ratio (Quinlan 86) • Functions derived from Distance Measures: Gini Diversity Index (Breiman et al. 84), RLM (López de Mántaras 91) • Statistically-based: Chi-square test (Sestito & Dillon 94), Symmetrical Tau (Zhou & Dillon 91) • RELIEFF-IG: variant of RELIEFF (Kononenko 94)
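As a concrete illustration of the first criterion (not taken from the slides), here is a minimal sketch of Information Gain, the quantity that the feature_selection step in the TDIDT pseudocode above would maximize; the tiny data set is invented.

# Hypothetical sketch: Information Gain as a feature-selection criterion for TDIDT.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, feature):
    """Entropy reduction obtained by splitting 'examples' on 'feature'.
    Each example is a pair (feature_dict, label)."""
    labels = [y for _, y in examples]
    base = entropy(labels)
    remainder = 0.0
    values = {x[feature] for x, _ in examples}
    for v in values:
        subset = [y for x, y in examples if x[feature] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

data = [
    ({"COLOR": "red", "SHAPE": "circle"}, "pos"),
    ({"COLOR": "red", "SHAPE": "triangle"}, "pos"),
    ({"COLOR": "blue", "SHAPE": "circle"}, "neg"),
    ({"COLOR": "blue", "SHAPE": "triangle"}, "neg"),
]
best = max(["COLOR", "SHAPE"], key=lambda f: information_gain(data, f))
print("best splitting feature:", best)  # COLOR separates the classes perfectly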
Algorithms Extensions of DTs (Murthy 95) • Pruning (pre/post) • Minimizing the effect of the greedy approach: lookahead • Non-linear splits • Combination of multiple models • Incremental learning (on-line) • etc.
Algorithms Decision Trees and NLP • Speech processing (Bahl et al. 89; Bakiri & Dietterich 99) • POS Tagging (Cardie 93, Schmid 94b; Magerman 95; Màrquez & Rodríguez 95,97; Màrquez et al. 00) • Word sense disambiguation (Brown et al. 91; Cardie 93; Mooney 96) • Parsing (Magerman 95,96; Haruno et al. 98,99) • Text categorization (Lewis & Ringuette 94; Weiss et al. 99) • Text summarization (Mani & Bloedorn 98) • Dialogue act tagging (Samuel et al. 98)
Algorithms Decision Trees and NLP • Noun phrase coreference (Aone & Bennett 95; McCarthy & Lehnert 95) • Discourse analysis in information extraction (Soderland & Lehnert 94) • Cue phrase identification in text and speech (Litman 94; Siegel & McKeown 94) • Verb classification in Machine Translation (Tanaka 96; Siegel 97)
Algorithms Decision Trees: pros & cons • Advantages • Acquire symbolic knowledge in an understandable way • Very well studied ML algorithms and variants • Can be easily translated into rules • Availability of off-the-shelf software: C4.5, C5.0, etc. • Can be easily integrated into an ensemble
Algorithms Decision Trees: pros & cons • Drawbacks • Computationally expensive when scaling to large natural language domains: many training examples, features, etc. • Data sparseness and data fragmentation: the problem of small disjuncts ⇒ probability estimation • DTs are a high-variance (unstable) model • Tendency to overfit the training data: pruning is necessary • Require quite a lot of effort in tuning the model
Algorithms Boosting algorithms • Idea: “to combine many simple and moderately accurate hypotheses (weak classifiers) into a single and highly accurate classifier” • AdaBoost (Freund & Schapire 95) has been studied extensively, both theoretically and empirically • Many other variants and extensions (1997-2003): http://www.lsi.upc.es/~lluism/seminari/ml&nlp.html
Algorithms AdaBoost: general scheme • [Diagram: TRAINING: a probability distribution D1, ..., DT over the training set is updated from round to round; at each round t a weak learner is run on the weighted sample TSt and outputs a weak hypothesis ht with weight αt. TEST: the final classifier F(h1, h2, ..., hT) is a linear combination of the weak hypotheses]
Algorithms AdaBoost: algorithm (Freund & Schapire 97)
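The algorithm itself was a figure on the original slide; the following minimal Python sketch of binary, discrete AdaBoost is offered as a stand-in, following the standard Freund & Schapire formulation, with axis-parallel decision stumps as the weak learner (matching the vertical/horizontal hyperplanes used in the example on the next slides). All names and the toy data are illustrative.

# Minimal sketch of binary AdaBoost with decision stumps as weak hypotheses.
# Standard formulation; not copied from the original slide.
import math

def stump_factory(X):
    """Enumerate axis-parallel stumps (feature, threshold, sign) for the points in X."""
    stumps = []
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for sign in (+1, -1):
                stumps.append((f, thr, sign))
    return stumps

def stump_predict(stump, x):
    f, thr, sign = stump
    return sign if x[f] <= thr else -sign

def adaboost(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                      # initial uniform distribution over examples
    ensemble = []                          # list of (alpha_t, h_t)
    for _ in range(T):
        # 1. pick the weak hypothesis with the smallest weighted error under D
        best, best_err = None, float("inf")
        for h in stump_factory(X):
            err = sum(D[i] for i in range(m) if stump_predict(h, X[i]) != y[i])
            if err < best_err:
                best, best_err = h, err
        if best_err >= 0.5:                # weak learner no better than chance: stop
            break
        eps = max(best_err, 1e-12)
        alpha = 0.5 * math.log((1 - eps) / eps)   # weight of the weak hypothesis
        ensemble.append((alpha, best))
        # 2. re-weight the examples: misclassified ones gain weight after normalization
        D = [D[i] * math.exp(-alpha * y[i] * stump_predict(best, X[i])) for i in range(m)]
        Z = sum(D)
        D = [d / Z for d in D]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(h, x) for alpha, h in ensemble)
    return 1 if score >= 0 else -1

# Toy usage with invented 2-D points and labels
X = [(1, 1), (2, 4), (4, 2), (5, 5)]
y = [1, 1, -1, -1]
model = adaboost(X, y, T=5)
print([predict(model, x) for x in X])      # should reproduce the training labels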
Algorithms AdaBoost: example Weak hypotheses = vertical/horizontal hyperplanes
Algorithms AdaBoost: round 1
Algorithms AdaBoost: round 2
Algorithms AdaBoost: round 3