Machine Learning in Natural Language • No lecture on Thursday. Instead: Monday, 4pm, 1404SC, Mark Johnson lectures on: Bayesian Models of Language Acquisition
Machine Learning in Natural Language: Features and Kernels • The idea of kernels • Kernel Perceptron • Structured Kernels • Tree and Graph Kernels • Lessons • Multi-class classification
Embedding • Example: disambiguating whether vs. weather. • Blowing up the feature space can be done explicitly (generate expressive features) or implicitly (use kernels). • In the new (embedded) space the discriminator is functionally simpler.
Kernel Based Methods • A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector. • Computing the weight vector is done in the original space. • Notice: this pertains only to efficiency. • Generalization is still relative to the real dimensionality. • This is the main trick in SVMs (the algorithm itself is different), although many applications actually use linear kernels.
Kernel Based Methods • Let I be the set {t1, t2, t3, …} of monomials (conjunctions) over the feature space x1, x2, …, xn. • Then we can write a linear function over this new feature space: f(x) = Th_θ( Σ_{i∈I} w_i t_i(x) ).
Kernel Based Methods • Great increase in expressivity. • Can run Perceptron, Winnow, or Logistic Regression, but the convergence bound may suffer exponential growth. • An exponential number of monomials is true in each example. • Also, we will have to keep many weights.
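A tiny sketch of the blow-up just described, for Boolean features: the monomial space has 3^n conjunctions (each variable appears positively, negatively, or not at all), and every example satisfies 2^n of them. The encoding and function names below are my own illustration, not part of the original lecture.

from itertools import product

def monomials(n):
    # a monomial is a tuple in {+1, -1, 0}^n (0 = variable absent)
    return list(product((1, -1, 0), repeat=n))

def satisfies(x, mono):
    # every present literal must match the example's polarity
    return all(m == 0 or (m == 1) == bool(xi) for xi, m in zip(x, mono))

n = 3
space = monomials(n)
x = (1, 0, 1)
active = [m for m in space if satisfies(x, m)]
print(len(space), len(active))   # 27 monomials in total (3^3), 8 active (2^3)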
The Kernel Trick (1) • Consider the value of w_i used in the prediction. • Each previous mistake, on example z, makes an additive contribution of +/-1 to w_i, iff t_i(z) = 1. • The value of w_i is determined by the number of mistakes on which t_i() was satisfied.
The Kernel Trick (2) • P – set of examples on which we Promoted • D – set of examples on which we Demoted • M = P ∪ D • The prediction can therefore be written as f(x) = Th_θ( Σ_{i∈I} [ Σ_{z∈P, t_i(z)=1} 1 − Σ_{z∈D, t_i(z)=1} 1 ] t_i(x) ).
The Kernel Trick (3) • P – set of examples on which we Promoted • D – set of examples on which we Demoted • M = P ∪ D • Where S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D. Reordering the sums: f(x) = Th_θ( Σ_{z∈M} S(z) Σ_{i∈I} t_i(z) t_i(x) ).
The Kernel Trick (4) • S(y) = 1 if y ∈ P and S(y) = -1 if y ∈ D. • A mistake on z contributes the value +/-1 to all monomials satisfied by z. The total contribution of z to the sum is equal to the number of monomials that are satisfied by both x and z. • Define a dot product in the t-space: K(x,z) = Σ_{i∈I} t_i(z) t_i(x). • We get the standard notation: f(x) = Th_θ( Σ_{z∈M} S(z) K(x,z) ).
Kernel Based Methods • What does this representation give us? • We can view this kernel as the distance between x and z measured in the t-space. • But K(x,z) can be computed in the original space, without explicitly writing the t-representation of x and z.
Kernel Based Methods • Consider the space of all 3^n monomials (allowing both positive and negative literals). • Then K(x,z) = 2^{same(x,z)}, where same(x,z) is the number of features that have the same value for both x and z. • Example: take n=2, x=(0,0), z=(0,1); then same(x,z)=1 and K(x,z)=2 (the monomials satisfied by both are the empty monomial and ¬x1). • Proof: let k = same(x,z); for each of the k agreeing features we can either (1) include its literal, with the right polarity, in the monomial, or (2) not include it at all, giving 2^k monomials satisfied by both. • Other kernels can be used.
Implementation • Simply run Perceptron in an on-line mode, but keep track of the set M. • Keeping the set M allows us to keep track of S(z). • Rather than remembering the weight vector w, remember the set M (P and D) – all those examples on which we made mistakes. • This is the Dual Representation.
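A minimal sketch of this dual-form Perceptron, using the 2^same(x,z) monomial kernel from the previous slide. The function names (kernel_perceptron, monomial_kernel), the toy target concept, and the epoch count are my own illustrative assumptions, not part of the original lecture.

import numpy as np

def monomial_kernel(x, z):
    """K(x, z) = 2^same(x, z): the number of monomials (over positive and
    negative literals) satisfied by both Boolean vectors x and z."""
    return 2.0 ** int(np.sum(np.asarray(x) == np.asarray(z)))

def kernel_perceptron(X, y, kernel=monomial_kernel, epochs=50):
    """Dual-form Perceptron: instead of a weight vector over the huge
    monomial space, keep the mistake set M with signs S(z)."""
    M = []                                    # list of (example z, S(z)) pairs
    for _ in range(epochs):
        for x, label in zip(X, y):
            # f(x) = sign( sum_{z in M} S(z) K(x, z) )
            score = sum(s * kernel(x, z) for z, s in M)
            if (1 if score > 0 else -1) != label:
                M.append((x, label))          # promote (+1) or demote (-1)
    return M

# Toy usage: the target concept is the monomial x1 AND (NOT x2)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, +1, -1]
M = kernel_perceptron(X, y)
for x in X:
    score = sum(s * monomial_kernel(x, z) for z, s in M)
    print(x, +1 if score > 0 else -1)         # recovers the labels above

Note how only the mistake set M is ever stored: the weight vector over the exponential monomial space is never written down.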
Summary – Kernel Based Methods I • A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector. • Computing the weight vector can still be done in the original feature space. • Notice: this pertains only to efficiency: the classifier is identical to the one you get by blowing up the feature space. • Generalization is still relative to the real dimensionality. • This is the main trick in SVMs (the algorithm itself is different), although most applications actually use linear kernels.
Summary – Kernel Trick • Separating hyperplanes (produced by Perceptron, SVM) can be computed in terms of dot products over a feature based representation of examples. • We want to define a dot product in a high dimensional space. • Given two examples x = (x1, x2, …, xn) and y = (y1, y2, …, yn) we want to map them to a high dimensional space [example: quadratic]: φ(x1, x2, …, xn) = (x1, …, xn, x1², …, xn², x1·x2, …, xn-1·xn) and φ(y1, y2, …, yn) = (y1, …, yn, y1², …, yn², y1·y2, …, yn-1·yn), and compute the dot product A = φ(x)·φ(y) [takes O(n²) time]. • Instead, in the original space, compute B = f(x·y) = [1 + (x1, x2, …, xn)·(y1, y2, …, yn)]². • Theorem: A = B. • Coefficients do not really matter; this can be done for other functions as well.
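A quick numeric check of this claim. This is a sketch: the explicit map phi below includes the sqrt(2) coefficients that make the equality exact rather than up to coefficients, and all names are my own choices.

import numpy as np

def phi(x):
    """Explicit quadratic embedding (with the sqrt(2) coefficients that make
    the dot product match the kernel exactly)."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]            # linear terms
    feats += [xi * xi for xi in x]                    # squares
    feats += [np.sqrt(2) * x[i] * x[j]                # cross terms
              for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

def poly2_kernel(x, y):
    """B = (1 + x . y)^2, computed entirely in the original space."""
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
A = np.dot(phi(x), phi(y))       # dot product in the embedded space, ~n^2 work
B = poly2_kernel(x, y)           # same value, O(n) work
print(A, B)                      # 30.25 30.25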
Efficiency-Generalization Tradeoff • There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier. • For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. • In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown up space is more efficient than using kernels. • Next: More Complicated Kernels
Structured Input [Figure: a sentence represented as a labeled graph, with nodes carrying Word=, POS=, IS-A= attributes] • S = John will join the board as a director • [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?) Knowledge Representation
Learning From Structured Input • We want to extract features from structured domain elements; their internal (hierarchical) structure should be encoded. • A feature is a mapping from the instance space to {0,1} or [0,1]. • With an appropriate representation language it is possible to represent expressive features that constitute an infinite dimensional space [FEX]. • Learning can be done in the infinite attribute domain. • What does it mean to extract features? • Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements). • Computationally: some kind of graph matching process. • Challenge: • Provide the expressivity necessary to deal with large scale and highly structured domains. • Meet the strong tractability requirements for these tasks.
Example • Only those descriptions that are ACTIVE in the input are listed. • Michael Collins developed kernels over parse trees. • Cumby/Roth developed parameterized kernels over structures. • When is it better to use a kernel vs. using the primal representation? D = (AND word (before tag)) Explicit features
Overview – Goals (Cumby & Roth 2003) • Applying kernel learning methods to structured domains. • Develop a unified formalism for structured kernels (Collins & Duffy, Gaertner & Lloyd, Haussler): a flexible language that measures distance between structures with respect to a given 'substructure'. • Examine complexity & generalization across different feature sets and learners: when does each type of feature set perform better, and with which learners? • Exemplify with experiments from bioinformatics & NLP: mutagenesis, named-entity prediction.
Feature Description Logic • A flexible knowledge representation for feature extraction from structured data. • Domain elements are represented as labeled graphs. • Concept graphs correspond to FDL expressions. • FDL is formed from an alphabet of attribute, value, and role symbols. • Well defined syntax and equivalent semantics. • E.g., descriptions are defined inductively with sensors as primitives. • Sensor: a basic description – a term of the form a(v), or a, where a = attribute symbol and v = value symbol (ground sensor). • An existential sensor a describes an object that has some value for attribute a. • AND clauses, and (role D) clauses for relations between objects. • Expressive and efficient feature extraction. Knowledge Representation
Example (Cont.) • Features; Feature Generation Functions; extensions; Subsumption… (see paper). Basically: • Only those descriptions that are ACTIVE in the input are listed. • The language is expressive enough to generate linguistically interesting features such as agreements, etc. D = (AND word (before tag)) {Dθ} = {(AND word(the) (before tag(N))), (AND word(dog) (before tag(V))), (AND word(ran) (before tag(ADV))), (AND word(very) (before tag(ADJ)))} Explicit features
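A minimal sketch of generating the active features listed above for D = (AND word (before tag)). The token/tag pairs and the function name are illustrative assumptions, not the actual FDL/FEX implementation.

def active_features(tokens, tags):
    """For each position, emit the feature that conjoins the word with the
    tag of the word immediately after it (the 'before' relation)."""
    feats = []
    for i in range(len(tokens) - 1):
        feats.append(f"(AND word({tokens[i]}) (before tag({tags[i + 1]})))")
    return feats

sentence = ["the", "dog", "ran", "very", "fast"]
tags     = ["DET", "N",   "V",   "ADV",  "ADJ"]
for f in active_features(sentence, tags):
    print(f)
# Only features ACTIVE in this input are generated: the underlying feature
# space can be huge, but each example touches only a handful of features.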
Kernels • It is possible to define FDL-based kernels for structured data. • When using linear classifiers it is important to enhance the set of features to gain expressivity. • A common way: blow up the feature space by generating functions of primitive features. • For some algorithms – SVM, Perceptron – kernel functions can be used to expand the feature space while still working in the original space. • Is it worth doing in structured domains? Answers are not clear so far. • Computationally: yes, when we simulate a huge space. • Generalization: not always [Khardon, Roth, Servedio, NIPS'01; Ben-David et al.] Kernels
Kernels in Structured Domains • We define a kernel family K parameterized by FDL descriptions. • The definition is recursive on the definition of D [sensor, existential sensor; role description; AND]. • Key: many previous structured kernels considered all substructures (e.g., Collins & Duffy 02, Tree Kernels); this is analogous to an exponential feature space and risks overfitting. • Generalization issues & computation issues [if the # of examples is large]: if the feature space is explicitly expanded, one can use algorithms such as Winnow (SNoW) [complexity and experimental results]. Kernels
FDL Kernel Definition • Kernel family K parameterized by feature type descriptions. For a description D and nodes n1, n2: • If D is a sensor s(v) that is a label of both n1 and n2, then K_D(n1,n2) = 1. • If D is an existential sensor s and sensor descriptions s(v1), s(v2), …, s(vj) are labels of both n1 and n2, then K_D(n1,n2) = j. • If D is a role description (r D'), then K_D(n1,n2) = Σ K_D'(n1', n2'), summing over those nodes n1', n2' that have an r-labeled edge from n1, n2. • If D is a description (AND D1 D2 ... Dn) with l_i repetitions of any D_i, then K_D(n1,n2) is the product of the component kernels K_Di(n1,n2), adjusted for the l_i repetitions. Kernels
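A simplified sketch of the recursive case structure above, on a toy labeled graph. The Node class, the tuple encoding of descriptions, and the plain product used for the AND case are my own illustrative assumptions, not the actual FDL kernel implementation.

class Node:
    def __init__(self, labels, edges=None):
        self.labels = labels            # e.g. {"word": "dog", "tag": "N"}
        self.edges = edges or {}        # e.g. {"before": [next_node]}

def K(D, n1, n2):
    kind = D[0]
    if kind == "sensor":                # ("sensor", a, v): 1 if a(v) labels both nodes
        _, a, v = D
        return 1.0 if n1.labels.get(a) == v and n2.labels.get(a) == v else 0.0
    if kind == "exists":                # ("exists", a): do the nodes share a value for a?
        _, a = D
        v1, v2 = n1.labels.get(a), n2.labels.get(a)
        return 1.0 if v1 is not None and v1 == v2 else 0.0
    if kind == "role":                  # ("role", r, D2): sum over pairs of r-neighbors
        _, r, D2 = D
        return sum(K(D2, m1, m2)
                   for m1 in n1.edges.get(r, [])
                   for m2 in n2.edges.get(r, []))
    if kind == "and":                   # ("and", D1, ..., Dk): product of the parts
        result = 1.0
        for Di in D[1:]:
            result *= K(Di, n1, n2)
        return result
    raise ValueError(f"unknown description {D!r}")

# Toy usage: D = (AND word (before tag)) on two "dog" nodes whose following
# words share the tag V.
ran1, ran2 = Node({"word": "ran", "tag": "V"}), Node({"word": "runs", "tag": "V"})
dog1 = Node({"word": "dog", "tag": "N"}, {"before": [ran1]})
dog2 = Node({"word": "dog", "tag": "N"}, {"before": [ran2]})
D = ("and", ("exists", "word"), ("role", "before", ("exists", "tag")))
print(K(D, dog1, dog2))   # 1.0: the words match and the following tags match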
Kernel Example • D = (AND word (before word)) • G1: The dog ran very fast • G2: The dog ran quickly • The final output is 2, since there are 2 matching collocations (the-dog and dog-ran). • Can simulate Boolean kernels, as seen in Khardon, Roth et al. Kernels
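A quick way to reproduce that count. This is a sketch that just counts shared word bigrams ("collocations") rather than going through the full FDL machinery; the function name is my own.

from collections import Counter

def bigram_kernel(s1, s2):
    """Count collocations (word bigrams) shared by the two sentences,
    weighting repeated bigrams by the product of their counts."""
    b1, b2 = Counter(zip(s1, s1[1:])), Counter(zip(s2, s2[1:]))
    return sum(c * b2[b] for b, c in b1.items())

g1 = "the dog ran very fast".split()
g2 = "the dog ran quickly".split()
print(bigram_kernel(g1, g2))   # 2: the-dog and dog-ran match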
Complexity & Generalization • How does this compare in complexity and generalization to other kernels for structured data? • For m examples, with average example size g and time t1 to evaluate the kernel, kernel Perceptron takes O(m²g²t1). • If extracting a feature explicitly takes t2, Perceptron takes O(mgt2). • Most kernels that simulate a well defined feature space have t1 << t2. • By restricting the size of the expanded feature space we avoid overfitting – even SVM suffers under many irrelevant features (Weston). • Margin argument: the margin goes down when you have more features. • Given a linearly separable set of points S = {x1, …, xm} ⊂ R^n with separator w ∈ R^n, embed S into an n' > n dimensional space by adding zero-mean random noise e to the additional n' − n dimensions, such that w' = (w, 0) ∈ R^n' still separates S. • The margin with respect to w' is unchanged, but the norm of the (noisy) examples grows, so the relative margin – and with it the mistake/generalization bound – gets worse. Analysis
Experiments • Serve as comparison: our features with kernel Perceptron, normal Winnow, and all-subtrees expanded features. • Bioinformatics experiment in mutagenesis prediction: 188 compounds with atom-bond data, binary prediction; 10-fold cross validation with 12 training runs. • NLP experiment in classifying detected NEs: 4700 training and 1500 test phrases from MUC-7; person, location, & organization. • Trained and tested with kernel Perceptron and Winnow (SNoW) classifiers, using the FDL kernel & the respective explicit features; also an all-subtrees kernel based on Collins & Duffy's work. [Figure: mutagenesis concept graph; features simulated with the all-subtrees kernel]
Discussion [Table: micro-averaged accuracy] • We have a kernel that simulates the features obtained with FDL. • But quadratic training time means it is cheaper to extract the features explicitly and learn in the primal than to run kernel Perceptron. • SVM could take (slightly) even longer, but maybe perform better. • But restricted features might work better than the larger spaces simulated by other kernels. • Can we improve on the benefits of useful features? Compile examples together? More sophisticated kernels than the matching kernel? • Still provides a metric for similarity based approaches.
Conclusion • Kernels for learning from structured data are an interesting idea. • Different kernels may expand/restrict the hypothesis space in useful ways; we need to know the benefits and hazards. • To justify these methods we must embed in a space much larger than the training set size, which can decrease the margin. • Expressive knowledge representations can be used to create features explicitly or in implicit kernel spaces. • The data representation could allow us to plug in different base kernels to replace the matching kernel. • A parameterized kernel allows us to direct the way the feature space is blown up, so as to encode background knowledge.