Machine Learning in Natural Language • No lecture on Thursday. Instead: Monday, 4pm, 1404SC, Mark Johnson lectures on: Bayesian Models of Language Acquisition
Machine Learning in Natural Language: Features and Kernels • The idea of kernels • Kernel Perceptron • Structured Kernels • Tree and Graph Kernels • Lessons • Multi-class classification
Embedding • Example: disambiguating whether vs. weather. • Blowing up the feature space can be done explicitly (generate expressive features) or implicitly (use kernels). • In the new (embedded) space the discriminator is functionally simpler.
Kernel Based Methods • A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector. • Computing the weight vector is done in the original space. • Notice: this pertains only to efficiency. • Generalization is still relative to the real dimensionality. • This is the main trick in SVMs (the algorithm itself is different), although many applications actually use linear kernels.
Kernel Based Methods • Let I be the set {t1, t2, t3, …} of monomials (conjunctions) over the feature space x1, x2, …, xn. • Then we can write a linear function over this new feature space: f(x) = Th_θ( Σ_{i∈I} w_i t_i(x) ).
Kernel Based Methods • Great increase in expressivity. • Can run Perceptron, Winnow, or Logistic Regression, but the convergence bound may suffer exponential growth. • An exponential number of monomials is true in each example. • Also, we will have to keep many weights.
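A tiny sketch of the blow-up just described, for Boolean features: the monomial space has 3^n conjunctions (each variable appears positively, negatively, or not at all), and every example satisfies 2^n of them. The encoding and function names below are my own illustration, not part of the original lecture.

from itertools import product

def monomials(n):
    # a monomial is a tuple in {+1, -1, 0}^n (0 = variable absent)
    return list(product((1, -1, 0), repeat=n))

def satisfies(x, mono):
    # every present literal must match the example's polarity
    return all(m == 0 or (m == 1) == bool(xi) for xi, m in zip(x, mono))

n = 3
space = monomials(n)
x = (1, 0, 1)
active = [m for m in space if satisfies(x, m)]
print(len(space), len(active))   # 27 monomials in total (3^3), 8 active (2^3)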
The Kernel Trick (1) • Consider the value of w_i used in the prediction. • Each previous mistake, on example z, makes an additive contribution of +/-1 to w_i, iff t_i(z) = 1. • The value of w_i is determined by the number of mistakes on which t_i() was satisfied.
The Kernel Trick (2) • P – set of examples on which we Promoted • D – set of examples on which we Demoted • M = P ∪ D • The prediction can therefore be written as f(x) = Th_θ( Σ_{i∈I} [ Σ_{z∈P, t_i(z)=1} 1 − Σ_{z∈D, t_i(z)=1} 1 ] t_i(x) ).
The Kernel Trick (3) • P – set of examples on which we Promoted • D – set of examples on which we Demoted • M = P ∪ D • Where S(z) = 1 if z ∈ P and S(z) = -1 if z ∈ D. Reordering the sums: f(x) = Th_θ( Σ_{z∈M} S(z) Σ_{i∈I} t_i(z) t_i(x) ).
The Kernel Trick (4) • S(y) = 1 if y ∈ P and S(y) = -1 if y ∈ D. • A mistake on z contributes the value +/-1 to all monomials satisfied by z. The total contribution of z to the sum is equal to the number of monomials that are satisfied by both x and z. • Define a dot product in the t-space: K(x,z) = Σ_{i∈I} t_i(z) t_i(x). • We get the standard notation: f(x) = Th_θ( Σ_{z∈M} S(z) K(x,z) ).
Kernel Based Methods • What does this representation give us? • We can view this kernel as the distance between x and z measured in the t-space. • But K(x,z) can be computed in the original space, without explicitly writing the t-representation of x and z.
Kernel Based Methods • Consider the space of all 3^n monomials (allowing both positive and negative literals). • Then K(x,z) = 2^{same(x,z)}, where same(x,z) is the number of features that have the same value for both x and z. • Example: take n=2, x=(0,0), z=(0,1); then same(x,z)=1 and K(x,z)=2 (the monomials satisfied by both are the empty monomial and ¬x1). • Proof: let k = same(x,z); for each of the k agreeing features we can either (1) include its literal, with the right polarity, in the monomial, or (2) not include it at all, giving 2^k monomials satisfied by both. • Other kernels can be used.
Implementation • Simply run Perceptron in an on-line mode, but keep track of the set M. • Keeping the set M allows us to keep track of S(z). • Rather than remembering the weight vector w, remember the set M (P and D) – all those examples on which we made mistakes. • This is the Dual Representation.
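A minimal sketch of this dual-form Perceptron, using the 2^same(x,z) monomial kernel from the previous slide. The function names (kernel_perceptron, monomial_kernel), the toy target concept, and the epoch count are my own illustrative assumptions, not part of the original lecture.

import numpy as np

def monomial_kernel(x, z):
    """K(x, z) = 2^same(x, z): the number of monomials (over positive and
    negative literals) satisfied by both Boolean vectors x and z."""
    return 2.0 ** int(np.sum(np.asarray(x) == np.asarray(z)))

def kernel_perceptron(X, y, kernel=monomial_kernel, epochs=50):
    """Dual-form Perceptron: instead of a weight vector over the huge
    monomial space, keep the mistake set M with signs S(z)."""
    M = []                                    # list of (example z, S(z)) pairs
    for _ in range(epochs):
        for x, label in zip(X, y):
            # f(x) = sign( sum_{z in M} S(z) K(x, z) )
            score = sum(s * kernel(x, z) for z, s in M)
            if (1 if score > 0 else -1) != label:
                M.append((x, label))          # promote (+1) or demote (-1)
    return M

# Toy usage: the target concept is the monomial x1 AND (NOT x2)
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, +1, -1]
M = kernel_perceptron(X, y)
for x in X:
    score = sum(s * monomial_kernel(x, z) for z, s in M)
    print(x, +1 if score > 0 else -1)         # recovers the labels above

Note how only the mistake set M is ever stored: the weight vector over the exponential monomial space is never written down.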
Summary – Kernel Based Methods I • A method to run Perceptron on a very large feature set, without incurring the cost of keeping a very large weight vector. • Computing the weight vector can still be done in the original feature space. • Notice: this pertains only to efficiency: the classifier is identical to the one you get by blowing up the feature space. • Generalization is still relative to the real dimensionality. • This is the main trick in SVMs (the algorithm itself is different), although most applications actually use linear kernels.
Summary – Kernel Trick • Separating hyperplanes (produced by Perceptron, SVM) can be computed in terms of dot products over a feature based representation of examples. • We want to define a dot product in a high dimensional space. • Given two examples x = (x1, x2, …, xn) and y = (y1, y2, …, yn) we want to map them to a high dimensional space [example: quadratic]: φ(x1, x2, …, xn) = (x1, …, xn, x1², …, xn², x1·x2, …, xn-1·xn) and φ(y1, y2, …, yn) = (y1, …, yn, y1², …, yn², y1·y2, …, yn-1·yn), and compute the dot product A = φ(x)·φ(y) [takes O(n²) time]. • Instead, in the original space, compute B = f(x·y) = [1 + (x1, x2, …, xn)·(y1, y2, …, yn)]². • Theorem: A = B. • Coefficients do not really matter; this can be done for other functions as well.
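A quick numeric check of this claim. This is a sketch: the explicit map phi below includes the sqrt(2) coefficients that make the equality exact rather than up to coefficients, and all names are my own choices.

import numpy as np

def phi(x):
    """Explicit quadratic embedding (with the sqrt(2) coefficients that make
    the dot product match the kernel exactly)."""
    n = len(x)
    feats = [1.0]
    feats += [np.sqrt(2) * xi for xi in x]            # linear terms
    feats += [xi * xi for xi in x]                    # squares
    feats += [np.sqrt(2) * x[i] * x[j]                # cross terms
              for i in range(n) for j in range(i + 1, n)]
    return np.array(feats)

def poly2_kernel(x, y):
    """B = (1 + x . y)^2, computed entirely in the original space."""
    return (1.0 + np.dot(x, y)) ** 2

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])
A = np.dot(phi(x), phi(y))       # dot product in the embedded space, ~n^2 work
B = poly2_kernel(x, y)           # same value, O(n) work
print(A, B)                      # 30.25 30.25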
Efficiency-Generalization Tradeoff • There is a tradeoff between the computational efficiency with which these kernels can be computed and the generalization ability of the classifier. • For example, using such kernels the Perceptron algorithm can make an exponential number of mistakes even when learning simple functions. • In addition, computing with kernels depends strongly on the number of examples. It turns out that sometimes working in the blown up space is more efficient than using kernels. • Next: More Complicated Kernels
Structured Input [Figure: a sentence represented as a labeled graph, with nodes carrying Word=, POS=, IS-A= attributes] • S = John will join the board as a director • [NP Which type] [PP of] [NP submarine] [VP was bought] [ADVP recently] [PP by] [NP South Korea] (. ?) Knowledge Representation
Learning From Structured Input • We want to extract features from structured domain elements; their internal (hierarchical) structure should be encoded. • A feature is a mapping from the instance space to {0,1} or [0,1]. • With an appropriate representation language it is possible to represent expressive features that constitute an infinite dimensional space [FEX]. • Learning can be done in the infinite attribute domain. • What does it mean to extract features? • Conceptually: different data instantiations may be abstracted to yield the same representation (quantified elements). • Computationally: some kind of graph matching process. • Challenge: • Provide the expressivity necessary to deal with large scale and highly structured domains. • Meet the strong tractability requirements for these tasks.
Example • Only those descriptions that are ACTIVE in the input are listed. • Michael Collins developed kernels over parse trees. • Cumby/Roth developed parameterized kernels over structures. • When is it better to use a kernel vs. using the primal representation? D = (AND word (before tag)) Explicit features
Overview – Goals (Cumby & Roth 2003) • Applying kernel learning methods to structured domains. • Develop a unified formalism for structured kernels (Collins & Duffy, Gaertner & Lloyd, Haussler): a flexible language that measures distance between structures with respect to a given 'substructure'. • Examine complexity & generalization across different feature sets and learners: when does each type of feature set perform better, and with which learners? • Exemplify with experiments from bioinformatics & NLP: mutagenesis, named-entity prediction.
Feature Description Logic • A flexible knowledge representation for feature extraction from structured data. • Domain elements are represented as labeled graphs. • Concept graphs correspond to FDL expressions. • FDL is formed from an alphabet of attribute, value, and role symbols. • Well defined syntax and equivalent semantics. • E.g., descriptions are defined inductively with sensors as primitives. • Sensor: a basic description – a term of the form a(v), or a, where a = attribute symbol and v = value symbol (ground sensor). • An existential sensor a describes an object that has some value for attribute a. • AND clauses, and (role D) clauses for relations between objects. • Expressive and efficient feature extraction. Knowledge Representation
Example (Cont.) • Features; Feature Generation Functions; extensions; Subsumption… (see paper). Basically: • Only those descriptions that are ACTIVE in the input are listed. • The language is expressive enough to generate linguistically interesting features such as agreements, etc. D = (AND word (before tag)) {Dθ} = {(AND word(the) (before tag(N))), (AND word(dog) (before tag(V))), (AND word(ran) (before tag(ADV))), (AND word(very) (before tag(ADJ)))} Explicit features
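A minimal sketch of generating the active features listed above for D = (AND word (before tag)). The token/tag pairs and the function name are illustrative assumptions, not the actual FDL/FEX implementation.

def active_features(tokens, tags):
    """For each position, emit the feature that conjoins the word with the
    tag of the word immediately after it (the 'before' relation)."""
    feats = []
    for i in range(len(tokens) - 1):
        feats.append(f"(AND word({tokens[i]}) (before tag({tags[i + 1]})))")
    return feats

sentence = ["the", "dog", "ran", "very", "fast"]
tags     = ["DET", "N",   "V",   "ADV",  "ADJ"]
for f in active_features(sentence, tags):
    print(f)
# Only features ACTIVE in this input are generated: the underlying feature
# space can be huge, but each example touches only a handful of features.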
Kernels • It is possible to define FDL-based kernels for structured data. • When using linear classifiers it is important to enhance the set of features to gain expressivity. • A common way: blow up the feature space by generating functions of primitive features. • For some algorithms – SVM, Perceptron – kernel functions can be used to expand the feature space while still working in the original space. • Is it worth doing in structured domains? Answers are not clear so far. • Computationally: yes, when we simulate a huge space. • Generalization: not always [Khardon, Roth, Servedio, NIPS'01; Ben-David et al.] Kernels
Kernels in Structured Domains • We define a kernel family K parameterized by FDL descriptions. • The definition is recursive on the definition of D [sensor, existential sensor; role description; AND]. • Key: many previous structured kernels considered all substructures (e.g., Collins & Duffy 02, Tree Kernels); this is analogous to an exponential feature space and risks overfitting. • Generalization issues & computation issues [if the # of examples is large]: if the feature space is explicitly expanded, one can use algorithms such as Winnow (SNoW) [complexity and experimental results]. Kernels
FDL Kernel Definition • Kernel family K parameterized by feature type descriptions. For a description D and nodes n1, n2: • If D is a sensor s(v) that is a label of both n1 and n2, then K_D(n1,n2) = 1. • If D is an existential sensor s and sensor descriptions s(v1), s(v2), …, s(vj) are labels of both n1 and n2, then K_D(n1,n2) = j. • If D is a role description (r D'), then K_D(n1,n2) = Σ K_D'(n1', n2'), summing over those nodes n1', n2' that have an r-labeled edge from n1, n2. • If D is a description (AND D1 D2 ... Dn) with l_i repetitions of any D_i, then K_D(n1,n2) is the product of the component kernels K_Di(n1,n2), adjusted for the l_i repetitions. Kernels
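A simplified sketch of the recursive case structure above, on a toy labeled graph. The Node class, the tuple encoding of descriptions, and the plain product used for the AND case are my own illustrative assumptions, not the actual FDL kernel implementation.

class Node:
    def __init__(self, labels, edges=None):
        self.labels = labels            # e.g. {"word": "dog", "tag": "N"}
        self.edges = edges or {}        # e.g. {"before": [next_node]}

def K(D, n1, n2):
    kind = D[0]
    if kind == "sensor":                # ("sensor", a, v): 1 if a(v) labels both nodes
        _, a, v = D
        return 1.0 if n1.labels.get(a) == v and n2.labels.get(a) == v else 0.0
    if kind == "exists":                # ("exists", a): do the nodes share a value for a?
        _, a = D
        v1, v2 = n1.labels.get(a), n2.labels.get(a)
        return 1.0 if v1 is not None and v1 == v2 else 0.0
    if kind == "role":                  # ("role", r, D2): sum over pairs of r-neighbors
        _, r, D2 = D
        return sum(K(D2, m1, m2)
                   for m1 in n1.edges.get(r, [])
                   for m2 in n2.edges.get(r, []))
    if kind == "and":                   # ("and", D1, ..., Dk): product of the parts
        result = 1.0
        for Di in D[1:]:
            result *= K(Di, n1, n2)
        return result
    raise ValueError(f"unknown description {D!r}")

# Toy usage: D = (AND word (before tag)) on two "dog" nodes whose following
# words share the tag V.
ran1, ran2 = Node({"word": "ran", "tag": "V"}), Node({"word": "runs", "tag": "V"})
dog1 = Node({"word": "dog", "tag": "N"}, {"before": [ran1]})
dog2 = Node({"word": "dog", "tag": "N"}, {"before": [ran2]})
D = ("and", ("exists", "word"), ("role", "before", ("exists", "tag")))
print(K(D, dog1, dog2))   # 1.0: the words match and the following tags match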
Kernel Example • D = (AND word (before word)) • G1: The dog ran very fast • G2: The dog ran quickly • The final output is 2, since there are 2 matching collocations (the-dog and dog-ran). • Can simulate Boolean kernels, as seen in Khardon, Roth et al. Kernels
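A quick way to reproduce that count. This is a sketch that just counts shared word bigrams ("collocations") rather than going through the full FDL machinery; the function name is my own.

from collections import Counter

def bigram_kernel(s1, s2):
    """Count collocations (word bigrams) shared by the two sentences,
    weighting repeated bigrams by the product of their counts."""
    b1, b2 = Counter(zip(s1, s1[1:])), Counter(zip(s2, s2[1:]))
    return sum(c * b2[b] for b, c in b1.items())

g1 = "the dog ran very fast".split()
g2 = "the dog ran quickly".split()
print(bigram_kernel(g1, g2))   # 2: the-dog and dog-ran match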
Complexity & Generalization • How does this compare in complexity and generalization to other kernels for structured data? • For m examples, with average example size g and time t1 to evaluate the kernel, kernel Perceptron takes O(m²g²t1). • If extracting a feature explicitly takes t2, Perceptron takes O(mgt2). • Most kernels that simulate a well defined feature space have t1 << t2. • By restricting the size of the expanded feature space we avoid overfitting – even SVM suffers under many irrelevant features (Weston). • Margin argument: the margin goes down when you have more features. • Given a linearly separable set of points S = {x1, …, xm} ⊂ R^n with separator w ∈ R^n, embed S into an n' > n dimensional space by adding zero-mean random noise e to the additional n' − n dimensions, such that w' = (w, 0) ∈ R^n' still separates S. • The margin with respect to w' is unchanged, but the norm of the (noisy) examples grows, so the relative margin – and with it the mistake/generalization bound – gets worse. Analysis
Experiments • Serve as comparison: our features with kernel Perceptron, normal Winnow, and all-subtrees expanded features. • Bioinformatics experiment in mutagenesis prediction: 188 compounds with atom-bond data, binary prediction; 10-fold cross validation with 12 training runs. • NLP experiment in classifying detected NEs: 4700 training and 1500 test phrases from MUC-7; person, location, & organization. • Trained and tested with kernel Perceptron and Winnow (SNoW) classifiers, using the FDL kernel & the respective explicit features; also an all-subtrees kernel based on Collins & Duffy's work. [Figure: mutagenesis concept graph; features simulated with the all-subtrees kernel]
Discussion [Table: micro-averaged accuracy] • We have a kernel that simulates the features obtained with FDL. • But quadratic training time means it is cheaper to extract the features explicitly and learn in the primal than to run kernel Perceptron. • SVM could take (slightly) even longer, but maybe perform better. • But restricted features might work better than the larger spaces simulated by other kernels. • Can we improve on the benefits of useful features? Compile examples together? More sophisticated kernels than the matching kernel? • Still provides a metric for similarity based approaches.
Conclusion • Kernels for learning from structured data are an interesting idea. • Different kernels may expand/restrict the hypothesis space in useful ways; we need to know the benefits and hazards. • To justify these methods we must embed in a space much larger than the training set size, which can decrease the margin. • Expressive knowledge representations can be used to create features explicitly or in implicit kernel spaces. • The data representation could allow us to plug in different base kernels to replace the matching kernel. • A parameterized kernel allows us to direct the way the feature space is blown up, so as to encode background knowledge.