Kernels for Text Processing 류법모 Pum-Mo.Ryu@kaist.ac.kr
Contents • Kernels for Text Processing • Kernels for Text Categorization • Vector Space Kernel • Latent Semantic Kernel • Fisher Kernel • Kernel for Syntactic Parsing • Tree Kernel • Conclusion
Kernels for Text Processing (1) • A kernel is a function K: X × X → R satisfying the following properties • Symmetric: K(x, z) = K(z, x) • Positive definite • A function that calculates the inner product between mapped texts in a feature space is a kernel function • For any mapping φ: X → F, K(x, z) = ⟨φ(x), φ(z)⟩ is a kernel function • A kernel measures the similarity of input texts
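As a concrete illustration of the last point, here is a minimal sketch of a kernel built from an explicit feature map; the bag-of-words map `phi` and the example sentences are illustrative choices, not taken from the slides.

```python
from collections import Counter

def phi(text):
    """Feature map: a text becomes a sparse vector of word counts."""
    return Counter(text.lower().split())

def kernel(x, z):
    """K(x, z) = <phi(x), phi(z)>: inner product in the feature space."""
    fx, fz = phi(x), phi(z)
    return sum(fx[w] * fz[w] for w in fx if w in fz)

print(kernel("the cat sat on the mat", "the cat ate the fish"))  # -> 5
```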
Kernels for Text Processing (2) • Kernels can be applied to very rich feature spaces, provided the inner products can be computed • Text categorization, syntactic parsing • Speech recognition • DNA and protein sequences • A kernel includes • A function to generate all possible features for a given input • A function to assign values to the generated features • A function to compute the inner product between feature vectors
I. Kernels for Text Categorization 1. Text Categorization 2. Vector Space Kernel 3. Latent Semantic Kernel 4. Fisher Kernel
Text Categorization • The task of text categorization is assigning documents to one or more predefined categories based on their content. • [Figure: a text categorizer, trained on categorized training documents, makes yes/no decisions for new documents over candidate categories such as earn, acq, money-fx, and grain]
Representation of Documents • Usually, a document is represented as a series of feature-value pairs. The features can be arbitrarily abstract (as long as they are easily computable) or very simple. • For example, the features could be the set of all words and the values their numbers of occurrences in a particular document (bag-of-words model). • Example: the headline "Japan Firm Plans to Sell U.S. Farmland to Japanese" is represented as Farmland:1, Firm:1, Japan:1, Japanese:1, Plans:1, Sell:1, To:2, U.S.:1
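A small sketch of this bag-of-words representation, applied to the headline above; the tokenization details (simple whitespace splitting, case-insensitive counting, keeping "U.S." as one token) are assumptions.

```python
from collections import Counter

headline = "Japan Firm Plans to Sell U.S. Farmland to Japanese"
bag = Counter(w.lower() for w in headline.split())
print(bag)
# Counter({'to': 2, 'japan': 1, 'firm': 1, 'plans': 1, 'sell': 1,
#          'u.s.': 1, 'farmland': 1, 'japanese': 1})
```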
Similarity Measures • Most text categorization methods are based on the similarity of documents • Cosine similarity • Euclidean distance • Kullback-Leibler divergence: a distance between two probability distributions • Kernel functions
Performance Measures • Given n test documents, the following 2×2 contingency table can be considered: a = selected and in the target category, b = selected but not in the target, c = not selected but in the target, d = neither selected nor in the target. • Precision = a / (a + b): the proportion of selected documents that are correct. • Recall = a / (a + c): the proportion of target documents that the system selected. • There is a trade-off between precision and recall. The F-measure can be used for evaluation at fixed cutoffs if both recall and precision are important. • Other measures: accuracy, error, miss, false alarm (fallout), break-even point, …
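A short sketch of these measures computed from the contingency counts a, b, c (d is not needed for precision, recall, or F); the numeric example is made up.

```python
def precision(a, b):
    """Proportion of selected documents that are in the target category."""
    return a / (a + b)

def recall(a, c):
    """Proportion of target documents that were selected."""
    return a / (a + c)

def f_measure(a, b, c, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    p, r = precision(a, b), recall(a, c)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(precision(8, 2), recall(8, 4), round(f_measure(8, 2, 4), 3))
# 0.8  0.666...  0.727
```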
Vector Space Kernel - BVSM • A document is mapped to a vector d • Each entry records the number of times a particular word stem is used in the document • Tens of thousands of entries, extremely sparse • Basic Vector Space Model (BVSM) • K(d1,d2) = d1’d2 • Treats terms as uncorrelated
Vector Space Kernel - GVSM • Generalized Vector Space Model (GVSM) • A document is characterized by its relation to other documents • D = [d1, … dm] : term-by-document matrix • DD’ : term-by-term matrix • A nonzero (i, j) entry means one or more documents contain both the i-th and the j-th term • K(d1,d2) = (D’d1)’(D’d2) = d1’DD’d2 • D’di : di is represented by its relation to the other documents
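A toy sketch of the BVSM and GVSM kernels on a made-up term-by-document matrix D (rows = terms, columns = documents); the numbers are purely illustrative.

```python
import numpy as np

D = np.array([[1, 0, 2],
              [0, 1, 1],
              [1, 1, 0]])          # 3 terms x 3 documents
d1, d2 = D[:, 0], D[:, 1]          # two document vectors

k_bvsm = d1 @ d2                   # K(d1, d2) = d1' d2
k_gvsm = d1 @ D @ D.T @ d2         # K(d1, d2) = d1' D D' d2
print(k_bvsm, k_gvsm)
```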
Latent Semantic Indexing (1) • LSI, which is closely related to principal component analysis (PCA), captures semantic information through co-occurrence analysis in the corpus • Document feature vectors are projected into the subspace spanned by the first k singular vectors of the feature space • Singular Value Decomposition (SVD) of D • D = U∑V’ • U and V are orthogonal • ∑ is a diagonal matrix with the same dimensions as D
Latent Semantic Indexing (2) • Singular Value Decomposition of D • D = U∑V’ • The singular values are sorted in decreasing order. The highest k singular values are kept to reduce the dimensionality; k is chosen at the point where the singular values drop off sharply.
Latent Semantic Indexing (3) • B = ∑_{2×2} V’_{2×d} : the documents after rescaling with the singular values and reduction to two dimensions • B’B : the document correlation matrix
Latent Semantic Indexing (4) • A query q is projected into the reduced-dimensional space and compared with the existing documents.
Latent Semantic Kernel (1) • Term-by-document matrix • D = [d1, … dm] • Singular Value Decomposition of D • D = U∑V’ • Projection matrix • P = Uk’ = IkU’ • Ik: the identity matrix with only the first k diagonal elements nonzero • Latent Semantic Kernel • Project documents onto the first k dimensions and calculate the inner product between the projected documents • K(d1,d2) = (Pd1)’(Pd2) = d1’P’Pd2
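A minimal sketch of the latent semantic kernel: compute the SVD of a toy term-by-document matrix, keep the first k left singular vectors as the projection P, and take the inner product of projected documents. The matrix and the choice k = 2 are assumptions for illustration.

```python
import numpy as np

D = np.array([[1, 0, 2, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 2],
              [0, 2, 1, 1]], dtype=float)   # terms x documents

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
P = U[:, :k].T                               # projection P = Uk'

def lsk(da, db):
    """K(d1, d2) = (P d1)' (P d2) = d1' P' P d2."""
    return (P @ da) @ (P @ db)

print(lsk(D[:, 0], D[:, 1]))
```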
Latent Semantic Kernel (2) • An SVM classifier trained with an LSK can perform approximately as well as the baseline method even with only 200 dimensions
Fisher Kernel (1) • A kernel function derived from a generative probability model • Consider a family of generative models P(x|θ), smoothly parameterized by θ = (θ1, …, θr) • Fisher score • The gradient of the log-likelihood: Ux = ∇θ log P(x|θ) • Fisher information matrix • I = Ex[Ux Ux’] • Fisher kernel • K(x, y) = Ux’ I⁻¹ Uy
Fisher Kernel (2) • Memoryless information source • D = {d1, … , dN} : set of documents • W = {w1, … , wM} : set of words • A document di can be viewed as a probability distribution over word occurrences • A document di can be represented by the multinomial probability distribution P̂(wj|di) = n(di, wj) / ∑j’ n(di, wj’), which denotes the probability that a generic word occurrence in document di will be wj, where n(di, wj) is the number of times word wj occurs in document di
Fisher Kernel (3) • Latent class analysis • Observable variables: D, W • Unobserved class variable: Z = {z1, … , zK} • An unobserved class variable zk is associated with each observation, i.e. with each word occurrence (di, wj) • Joint probability model over D × W: P(di, wj) = ∑k P(zk) P(di|zk) P(wj|zk) • Conditional independence assumption • di and wj are independent conditioned on the state of the associated latent variable • Two possible parameters • P(zk), P(wj|zk)
Fisher Kernel (4) • Document model • P(wj|di) = ∑k P(wj|zk) P(zk|di) • Average log-probability of a document di • The log-probability of all the word occurrences in di, normalized by document length: l(di) = ∑j P̂(wj|di) log P(wj|di)
Fisher Kernel (5) • EM algorithm for model fitting • The EM algorithm iterates the following two steps to find parameters that maximize l(di) • An Expectation (E) step, where posterior probabilities P(zk|di, wj) are computed for the latent variables • A Maximization (M) step, where the parameters are updated from the expected counts; n(di, wj) denotes the number of occurrences of word wj in document di
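A compact sketch of one way to run EM for the aspect model with parameters P(z), P(d|z), P(w|z). The toy count matrix n, the random initialization, and the fixed number of iterations are assumptions; the update equations are the standard E and M steps for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(1, 5, size=(6, 8)).astype(float)   # n(d_i, w_j), toy counts
N, M, K = n.shape[0], n.shape[1], 3

Pz = np.full(K, 1.0 / K)                             # P(z_k)
Pd_z = rng.random((N, K)); Pd_z /= Pd_z.sum(axis=0)  # P(d_i | z_k)
Pw_z = rng.random((M, K)); Pw_z /= Pw_z.sum(axis=0)  # P(w_j | z_k)

for _ in range(50):
    # E-step: posterior P(z_k | d_i, w_j), shape (N, M, K)
    joint = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w)
    weighted = n[:, :, None] * post
    Pd_z = weighted.sum(axis=1); Pd_z /= Pd_z.sum(axis=0)
    Pw_z = weighted.sum(axis=0); Pw_z /= Pw_z.sum(axis=0)
    Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()

print(Pz)
```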
Fisher Kernel (6) • Two possible parameters: P(zk), P(wj|zk) • The Fisher information matrix can be approximated by the identity matrix. • Kernel when the parameters are P(zk) (following Hofmann 2000): K(di, dj) = ∑k P(zk|di) P(zk|dj) / P(zk) • Kernel when the parameters are P(wj|zk): K(di, dj) = ∑j P̂(wj|di) P̂(wj|dj) ∑k P(zk|di, wj) P(zk|dj, wj) / P(wj|zk)
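Continuing the EM sketch above (and reusing its n, Pz, Pd_z, and Pw_z arrays), here is a sketch of the two kernels. The formulas follow Hofmann (2000); the variable names are mine.

```python
import numpy as np

# Recompute the posterior P(z_k | d_i, w_j) from the fitted parameters
joint = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]
post = joint / joint.sum(axis=2, keepdims=True)

Phat = n / n.sum(axis=1, keepdims=True)          # empirical P^(w_j | d_i)
Pz_d = Pd_z * Pz[None, :]                        # proportional to P(z_k | d_i)
Pz_d /= Pz_d.sum(axis=1, keepdims=True)

def fisher_k1(i, j):
    """Kernel for parameters P(z_k): sum_k P(z_k|d_i) P(z_k|d_j) / P(z_k)."""
    return np.sum(Pz_d[i] * Pz_d[j] / Pz)

def fisher_k2(i, j):
    """Kernel for parameters P(w_j|z_k)."""
    inner = np.einsum('wk,wk,wk->w', post[i], post[j], 1.0 / Pw_z)
    return np.sum(Phat[i] * Phat[j] * inner)

print(fisher_k1(0, 1), fisher_k2(0, 1))
```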
Fisher Kernel (7) • Classification errors
II. Kernel for Syntactic Parsing 1. Syntactic Parsing 2. Tree Kernel
Syntactic Parsing • Syntactic parsing is the process of identifying the syntactic structure of sentences given a grammar • Building syntactic trees for input sentences • Grammars • Context Free Grammar (Phrase Structure Grammar) • Dependency Grammar • Example grammar (CFG) • S → N VP • VP → V NP • NP → N PP • N → chairman | Lou Gerstner | IBM • PP → P N • V → is • P → of • [Figure: parse tree for "Lou Gerstner is chairman of IBM"]
Syntactic Ambiguity • Many syntactic trees can be generated for an input sentence using a given grammar • Sometimes it is very difficult to resolve the ambiguities at the sentence level • Grammar rules alone don’t have enough information to resolve the ambiguities • [Figure: two parse trees t1 and t2 for "astronomers saw boys with telescope", differing in the attachment of the PP "with telescope"]
Probabilistic Context Free Grammar (PCFG) • A PCFG, which assigns probability values to grammar rules, can be used to resolve syntactic ambiguity (Collins 99) • Generate all possible syntactic trees for the input sentence • The score of a syntactic tree is calculated by multiplying the probability values of all grammar rules applied in the tree • Select the most probable (highest-scoring) tree for the sentence • CFG with probabilities: S → N VP 1.0, VP → V NP 0.3, VP → VP PP 0.7, … • [Figure: the candidate trees t1 and t2 annotated with rule probabilities; each tree score is the product of its rule probabilities]
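A sketch of scoring a tree under a PCFG by multiplying the probabilities of its rules. Only the three rule probabilities from the slide are used; rules not listed (including lexical rules) default to 1.0 here, which is an assumption made purely for illustration.

```python
rule_prob = {
    ("S", ("N", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.3,
    ("VP", ("VP", "PP")): 0.7,
    # ... remaining rules of the grammar would be listed here
}

def tree_score(tree):
    """tree = (label, [children]) for internal nodes; a word is a plain string."""
    if isinstance(tree, str):                      # leaf: a word
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob.get((label, rhs), 1.0)           # unlisted rules default to 1.0
    for c in children:
        p *= tree_score(c)
    return p

# Toy sub-tree using VP -> VP PP over VP -> V NP: 0.7 * 0.3 = 0.21
t = ("VP", [("VP", [("V", ["saw"]), ("NP", [("N", ["boys"])])]),
            ("PP", [("P", ["with"]), ("N", ["telescope"])])])
print(tree_score(t))   # 0.21
```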
Linear Model for Parsing (1) • This model assigns scores to the candidate syntactic trees of an input sentence and selects the most probable tree • Training data: {(si, ti)}, where si is a sentence and ti is the correct tree for that sentence • Set of candidate trees for a particular sentence si • C(si) = {xi1, xi2, …}, with xi1 = ti • Each candidate xij is represented by a feature vector h(xij) in the space Rⁿ • Score of a tree xij: score(xij) = w · h(xij), where w is the model parameter vector • The output of the model on a training or test example s is F(s) = argmax over x in C(s) of w · h(x)
Linear Model for Parsing (2) • The score function should satisfy the condition that the correct tree scores higher than every other candidate: w · h(xi1) > w · h(xij) for all j ≥ 2 • The SVM can be formulated as a search for the αi,j that determine the optimal weights of the model parameter vector: w = ∑i,j αi,j (h(xi1) − h(xij)) • The score of a parse tree can then be calculated as score(x) = ∑i,j αi,j (K(xi1, x) − K(xij, x)), where K(x, y) = ⟨h(x), h(y)⟩ is the kernel function
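A sketch of the kernelized scoring in the dual form above: the weight vector is never built explicitly. The names alpha, train_pairs, and tree_kernel are assumptions; they stand for the learned dual weights, the (correct tree, candidate tree) pairs from training, and any tree kernel such as the one defined in the following slides.

```python
def score(x, alpha, train_pairs, tree_kernel):
    """score(x) = sum over (i, j) of alpha_ij * (K(x_i1, x) - K(x_ij, x))."""
    return sum(a * (tree_kernel(xi1, x) - tree_kernel(xij, x))
               for a, (xi1, xij) in zip(alpha, train_pairs))

def predict(candidates, alpha, train_pairs, tree_kernel):
    """Return the highest-scoring candidate tree for a sentence."""
    return max(candidates,
               key=lambda x: score(x, alpha, train_pairs, tree_kernel))
```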
Tree Kernels (1) • The tree kernel is a kind of convolution kernel • Recursive calculation over the ‘parts’ of a discrete structure • The string kernel is also a kind of convolution kernel • T : a parse tree • hs(T) : the number of times the s-th sub-tree occurs in T • A sub-tree must itself be built from complete grammar rules • [Figure: sub-trees of the NP covering "the man"]
Tree Kernels (2) • A tree T is represented as • h(T) = {h1(T), h2(T), … , hd(T)} • Kernel for trees • K(T1,T2) = ⟨h(T1), h(T2)⟩ = ∑s hs(T1) hs(T2) • Direct computation is intractable: there is an exponential number of sub-trees • [Figure: T1 and T2 share two sub-trees, so K(T1,T2) = (1×1) + (1×1) = 2]
Tree Kernels (3) • Efficient tree kernel • Is(n) : indicator function • 1 if sub-tree s is seen rooted at node n • 0 otherwise • N1, N2 : the sets of nodes in trees T1 and T2 • Then hs(T) = ∑n Is(n), so K(T1,T2) = ∑s hs(T1) hs(T2) = ∑n1 ∑n2 ∑s Is(n1) Is(n2) = ∑n1 ∑n2 C(n1, n2) • C(n1, n2) : the number of common sub-trees rooted at both n1 and n2
Tree Kernels (4) • C(n1, n2) : the number of common sub-trees rooted at both n1 and n2 • If the productions at n1 and n2 are different, then C(n1, n2) = 0 • Else if the productions at n1 and n2 are the same, and both n1 and n2 are pre-terminals, then C(n1, n2) = 1 • Else if the productions at n1 and n2 are the same and n1 and n2 are not pre-terminals, then C(n1, n2) = ∏ from i = 1 to nc(n1) of (1 + C(ch(n1, i), ch(n2, i))) • nc(n) : the number of non-terminals directly below n in the tree • ch(n, i) : the i-th child node of n • Pre-terminals are nodes directly above words in the surface string • The production at n is the grammar rule applied at n
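A minimal sketch of this recursion, assuming trees are written as (label, [children]) tuples with words as plain strings; the "the man" / "the woman" example echoes the sub-tree figure earlier but the encoding is mine.

```python
def label(n):
    return n if isinstance(n, str) else n[0]

def production(n):
    """The grammar rule applied at internal node n: (label, child labels)."""
    return (n[0], tuple(label(c) for c in n[1]))

def is_preterminal(n):
    """A pre-terminal is a node whose children are all words (strings)."""
    return all(isinstance(c, str) for c in n[1])

def C(n1, n2):
    """Number of common sub-trees rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):
        if isinstance(c1, str):        # word child, already matched via the production
            continue
        result *= 1 + C(c1, c2)
    return result

def internal_nodes(t):
    """All internal (non-word) nodes of tree t."""
    if isinstance(t, str):
        return []
    nodes = [t]
    for c in t[1]:
        nodes.extend(internal_nodes(c))
    return nodes

def tree_kernel(t1, t2):
    """K(T1, T2) = sum over all node pairs of C(n1, n2)."""
    return sum(C(a, b) for a in internal_nodes(t1) for b in internal_nodes(t2))

t1 = ("NP", [("D", ["the"]), ("N", ["man"])])
t2 = ("NP", [("D", ["the"]), ("N", ["woman"])])
print(tree_kernel(t1, t2))   # -> 3 common sub-trees
```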
Tree Kernels (5) • Example of C(n1, n2) computations [figure omitted] • ⟨h(T1), h(T2)⟩ can be calculated in O(|N1| |N2|) time • Experimental results: LR/LP = labeled recall/precision [table omitted]
Conclusions • Kernels for text processing were introduced • Vector space kernel • Latent semantic kernel • Fisher kernel • Tree kernel • Kernels can be applied to very rich feature spaces, provided the inner products can be computed • Kernels can be applied to other text processing areas • POS tagging • Calculating word co-occurrence information • Word sense disambiguation • Word clustering
References • Thomas Hofmann (2000), “Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization”, in Advances in Neural Information Processing Systems 12, MIT Press, 2000 • Tommi S. Jaakkola and David Haussler (1998), “Exploiting Generative Models in Discriminative Classifiers”, in Advances in Neural Information Processing Systems 11, MIT Press, 1998 • Nello Cristianini et al. (2002), “Latent Semantic Kernels”, Journal of Intelligent Information Systems, vol. 18, no. 2 • Michael Collins and Nigel Duffy (2001), “Convolution Kernels for Natural Language”, NIPS 2001 • Huma Lodhi et al. (2002), “Text Classification using String Kernels”, Journal of Machine Learning Research, vol. 2, 2002 • Thorsten Joachims (1998), “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”, in Proceedings of the European Conference on Machine Learning, 1998 • Christopher D. Manning and Hinrich Schütze (1999), “Latent Semantic Indexing”, chapter 15.4 of Foundations of Statistical Natural Language Processing, MIT Press, 1999 • Michael Collins (1999), “Head-Driven Statistical Models for Natural Language Parsing”, PhD dissertation, University of Pennsylvania