Kernels for Text Processing 류법모 Pum-Mo.Ryu@kaist.ac.kr
Contents • Kernels for Text Processing • Kernels for Text Categorization • Vector Space Kernel • Latent Semantic Kernel • Fisher Kernel • Kernel for Syntactic Parsing • Tree Kernel • Conclusion
Kernels for Text Processing (1) • A kernel is a function K: X × X → R satisfying the following properties • Symmetric: K(x, z) = K(z, x) • Positive definite • A function that calculates the inner product between mapped texts in a feature space is a kernel function • For any mapping φ: X → F, K(x, z) = ⟨φ(x), φ(z)⟩ is a kernel function • A kernel measures the similarity of input texts
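As a concrete illustration of the last point, here is a minimal sketch of a kernel built from an explicit feature map; the bag-of-words map `phi` and the example sentences are illustrative choices, not taken from the slides.

```python
from collections import Counter

def phi(text):
    """Feature map: a text becomes a sparse vector of word counts."""
    return Counter(text.lower().split())

def kernel(x, z):
    """K(x, z) = <phi(x), phi(z)>: inner product in the feature space."""
    fx, fz = phi(x), phi(z)
    return sum(fx[w] * fz[w] for w in fx if w in fz)

print(kernel("the cat sat on the mat", "the cat ate the fish"))  # -> 5
```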
Kernels for Text Processing (2) • Kernels can be applied to very rich feature spaces, provided the inner products can be computed • Text categorization, syntactic parsing • Speech recognition • DNA and protein sequences • A kernel includes • A function to generate all possible features for a given input • A function to assign values to the generated features • A function to compute the inner product between feature vectors
I. Kernels for Text Categorization 1. Text Categorization 2. Vector Space Kernel 3. Latent Semantic Kernel 4. Fisher Kernel
Text Categorization • The task of text categorization is assigning documents to one or more predefined categories based on their content. • [Figure: a text categorizer, trained on categorized training documents, makes yes/no decisions for new documents over candidate categories such as earn, acq, money-fx, and grain]
Representation of Documents • Usually, a document is represented as a series of feature-value pairs. The features can be arbitrarily abstract (as long as they are easily computable) or very simple. • For example, the features could be the set of all words and the values their numbers of occurrences in a particular document (bag-of-words model). • Example: the headline "Japan Firm Plans to Sell U.S. Farmland to Japanese" is represented as Farmland:1, Firm:1, Japan:1, Japanese:1, Plans:1, Sell:1, To:2, U.S.:1
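A small sketch of this bag-of-words representation, applied to the headline above; the tokenization details (simple whitespace splitting, case-insensitive counting, keeping "U.S." as one token) are assumptions.

```python
from collections import Counter

headline = "Japan Firm Plans to Sell U.S. Farmland to Japanese"
bag = Counter(w.lower() for w in headline.split())
print(bag)
# Counter({'to': 2, 'japan': 1, 'firm': 1, 'plans': 1, 'sell': 1,
#          'u.s.': 1, 'farmland': 1, 'japanese': 1})
```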
Similarity Measures • Most text categorization methods are based on the similarity of documents • Cosine similarity • Euclidean distance • Kullback-Leibler divergence: a distance between two probability distributions • Kernel functions
Performance Measures • Given n test documents, the following 2×2 contingency table can be considered: a = selected and in the target category, b = selected but not in the target, c = not selected but in the target, d = neither selected nor in the target. • Precision = a / (a + b): the proportion of selected documents that are correct. • Recall = a / (a + c): the proportion of target documents that the system selected. • There is a trade-off between precision and recall. The F-measure can be used for evaluation at fixed cutoffs if both recall and precision are important. • Other measures: accuracy, error, miss, false alarm (fallout), break-even point, …
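A short sketch of these measures computed from the contingency counts a, b, c (d is not needed for precision, recall, or F); the numeric example is made up.

```python
def precision(a, b):
    """Proportion of selected documents that are in the target category."""
    return a / (a + b)

def recall(a, c):
    """Proportion of target documents that were selected."""
    return a / (a + c)

def f_measure(a, b, c, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    p, r = precision(a, b), recall(a, c)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

print(precision(8, 2), recall(8, 4), round(f_measure(8, 2, 4), 3))
# 0.8  0.666...  0.727
```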
Vector Space Kernel - BVSM • A document is mapped to a vector d • Each entry records the number of times a particular word stem is used in the document • Tens of thousands of entries, extremely sparse • Basic Vector Space Model (BVSM) • K(d1,d2) = d1’d2 • Treats terms as uncorrelated
Vector Space Kernel - GVSM • Generalized Vector Space Model (GVSM) • A document is characterized by its relation to other documents • D = [d1, … dm] : term-by-document matrix • DD’ : term-by-term matrix • A nonzero (i, j) entry means one or more documents contain both the i-th and the j-th term • K(d1,d2) = (D’d1)’(D’d2) = d1’DD’d2 • D’di : di is represented by its relation to the other documents
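A toy sketch of the BVSM and GVSM kernels on a made-up term-by-document matrix D (rows = terms, columns = documents); the numbers are purely illustrative.

```python
import numpy as np

D = np.array([[1, 0, 2],
              [0, 1, 1],
              [1, 1, 0]])          # 3 terms x 3 documents
d1, d2 = D[:, 0], D[:, 1]          # two document vectors

k_bvsm = d1 @ d2                   # K(d1, d2) = d1' d2
k_gvsm = d1 @ D @ D.T @ d2         # K(d1, d2) = d1' D D' d2
print(k_bvsm, k_gvsm)
```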
Latent Semantic Indexing (1) • LSI, which is closely related to principal component analysis (PCA), captures semantic information through co-occurrence analysis in the corpus • Document feature vectors are projected into the subspace spanned by the first k singular vectors of the feature space • Singular Value Decomposition (SVD) of D • D = U∑V’ • U and V are orthogonal • ∑ is a diagonal matrix with the same dimensions as D
Latent Semantic Indexing (2) • Singular Value Decomposition of D • D = U∑V’ • The singular values are sorted in decreasing order. The highest k singular values are kept to reduce the dimensionality; k is chosen at the point where the singular values drop off sharply.
Latent Semantic Indexing (3) • B = ∑_{2×2} V’_{2×d} : the documents after rescaling with the singular values and reduction to two dimensions • B’B : the document correlation matrix
Latent Semantic Indexing (4) • A query q is projected into the reduced-dimensional space and compared with the existing documents.
Latent Semantic Kernel (1) • Term-by-document matrix • D = [d1, … dm] • Singular Value Decomposition of D • D = U∑V’ • Projection matrix • P = Uk’ = IkU’ • Ik: the identity matrix with only the first k diagonal elements nonzero • Latent Semantic Kernel • Project documents onto the first k dimensions and calculate the inner product between the projected documents • K(d1,d2) = (Pd1)’(Pd2) = d1’P’Pd2
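A minimal sketch of the latent semantic kernel: compute the SVD of a toy term-by-document matrix, keep the first k left singular vectors as the projection P, and take the inner product of projected documents. The matrix and the choice k = 2 are assumptions for illustration.

```python
import numpy as np

D = np.array([[1, 0, 2, 1],
              [0, 1, 1, 0],
              [1, 1, 0, 2],
              [0, 2, 1, 1]], dtype=float)   # terms x documents

U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
P = U[:, :k].T                               # projection P = Uk'

def lsk(da, db):
    """K(d1, d2) = (P d1)' (P d2) = d1' P' P d2."""
    return (P @ da) @ (P @ db)

print(lsk(D[:, 0], D[:, 1]))
```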
Latent Semantic Kernel (2) • An SVM classifier trained with an LSK can perform approximately as well as the baseline method even with only 200 dimensions
Fisher Kernel (1) • A kernel function derived from a generative probability model • Consider a family of generative models P(x|θ), smoothly parameterized by θ = (θ1, …, θr) • Fisher score • The gradient of the log-likelihood: Ux = ∇θ log P(x|θ) • Fisher information matrix • I = Ex[Ux Ux’] • Fisher kernel • K(x, y) = Ux’ I⁻¹ Uy
Fisher Kernel (2) • Memoryless information source • D = {d1, … , dN} : set of documents • W = {w1, … , wM} : set of words • A document di can be viewed as a probability distribution over word occurrences • A document di can be represented by the multinomial probability distribution P̂(wj|di) = n(di, wj) / ∑j’ n(di, wj’), which denotes the probability that a generic word occurrence in document di will be wj, where n(di, wj) is the number of times word wj occurs in document di
Fisher Kernel (3) • Latent class analysis • Observable variables: D, W • Unobserved class variable: Z = {z1, … , zK} • An unobserved class variable zk is associated with each observation, i.e. with each word occurrence (di, wj) • Joint probability model over D × W: P(di, wj) = ∑k P(zk) P(di|zk) P(wj|zk) • Conditional independence assumption • di and wj are independent conditioned on the state of the associated latent variable • Two possible parameters • P(zk), P(wj|zk)
Fisher Kernel (4) • Document model • P(wj|di) = ∑k P(wj|zk) P(zk|di) • Average log-probability of a document di • The log-probability of all the word occurrences in di, normalized by document length: l(di) = ∑j P̂(wj|di) log P(wj|di)
Fisher Kernel (5) • EM algorithm for model fitting • The EM algorithm iterates the following two steps to find parameters that maximize l(di) • An Expectation (E) step, where posterior probabilities P(zk|di, wj) are computed for the latent variables • A Maximization (M) step, where the parameters are updated from the expected counts; n(di, wj) denotes the number of occurrences of word wj in document di
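A compact sketch of one way to run EM for the aspect model with parameters P(z), P(d|z), P(w|z). The toy count matrix n, the random initialization, and the fixed number of iterations are assumptions; the update equations are the standard E and M steps for this model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = rng.integers(1, 5, size=(6, 8)).astype(float)   # n(d_i, w_j), toy counts
N, M, K = n.shape[0], n.shape[1], 3

Pz = np.full(K, 1.0 / K)                             # P(z_k)
Pd_z = rng.random((N, K)); Pd_z /= Pd_z.sum(axis=0)  # P(d_i | z_k)
Pw_z = rng.random((M, K)); Pw_z /= Pw_z.sum(axis=0)  # P(w_j | z_k)

for _ in range(50):
    # E-step: posterior P(z_k | d_i, w_j), shape (N, M, K)
    joint = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]
    post = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w)
    weighted = n[:, :, None] * post
    Pd_z = weighted.sum(axis=1); Pd_z /= Pd_z.sum(axis=0)
    Pw_z = weighted.sum(axis=0); Pw_z /= Pw_z.sum(axis=0)
    Pz = weighted.sum(axis=(0, 1)); Pz /= Pz.sum()

print(Pz)
```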
Fisher Kernel (6) • Two possible parameters: P(zk), P(wj|zk) • The Fisher information matrix can be approximated by the identity matrix. • Kernel when the parameters are P(zk) (following Hofmann 2000): K(di, dj) = ∑k P(zk|di) P(zk|dj) / P(zk) • Kernel when the parameters are P(wj|zk): K(di, dj) = ∑j P̂(wj|di) P̂(wj|dj) ∑k P(zk|di, wj) P(zk|dj, wj) / P(wj|zk)
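Continuing the EM sketch above (and reusing its n, Pz, Pd_z, and Pw_z arrays), here is a sketch of the two kernels. The formulas follow Hofmann (2000); the variable names are mine.

```python
import numpy as np

# Recompute the posterior P(z_k | d_i, w_j) from the fitted parameters
joint = Pz[None, None, :] * Pd_z[:, None, :] * Pw_z[None, :, :]
post = joint / joint.sum(axis=2, keepdims=True)

Phat = n / n.sum(axis=1, keepdims=True)          # empirical P^(w_j | d_i)
Pz_d = Pd_z * Pz[None, :]                        # proportional to P(z_k | d_i)
Pz_d /= Pz_d.sum(axis=1, keepdims=True)

def fisher_k1(i, j):
    """Kernel for parameters P(z_k): sum_k P(z_k|d_i) P(z_k|d_j) / P(z_k)."""
    return np.sum(Pz_d[i] * Pz_d[j] / Pz)

def fisher_k2(i, j):
    """Kernel for parameters P(w_j|z_k)."""
    inner = np.einsum('wk,wk,wk->w', post[i], post[j], 1.0 / Pw_z)
    return np.sum(Phat[i] * Phat[j] * inner)

print(fisher_k1(0, 1), fisher_k2(0, 1))
```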
Fisher Kernel (7) • Classification errors
II. Kernel for Syntactic Parsing 1. Syntactic Parsing 2. Tree Kernel
Syntactic Parsing • Syntactic parsing is the process of identifying the syntactic structure of sentences given a grammar • Building syntactic trees for input sentences • Grammars • Context Free Grammar (Phrase Structure Grammar) • Dependency Grammar • Example grammar (CFG) • S → N VP • VP → V NP • NP → N PP • N → chairman | Lou Gerstner | IBM • PP → P N • V → is • P → of • [Figure: parse tree for "Lou Gerstner is chairman of IBM"]
Syntactic Ambiguity • Many syntactic trees can be generated for an input sentence using a given grammar • Sometimes it is very difficult to resolve the ambiguities at the sentence level • Grammar rules alone don’t have enough information to resolve the ambiguities • [Figure: two parse trees t1 and t2 for "astronomers saw boys with telescope", differing in the attachment of the PP "with telescope"]
Probabilistic Context Free Grammar (PCFG) • A PCFG, which assigns probability values to grammar rules, can be used to resolve syntactic ambiguity (Collins 99) • Generate all possible syntactic trees for the input sentence • The score of a syntactic tree is calculated by multiplying the probability values of all grammar rules applied in the tree • Select the most probable (highest-scoring) tree for the sentence • CFG with probabilities: S → N VP 1.0, VP → V NP 0.3, VP → VP PP 0.7, … • [Figure: the candidate trees t1 and t2 annotated with rule probabilities; each tree score is the product of its rule probabilities]
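A sketch of scoring a tree under a PCFG by multiplying the probabilities of its rules. Only the three rule probabilities from the slide are used; rules not listed (including lexical rules) default to 1.0 here, which is an assumption made purely for illustration.

```python
rule_prob = {
    ("S", ("N", "VP")): 1.0,
    ("VP", ("V", "NP")): 0.3,
    ("VP", ("VP", "PP")): 0.7,
    # ... remaining rules of the grammar would be listed here
}

def tree_score(tree):
    """tree = (label, [children]) for internal nodes; a word is a plain string."""
    if isinstance(tree, str):                      # leaf: a word
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rule_prob.get((label, rhs), 1.0)           # unlisted rules default to 1.0
    for c in children:
        p *= tree_score(c)
    return p

# Toy sub-tree using VP -> VP PP over VP -> V NP: 0.7 * 0.3 = 0.21
t = ("VP", [("VP", [("V", ["saw"]), ("NP", [("N", ["boys"])])]),
            ("PP", [("P", ["with"]), ("N", ["telescope"])])])
print(tree_score(t))   # 0.21
```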
Linear Model for Parsing (1) • This model assigns scores to the candidate syntactic trees of an input sentence and selects the most probable tree • Training data: {(si, ti)}, where si is a sentence and ti is the correct tree for that sentence • Set of candidate trees for a particular sentence si • C(si) = {xi1, xi2, …}, with xi1 = ti • Each candidate xij is represented by a feature vector h(xij) in the space Rⁿ • Score of a tree xij: score(xij) = w · h(xij), where w is the model parameter vector • The output of the model on a training or test example s is F(s) = argmax over x in C(s) of w · h(x)
Linear Model for Parsing (2) • The score function should satisfy the condition that the correct tree scores higher than every other candidate: w · h(xi1) > w · h(xij) for all j ≥ 2 • The SVM can be formulated as a search for the αi,j that determine the optimal weights of the model parameter vector: w = ∑i,j αi,j (h(xi1) − h(xij)) • The score of a parse tree can then be calculated as score(x) = ∑i,j αi,j (K(xi1, x) − K(xij, x)), where K(x, y) = ⟨h(x), h(y)⟩ is the kernel function
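A sketch of the kernelized scoring in the dual form above: the weight vector is never built explicitly. The names alpha, train_pairs, and tree_kernel are assumptions; they stand for the learned dual weights, the (correct tree, candidate tree) pairs from training, and any tree kernel such as the one defined in the following slides.

```python
def score(x, alpha, train_pairs, tree_kernel):
    """score(x) = sum over (i, j) of alpha_ij * (K(x_i1, x) - K(x_ij, x))."""
    return sum(a * (tree_kernel(xi1, x) - tree_kernel(xij, x))
               for a, (xi1, xij) in zip(alpha, train_pairs))

def predict(candidates, alpha, train_pairs, tree_kernel):
    """Return the highest-scoring candidate tree for a sentence."""
    return max(candidates,
               key=lambda x: score(x, alpha, train_pairs, tree_kernel))
```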
Tree Kernels (1) • The tree kernel is a kind of convolution kernel • Recursive calculation over the ‘parts’ of a discrete structure • The string kernel is also a kind of convolution kernel • T : a parse tree • hs(T) : the number of times the s-th sub-tree occurs in T • A sub-tree must itself be built from complete grammar rules • [Figure: sub-trees of the NP covering "the man"]
Tree Kernels (2) • A tree T is represented as • h(T) = {h1(T), h2(T), … , hd(T)} • Kernel for trees • K(T1,T2) = ⟨h(T1), h(T2)⟩ = ∑s hs(T1) hs(T2) • Direct computation is intractable: there is an exponential number of sub-trees • [Figure: T1 and T2 share two sub-trees, so K(T1,T2) = (1×1) + (1×1) = 2]
Tree Kernels (3) • Efficient tree kernel • Is(n) : indicator function • 1 if sub-tree s is seen rooted at node n • 0 otherwise • N1, N2 : the sets of nodes in trees T1 and T2 • Then hs(T) = ∑n Is(n), so K(T1,T2) = ∑s hs(T1) hs(T2) = ∑n1 ∑n2 ∑s Is(n1) Is(n2) = ∑n1 ∑n2 C(n1, n2) • C(n1, n2) : the number of common sub-trees rooted at both n1 and n2
Tree Kernels (4) • C(n1, n2) : the number of common sub-trees rooted at both n1 and n2 • If the productions at n1 and n2 are different, then C(n1, n2) = 0 • Else if the productions at n1 and n2 are the same, and both n1 and n2 are pre-terminals, then C(n1, n2) = 1 • Else if the productions at n1 and n2 are the same and n1 and n2 are not pre-terminals, then C(n1, n2) = ∏ from i = 1 to nc(n1) of (1 + C(ch(n1, i), ch(n2, i))) • nc(n) : the number of non-terminals directly below n in the tree • ch(n, i) : the i-th child node of n • Pre-terminals are nodes directly above words in the surface string • The production at n is the grammar rule applied at n
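A minimal sketch of this recursion, assuming trees are written as (label, [children]) tuples with words as plain strings; the "the man" / "the woman" example echoes the sub-tree figure earlier but the encoding is mine.

```python
def label(n):
    return n if isinstance(n, str) else n[0]

def production(n):
    """The grammar rule applied at internal node n: (label, child labels)."""
    return (n[0], tuple(label(c) for c in n[1]))

def is_preterminal(n):
    """A pre-terminal is a node whose children are all words (strings)."""
    return all(isinstance(c, str) for c in n[1])

def C(n1, n2):
    """Number of common sub-trees rooted at both n1 and n2."""
    if production(n1) != production(n2):
        return 0
    if is_preterminal(n1):
        return 1
    result = 1
    for c1, c2 in zip(n1[1], n2[1]):
        if isinstance(c1, str):        # word child, already matched via the production
            continue
        result *= 1 + C(c1, c2)
    return result

def internal_nodes(t):
    """All internal (non-word) nodes of tree t."""
    if isinstance(t, str):
        return []
    nodes = [t]
    for c in t[1]:
        nodes.extend(internal_nodes(c))
    return nodes

def tree_kernel(t1, t2):
    """K(T1, T2) = sum over all node pairs of C(n1, n2)."""
    return sum(C(a, b) for a in internal_nodes(t1) for b in internal_nodes(t2))

t1 = ("NP", [("D", ["the"]), ("N", ["man"])])
t2 = ("NP", [("D", ["the"]), ("N", ["woman"])])
print(tree_kernel(t1, t2))   # -> 3 common sub-trees
```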
Tree Kernels (5) • Example of C(n1, n2) computations [figure omitted] • ⟨h(T1), h(T2)⟩ can be calculated in O(|N1| |N2|) time • Experimental results: LR/LP = labeled recall/precision [table omitted]
Conclusions • Kernels for text processing were introduced • Vector space kernel • Latent semantic kernel • Fisher kernel • Tree kernel • Kernels can be applied to very rich feature spaces, provided the inner products can be computed • Kernels can be applied to other text processing areas • POS tagging • Calculating word co-occurrence information • Word sense disambiguation • Word clustering
References • Thomas Hofmann (2000), “Learning the Similarity of Documents: An Information-Geometric Approach to Document Retrieval and Categorization”, in Advances in Neural Information Processing Systems 12, MIT Press, 2000 • Tommi S. Jaakkola and David Haussler (1998), “Exploiting Generative Models in Discriminative Classifiers”, in Advances in Neural Information Processing Systems 11, MIT Press, 1998 • Nello Cristianini et al. (2002), “Latent Semantic Kernels”, Journal of Intelligent Information Systems, vol. 18, no. 2 • Michael Collins and Nigel Duffy (2001), “Convolution Kernels for Natural Language”, NIPS 2001 • Huma Lodhi et al. (2002), “Text Classification using String Kernels”, Journal of Machine Learning Research, vol. 2, 2002 • Thorsten Joachims (1998), “Text Categorization with Support Vector Machines: Learning with Many Relevant Features”, in Proceedings of the European Conference on Machine Learning, 1998 • Christopher D. Manning and Hinrich Schütze (1999), “Latent Semantic Indexing”, chapter 15.4 of Foundations of Statistical Natural Language Processing, MIT Press, 1999 • Michael Collins (1999), “Head-Driven Statistical Models for Natural Language Parsing”, PhD dissertation, University of Pennsylvania