Using DEDICOM for Completely Unsupervised Part-of-Speech Tagging. NAACL-HLT 2009 Workshop on Unsupervised and Minimally Supervised Learning of Lexical Semantics June 5, 2009 Peter A. Chew, Brett W. Bader Sandia National Laboratories Alla Rozovskaya University of Illinois, Urbana-Champaign.
Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under contract DE-AC04-94AL85000.
Outline • Previous approaches to part-of-speech (POS) tagging • The DEDICOM model • Testing framework • Preliminary results and discussion
Approaches to POS tagging (1) • Supervised • Rule-based (e.g. Harris 1962) • Dictionary + manually developed rules • Brittle – approach doesn’t port to new domains • Stochastic (e.g. Stolz et al. 1965, Church 1988) • Examples: HMMs, CRFs • Relies on estimation of emission and transition probabilities from a tagged training corpus • Again, difficulty in porting to new domains
Approaches to POS tagging (2) • Unsupervised • All approaches exploit distributional patterns • Singular Value Decomposition (SVD) of term-adjacency matrix (Schütze 1993, 1995) • Graph clustering (Biemann 2006) • Our approach: DEDICOM of term-adjacency matrix • Most similar to Schütze (1993, 1995) • Advantages: • can be reconciled to stochastic approaches • like SVD and graph clustering, completely unsupervised • initial results (to be shown) appear promising
Introduction to DEDICOM • DEcomposition into DIrectional COMponents • Harshman (1978) • A linear-algebraic decomposition method comparable to SVD • First used for analysis of marketing data
DEDICOM – an example (domain = shampoo marketing!) [Figure: original data matrix, reduced data matrix, and “loadings” matrix] • DEDICOM decomposes the 8 x 8 matrix into a simplified k x k “summary” (here k = 2), and a matrix showing the loadings for each phrase in each dimension • A key assumption is that stimulus and evoked phrases are a “single set of objects”
DEDICOM – algebraic details • Let X be the original data matrix • Let R be the reduced matrix of directional relationships • Let A be the “loadings” matrix • DEDICOM: $X \approx A R A^T$ • Compare to SVD: $X \approx U S V^T$ • U, V and A are all dense • But R is dense while S is diagonal, and $U \neq V$ • In SVD, U and V differ; in DEDICOM, A is repeated as $A^T$
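The algebra above can be illustrated with a minimal sketch in plain Python. The values of `A` and `R` are made up for illustration and are not from the slides; the point is only that a dense, asymmetric k x k matrix R, combined with a single loadings matrix A used on both sides, produces an asymmetric reconstruction of X.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(M):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]


# Hypothetical loadings matrix A: 4 objects, k = 2 dimensions.
A = [[1.0, 0.0],
     [1.0, 0.0],
     [0.0, 1.0],
     [0.0, 1.0]]

# Hypothetical k x k matrix R of directional relationships.
# It is dense and asymmetric, unlike the diagonal S of an SVD.
R = [[0.0, 3.0],
     [1.0, 0.0]]

# DEDICOM reconstruction: the same A appears on the left and (transposed) on the right.
X_hat = matmul(matmul(A, R), transpose(A))

for row in X_hat:
    print(row)
```

Because R is not symmetric, `X_hat[0][2]` and `X_hat[2][0]` differ, which is how the model can represent directional relationships among a single set of objects.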
DEDICOM – application to POS tagging [Figure: term adjacency matrix, ‘R’ matrix, and ‘A’ matrix] • The assumption that terms are a “single set of objects”, whether they precede or follow, sets DEDICOM apart from SVD and other unsupervised approaches • This assumption models the fact that tokens play the same syntactic role whether we view them as the first or second element in a bigram
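As a rough sketch of what a term adjacency matrix looks like, the following tabulates bigram counts from a toy token sequence. The sentence and the helper names `adjacency_counts` and `to_dense` are illustrative assumptions, not taken from the slides. Note that one vocabulary indexes both rows (first element of the bigram) and columns (second element), matching the “single set of objects” assumption.

```python
from collections import Counter


def adjacency_counts(tokens):
    """Count bigrams (w1, w2) in a token sequence; no tags are consulted."""
    return Counter(zip(tokens, tokens[1:]))


def to_dense(counts, vocab):
    """Lay the bigram counts out as a square matrix indexed by one shared vocabulary."""
    index = {term: i for i, term in enumerate(vocab)}
    X = [[0] * len(vocab) for _ in vocab]
    for (w1, w2), c in counts.items():
        X[index[w1]][index[w2]] = c
    return X


# Toy corpus, made up for illustration.
tokens = "the dog chased the cat and the cat chased the dog".split()
vocab = sorted(set(tokens))
X = to_dense(adjacency_counts(tokens), vocab)

print(vocab)
for term, row in zip(vocab, X):
    print(term, row)
```

A real corpus would call for a sparse representation, since most term pairs never occur as bigrams.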
Comparing DEDICOM output to HMM input [Figure: output of DEDICOM (‘R’ matrix, ‘A’ matrix) alongside input to HMM after normalization of counts (transition prob. matrix, emission prob. matrix)] • The output of DEDICOM is essentially a transition and emission probability matrix • DEDICOM offers the possibility of getting the familiar transition and emission probabilities without training data
Validation: method 1 (theoretical) • Hypothetical example: suppose a tagged training corpus exists [Figure: example corpus with X, a sparse matrix of bigram counts; A*, term-tag counts; R*, tag-adjacency counts] • By definition (subject to a difference of 1 for the final token): • rowsums of X = colsums of X = rowsums of A* • colsums of A* = rowsums of R* = colsums of R*
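The count identities can be checked on a small example. The tagged corpus below is invented for illustration, and the helpers `row_sums` and `col_sums` are hypothetical names. The sketch builds X, A* and R* as sparse count tables and reports where the row sums disagree, which should only be at the final token and its tag.

```python
from collections import Counter


def row_sums(counts):
    """Sum a sparse count table {(row, col): n} over columns."""
    sums = Counter()
    for (r, _), n in counts.items():
        sums[r] += n
    return sums


def col_sums(counts):
    """Sum a sparse count table {(row, col): n} over rows."""
    sums = Counter()
    for (_, c), n in counts.items():
        sums[c] += n
    return sums


# Toy tagged corpus, made up for illustration.
tagged = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
terms = [w for w, _ in tagged]
tags = [t for _, t in tagged]

X = Counter(zip(terms, terms[1:]))      # term-adjacency (bigram) counts
A_star = Counter(tagged)                # term-tag counts
R_star = Counter(zip(tags, tags[1:]))   # tag-adjacency counts

# rowsums of A* versus rowsums of X, per term.
term_gap = {w: row_sums(A_star)[w] - row_sums(X)[w] for w in set(terms)}
# colsums of A* versus rowsums of R*, per tag.
tag_gap = {g: col_sums(A_star)[g] - row_sums(R_star)[g] for g in set(tags)}

print(term_gap)
print(tag_gap)
```

In this toy corpus the only nonzero gaps belong to the final token `sleeps` and its tag `VBZ`, since the last token starts no bigram.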
Validation: method 1 (theoretical) • To turn A* and R* into transition and emission probability matrices, we simply multiply each by a diagonal matrix D whose entries are the inverses of the rowsum vector • But if the DEDICOM model is a good one, we should be able to compute the product $A^* D R^* D (A^*)^T$ to approximate the original matrix X • In this case, $A^* D R^* D (A^*)^T$ = [matrix shown on the original slide] • This not only approximates X, but also captures some syntactic regularities which aren’t instantiated in the corpus (this is one reason HMM-based POS tagging is successful)
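A small worked sketch of this product follows, under stated assumptions: the tagged corpus is invented, and D is taken to be the k x k diagonal matrix of inverse tag counts (the column sums of A*), which is the reading under which the matrix dimensions line up. It is an illustration of the idea, not the authors' computation.

```python
def matmul(P, Q):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


# Toy tagged corpus, made up for illustration.
tagged = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ"),
          ("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")]
terms = sorted({w for w, _ in tagged})
tags = sorted({t for _, t in tagged})
ti = {w: i for i, w in enumerate(terms)}
gi = {t: i for i, t in enumerate(tags)}

# A*: term-tag counts (terms x tags).
A_star = [[0.0] * len(tags) for _ in terms]
for w, t in tagged:
    A_star[ti[w]][gi[t]] += 1

# R*: tag-adjacency counts (tags x tags).
R_star = [[0.0] * len(tags) for _ in tags]
for (_, t1), (_, t2) in zip(tagged, tagged[1:]):
    R_star[gi[t1]][gi[t2]] += 1

# Assumption: D holds inverse tag counts on its diagonal.
tag_counts = [sum(col) for col in zip(*A_star)]
k = len(tags)
D = [[1.0 / tag_counts[i] if i == j else 0.0 for j in range(k)] for i in range(k)]

A_T = [list(col) for col in zip(*A_star)]
X_hat = matmul(matmul(matmul(matmul(A_star, D), R_star), D), A_T)

# An attested bigram and an unattested but syntactically plausible one.
print("the dog    ", X_hat[ti["the"]][ti["dog"]])
print("dog sleeps ", X_hat[ti["dog"]][ti["sleeps"]])
```

The bigram "dog sleeps" never occurs in the toy corpus, yet it receives a nonzero value in the reconstruction because a noun followed by a verb is attested elsewhere. This is the kind of generalization the slide refers to.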
Validation: method 2 (empirical) • Use a tagged corpus (CoNLL 2000) as gold standard • CoNLL 2000 has 19,440 distinct terms • There are 44 distinct tags in the tagset • Tabulate X matrix (solely from bigram frequencies, blind to tags) • Apply DEDICOM to ‘learn’ emission and transition probability matrices • Use these as input to an HMM; tag each token with a numerical index (one of the DEDICOM ‘dimensions’) • Evaluate by looking at correlation of induced tags with gold standard tags in a confusion matrix
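The tagging step can be sketched with a standard log-space Viterbi decoder. The slides do not say which decoding procedure was used, so Viterbi is an assumption here, as are the function name `viterbi`, the probability floor for unseen events, and the toy probability tables (which stand in for normalized ‘R’ and ‘A’ matrices). States are numerical indices, in the spirit of DEDICOM dimensions.

```python
import math


def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable state sequence for tokens under an HMM."""
    def lg(p):
        # Floor zero probabilities so the logarithm stays finite.
        return math.log(max(p, floor))

    # best[t][s] = (log-probability of the best path ending in s at time t, previous state)
    best = [{s: (lg(start_p[s]) + lg(emit_p[s].get(tokens[0], 0.0)), None)
             for s in states}]
    for tok in tokens[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            score, back = max((prev[r][0] + lg(trans_p[r][s]), r) for r in states)
            column[s] = (score + lg(emit_p[s].get(tok, 0.0)), back)
        best.append(column)

    # Follow back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


# Hypothetical probability tables for three induced classes (0, 1, 2).
states = [0, 1, 2]
start_p = {0: 0.8, 1: 0.1, 2: 0.1}
trans_p = {0: {0: 0.05, 1: 0.90, 2: 0.05},
           1: {0: 0.05, 1: 0.05, 2: 0.90},
           2: {0: 0.90, 1: 0.05, 2: 0.05}}
emit_p = {0: {"the": 1.0},
          1: {"dog": 0.5, "cat": 0.5},
          2: {"barks": 0.5, "sleeps": 0.5}}

induced = viterbi("the dog barks".split(), states, start_p, trans_p, emit_p)
print(induced)
```

The output is a list of class indices, one per token, which can then be compared against gold-standard tags.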
Validation: method 2 (empirical) • Examples of DEDICOM dimensions or clusters:
Validation: method 2 (empirical) • Confusion matrix: correlation with ‘ideal’ diagonal matrix = 0.494 • Ideally, the confusion matrix would have one DEDICOM class per ‘gold standard’ tag (either a diagonal matrix or some permutation thereof), although this assumes the gold standard is the optimal tagging scheme
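The slides do not state how the correlation with the ‘ideal’ diagonal matrix was computed. One plausible reading, sketched below purely as an assumption, is the Pearson correlation between the flattened confusion matrix and a diagonal matrix of the same shape, once induced classes have been put in an order that aligns them with gold tags. The function names, the toy tag sequences and the alignment supplied by hand are all hypothetical.

```python
import math


def confusion_matrix(induced, gold, induced_order, gold_order):
    """Tabulate counts with induced classes as rows and gold tags as columns."""
    row = {c: i for i, c in enumerate(induced_order)}
    col = {t: j for j, t in enumerate(gold_order)}
    M = [[0] * len(gold_order) for _ in induced_order]
    for c, t in zip(induced, gold):
        M[row[c]][col[t]] += 1
    return M


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def diagonal_correlation(M):
    """Correlate a confusion matrix with a same-shaped matrix of ones on the diagonal."""
    flat = [v for r in M for v in r]
    ideal = [1.0 if i == j else 0.0 for i, r in enumerate(M) for j, _ in enumerate(r)]
    return pearson(flat, ideal)


# Toy data, made up for illustration; the class-to-tag alignment is given by hand.
gold = ["DT", "NN", "VBZ", "DT", "NN", "VBZ"]
induced = [0, 1, 2, 0, 1, 1]
M = confusion_matrix(induced, gold, induced_order=[0, 1, 2],
                     gold_order=["DT", "NN", "VBZ"])

for r in M:
    print(r)
print(round(diagonal_correlation(M), 3))
```

With 44 gold tags and a different number of DEDICOM dimensions the matrix would not be square, so the alignment step matters; this sketch leaves that choice to the caller.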
Conclusions • DEDICOM, like other completely unsupervised POS-tagging methods, is hard to evaluate empirically • But we believe it holds promise because: • unlike other unsupervised approaches, it can be reconciled to stochastic approaches (like HMMs) which have a successful track record • unlike traditional stochastic approaches it is truly completely unsupervised • initial objective and subjective results do appear promising
Future work • We believe the key to evaluating DEDICOM, or other methods of POS tagging, is to do so within a larger system • For example, use DEDICOM to disambiguate tokens which are ambiguous w.r.t. part of speech • e.g. ‘claims’ (NN) versus ‘claims’ (VBZ) • Then use this, for example, within an information retrieval system to establish separate indices (rows in a term-by-document matrix) for disambiguated terms • Evaluate based on standard metrics such as precision; see if DEDICOM-based disambiguation results in improved precision
QUESTIONS? POINTS OF CONTACT: Brett W. Bader (bwbader@sandia.gov), Peter A. Chew (pchew@sandia.gov)