
Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network


Presentation Transcript


  1. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network
  Kristina Toutanova, Dan Klein, Christopher Manning, Yoram Singer
  Stanford University / The Hebrew University of Jerusalem

  2. Highlights
  • Just using P(t|w) works even better than you thought, given a better unknown word model
  • You can tag really well with no sequence model at all
  • Conditioning on BOTH left AND right tags yields the best published tagging performance
  • If you are using a maxent model:
    • Use proper smoothing
    • Consider more lexicalization
    • Use conjunctions of features

  3. Sequential Classifiers
  • Learn classifiers for local decisions: predict the tag of a word based on features we like (neighboring words, tags, etc.)
  • Combine the decisions of the classifiers using their output probabilities or scores and choose the best global tag sequence
  • When the dependencies are not cyclic and the classifier is probabilistic, this corresponds to a Bayesian Network (CMM)
  [Diagram: local classifier for t0 with neighboring tags t-1, t1 and words w-1, w0, w1]

  4. Experiments for Part-of-Speech Tagging
  • Data: WSJ sections 0-18 training, 19-21 development, 22-24 test
  • Log-linear models for local distributions
  • All features are binary and formed by instantiating templates, e.g. f1(h,t) = 1 iff w0 = "to" and t = TO (0 otherwise)
  • Separate feature templates targeted at unknown words: prefixes, suffixes, etc.
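
  A minimal sketch of how such binary templates might be instantiated over a history; the function name and the string encoding of features are illustrative assumptions, not from the paper:

```python
def instantiate_templates(words, i, tag):
    """Binary features for predicting `tag` at position i (illustrative sketch)."""
    feats = set()
    feats.add("w0=%s & t=%s" % (words[i], tag))          # current word, e.g. w0="to" and t=TO
    if i > 0:
        feats.add("w-1=%s & t=%s" % (words[i - 1], tag))  # previous word
    if i + 1 < len(words):
        feats.add("w+1=%s & t=%s" % (words[i + 1], tag))  # next word
    return feats

# The paper's example f1(h, t) = 1 iff w0 = "to" and t = TO corresponds to the
# member "w0=to & t=TO" of this set.
print(instantiate_templates(["will", "to", "fight"], 1, "TO"))
```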

  5. Tagging Without Sequence Information
  [Diagrams: Baseline model predicts t0 from w0 alone; Three Words model predicts t0 from w-1, w0, w1]
  • Using words only works significantly better than using the previous two or three tags!

  6. CMM Tagging Models - I
  • Independence assumptions of the left-to-right CMM:
    • ti is independent of t1…ti-2 and w1…wi-1 given ti-1
    • ti is independent of all following observations
  • Similar assumptions in the right-to-left CMM:
    • ti is independent of all preceding observations
  [Diagrams: left-to-right and right-to-left CMMs over tags t1, t2, t3 and words w1, w2, w3]
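
  For reference, a sketch of the factorizations these assumptions yield, in standard CMM notation (the paper's actual local distributions condition on richer local features):

```latex
P_{\text{L}\to\text{R}}(t_1,\ldots,t_n \mid w_1,\ldots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i-1}, w_i)
\qquad
P_{\text{R}\to\text{L}}(t_1,\ldots,t_n \mid w_1,\ldots,w_n) = \prod_{i=1}^{n} P(t_i \mid t_{i+1}, w_i)
```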

  7. CMM Tagging Models - II
  • The bad independence assumptions lead to label bias (Bottou 91, Lafferty 01) and observation bias (Klein & Manning 02)
  • Example: "will to fight", with possible tags will {MD, NN}, to {TO}, fight {NN, VB, VBP}
  • "will" will be mis-tagged as MD, because MD is its most common tagging:
    P(t1=MD, t2=TO | will, to) = P(MD | will, sos) * P(TO | to, MD) = P(MD | will, sos) * 1
  [Diagram: left-to-right CMM over "will to fight" with t2 fixed to TO]

  8. CMM Tagging Models - III
  • Same example: will {MD, NN}, to {TO}, fight {NN, VB, VBP}
  • In the right-to-left CMM, "fight" will most likely be mis-tagged as NN:
    P(t2=TO, t3=NN | to, fight) = P(NN | fight, X) * P(TO | to, NN) = P(NN | fight, X) * 1
  [Diagram: right-to-left CMM over "will to fight" with t2 fixed to TO]

  9. Dependency Networks
  • Conditioning on both left and right tags fixes the problem
  [Diagram: tags t1, TO, t3 over "will to fight", with each tag depending on both neighboring tags]

  10. Dependency Networks
  • We do not attempt to construct a joint distribution; we classify to the highest-scoring sequence
  • An efficient dynamic programming algorithm, similar to Viterbi, exists for finding the highest-scoring sequence
  [Diagram: two-word dependency network with tags t1, t2 over words w1, w2]

  11. Inference for Linear Dependency Networks
  [Diagram: linear dependency network with tags ti-1, ti, ti+1, ti+2 over words wi-1, wi, wi+1, wi+2]
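
  A minimal sketch of the Viterbi-style dynamic program mentioned above, assuming the sequence score is the product of local conditionals P(ti | ti-1, ti+1, words, i) supplied by a `local_score` callback; the function name and the pairwise-state formulation are illustrative choices, not taken from the slides:

```python
def best_sequence(words, tagset, local_score):
    """Viterbi-style DP for a linear dependency network (illustrative sketch).

    local_score(prev_tag, tag, next_tag, i) should return the log of the local
    conditional P(t_i | t_{i-1}, t_{i+1}, words, i).  Because every factor
    touches three consecutive tags, the DP state is the pair of adjacent tags
    (t_i, t_{i+1}); each factor is added once, when all three tags are known.
    """
    n = len(words)
    BOS, EOS = "<s>", "</s>"

    # charts[k] maps a state (t_{k-1}, t_k) to (best prefix log-score, backpointer state)
    charts = [{(BOS, t): (0.0, None) for t in tagset}]
    for i in range(n):                         # add the factor for position i
        next_tags = tagset if i + 1 < n else [EOS]
        chart = {}
        for (prev, cur), (score, _) in charts[-1].items():
            for nxt in next_tags:
                s = score + local_score(prev, cur, nxt, i)
                if (cur, nxt) not in chart or s > chart[(cur, nxt)][0]:
                    chart[(cur, nxt)] = (s, (prev, cur))
        charts.append(chart)

    # Trace back from the best final state (t_{n-1}, EOS).
    state = max(charts[-1], key=lambda st: charts[-1][st][0])
    tags = []
    for chart in reversed(charts[1:]):
        tags.append(state[0])
        state = chart[state][1]
    return list(reversed(tags))

# Runtime is O(n * |tagset|^3): each of the |tagset|^2 states per position is
# extended by |tagset| choices of the next tag.
```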

  12. Using Tags: Left Context is Better
  [Diagrams: Baseline predicts t0 from w0; Model L predicts t0 from t-1, w0; Model R predicts t0 from t1, w0]
  • Model L has a 13.4% error reduction relative to Model R

  13. Centered Context is Better
  [Diagrams: Model L+L2 predicts t0 from t-2, t-1, w0; Model R+R2 predicts t0 from t1, t2, w0; Model L+R predicts t0 from t-1, t1, w0]
  • Model L+R has a 13.2% error reduction relative to Model L+L2

  14. Centered Context is Better in the End
  [Diagrams: Model L+LL+LLL predicts t0 from t-3, t-2, t-1, w0; Model L+LL+LR+R+RR predicts t0 from t-2, t-1, t1, t2, w0]
  • 15% error reduction from including the tags of words to the right

  15. Lexicalization and More Unknown Word Features
  [Diagram: Model L+LL+LR+R+RR+3W predicts t0 from t-2, t-1, t1, t2 and the words w-1, w0, w+1]

  16. Final Test Results
  • 4.4% error reduction, statistically significant
  • Comparison to the best published results (Collins 02)

  17. Unknown Word Features
  • Because we use a conditional model, it is easy to define complex features of the words
  • A crude company name detector (see the sketch below): the feature is on if the word is capitalized and followed by a company-name suffix like Co. or Inc. within 3 words
  • Conjunctions of character-level features: capitalized, contains digit, contains dash, all capitalized, etc. (e.g. CFC-12, F/A-18)
  • Prefixes and suffixes up to length 10
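
  A minimal sketch of these character-level and company-name features; the function name and the exact suffix list are illustrative assumptions, not taken from the paper:

```python
COMPANY_SUFFIXES = {"Co.", "Corp.", "Inc.", "Ltd."}   # illustrative list

def unknown_word_features(words, i, max_affix=10, window=3):
    """Character-level and contextual features for a (possibly unknown) word (sketch)."""
    w = words[i]
    feats = set()

    # Crude company-name detector: capitalized word followed by a company
    # suffix within the next `window` words.
    if w[:1].isupper() and any(x in COMPANY_SUFFIXES for x in words[i + 1:i + 1 + window]):
        feats.add("company")

    # Conjunction of character-level properties (e.g. CFC-12, F/A-18).
    shape = []
    if w[:1].isupper():
        shape.append("cap")
    if any(c.isdigit() for c in w):
        shape.append("digit")
    if "-" in w:
        shape.append("dash")
    if w.isupper():
        shape.append("allcap")
    if shape:
        feats.add("shape=" + "+".join(shape))

    # Prefixes and suffixes up to length `max_affix`.
    for k in range(1, min(max_affix, len(w)) + 1):
        feats.add("pre=" + w[:k])
        feats.add("suf=" + w[-k:])
    return feats

print(sorted(unknown_word_features(["Shares", "of", "Acme", "Inc.", "rose"], 2)))
```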

  18. Regularization Helps a Lot
  • Higher accuracy, faster convergence, and more features can be added before overfitting
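
  A sketch of the Gaussian-smoothed (quadratically penalized) conditional log-likelihood this refers to, with weights $\lambda_j$ and smoothing parameter $\sigma$; the notation is assumed, not copied from the slides:

```latex
\ell(\lambda) = \sum_{i} \log P_{\lambda}(t_i \mid h_i) - \sum_{j} \frac{\lambda_j^{2}}{2\sigma^{2}}
```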

  19. Regularization Helps a Lot
  [Charts: accuracy with and without Gaussian smoothing; effect of reducing feature support cutoffs in smoothed and un-smoothed models]

  20. Semantics of Dependency Networks
  • Let X = (X1, …, Xn). A dependency network for X is a pair (G, P) where G is a (possibly cyclic) dependency graph and P is a set of probability distributions.
  • Each node in G corresponds to a variable Xi, and the parents Pa(Xi) of Xi are all nodes such that P(Xi | X1, …, Xi-1, Xi+1, …, Xn) = P(Xi | Pa(Xi))
  • The distributions in P are the local probability distributions p(Xi | Pa(Xi)). If there exists a joint distribution P(X) from which the conditionals in P are derivable, the dependency network is called consistent
  • For positive distributions P, the joint distribution P(X) can be obtained by Gibbs sampling
  • Hofmann and Tresp (1997), Heckerman (2000)

  21. Dependency Networks - Problems
  • The dependency network probabilities learned from data may be inconsistent: there may not be a joint distribution having these conditionals
  • Even if they define a consistent network, the scoring criterion is susceptible to mutually re-enforcing but unlikely sequences
  • Example: two positions a and b, with observed sequences <11, 11, 12, 33>
  • The most likely state is <11>, but Score(11) = 2/3 * 1 = 2/3 and Score(33) = 1 (verified in the sketch below)
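
  A tiny sketch that recomputes the scores from the four observed (a, b) pairs above, assuming the sequence score is the product of the two learned conditionals P(a|b) and P(b|a); this is a worked check, not code from the paper:

```python
from collections import Counter

# Observed (a, b) sequences: 11, 11, 12, 33
obs = [(1, 1), (1, 1), (1, 2), (3, 3)]
pair = Counter(obs)
a_count = Counter(a for a, _ in obs)
b_count = Counter(b for _, b in obs)

def score(a, b):
    """Dependency-network sequence score: P(a | b) * P(b | a), estimated from the data."""
    p_a_given_b = pair[(a, b)] / b_count[b]
    p_b_given_a = pair[(a, b)] / a_count[a]
    return p_a_given_b * p_b_given_a

print(score(1, 1))   # 1 * 2/3 = 0.666..., even though <11> is the most frequent sequence
print(score(3, 3))   # 1 * 1 = 1.0, so the rare sequence <33> wins under this criterion
```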

  22. Conclusions
  • The use of dependency networks was very helpful for tagging
    • Both left and right words and tags are used for prediction, avoiding bad independence assumptions
    • In training and test, the time/space complexity is the same as for CMMs
    • Promising for other NLP sequence tasks
  • More predictive features for tagging
    • Rich lexicalization further improved accuracy
    • Conjunctions of feature templates
    • Smoothing is critical
