
SVM Anchored Learning and Model Tampering (a case study in Hebrew NP chunking)

Toward Better Understanding of Hebrew NP Chunks. Yoav Goldberg and Michael Elhadad, Ben Gurion University of the Negev, Israel.


Presentation Transcript


  1. SVM Anchored Learning and Model Tampering (a case study in Hebrew NP chunking) Toward Better Understanding of Hebrew NP Chunks Yoav Goldberg and Michael Elhadad Ben Gurion University of the Negev, Israel

  2. Lexicalization • Once upon a time: the number of features one could use in a model was quite low. • Enter discriminative models: we can now incorporate millions of features! • Adding lexical information as features helps accuracy in many tasks. • We add lexical information everywhere. Everyone is happy.

  3. Lexicalization What specific problems do these lexical features solve? Are all these features equally important? • Once upon a time: there was a limit on the number of features one could use in a model. • Enter discriminative models: we can now incorporate millions of features! • Adding lexical information as features is shown to help accuracy. • We add lexical information everywhere. Everyone is happy.

  4. Continuation of Our Previous Work • Hebrew NP Chunking (ACL 2006) • Define “Hebrew Simple NPs” (the traditional base-NP definition does not work for Hebrew) and derive them from a Hebrew Treebank. • SVM-based approach. • Morphological (construct and number) features help identify Simple NP chunks. • Lexical features are crucial for Hebrew.

  5. NLP-Machine Learning Workflow Find a Task → Get a Corpus → Annotate it → Represent it as an ML problem → Decide on features → Decide on a learning algorithm → Encode the features → Learn a model → Evaluate

  6. NLP-Machine Learning Workflow: (Hebrew) NP Chunking Find a Task → Get a Corpus → Annotate it (use the Treebank and derive the annotation from it) → Represent it as an ML problem (B-I-O tagging) → Decide on features → Decide on a learning algorithm (SVM, polynomial kernel) → Encode the features (binary feature vectors) → Learn a model (SVMLight, YAMCHA, …) → Evaluate

  7. NLP-Machine Learning Workflow: (Hebrew) NP Chunking Find a Task → Get a Corpus → Annotate it (use the Treebank and derive the annotation from it) → Represent it as an ML problem (B-I-O tagging) → Decide on features → Decide on a learning algorithm (SVM, polynomial kernel) → Encode the features (binary feature vectors) → Learn a model (SVMLight, YAMCHA, …) → Inspect the resulting model

  8. Workflow How important are specific features? What is hard to learn? Locate corpus errors. Is our task definition consistent? (Hebrew) NP Chunking: Find a Task → Get a Corpus → Annotate it (use the Treebank and derive the annotation from it) → Represent as an ML problem (B-I-O tagging) → Decide on features → Decide on a learning algorithm (SVM, polynomial kernel) → Encode features (binary feature vectors) → Learn a model (SVMLight, YAMCHA, …) → Inspect the resulting model
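As a minimal illustration of the B-I-O representation used in the workflow above (not the authors' code; the sentence and chunk spans are invented for the example), NP chunk spans can be converted to per-token tags like this:

```python
def spans_to_bio(n_tokens, chunk_spans):
    """Convert NP chunk spans [(start, end), ...] (end exclusive) into B-I-O tags."""
    tags = ["O"] * n_tokens
    for start, end in chunk_spans:
        tags[start] = "B"                     # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"                     # tokens inside the chunk
    return tags

# "The big dog barked at the cat" with NPs "The big dog" and "the cat"
print(spans_to_bio(7, [(0, 3), (5, 7)]))      # ['B', 'I', 'I', 'O', 'O', 'B', 'I']
```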

  9. Overview – SVM Learning • Binary supervised classifier • Input: labeled examples encoded as vectors (y_i ∈ {-1,+1}, x_i ∈ R^n), a kernel function K, and C • Magic (for this talk) • Output: weighted support vectors (a subset of the input vectors) • Decision function: f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )
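A minimal numeric sketch of that decision function (my own illustration, using a polynomial kernel of the kind mentioned in the workflow; all names are invented):

```python
import numpy as np

def poly_kernel(u, v, degree=2, coef0=1.0):
    """Polynomial kernel K(u, v) = (u . v + coef0)^degree."""
    return (np.dot(u, v) + coef0) ** degree

def svm_decision(x, support_vectors, alphas, labels, b):
    """f(x) = sign( sum_i alpha_i * y_i * K(x_i, x) + b )."""
    score = sum(a * y * poly_kernel(sv, x)
                for sv, a, y in zip(support_vectors, alphas, labels)) + b
    return 1 if score >= 0 else -1
```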

  10. Feature Vectors (1 iff w0 is ‘dog’, 1 iff w0 is ‘cat’, …, 1 iff p0 is VB, …, 1 iff p+2 is NN, …) Very high dimension, yet very sparse vectors. Features whose values are 0 in all the SVs do not (directly) affect the classification.
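A small sketch of how such binary feature templates can be encoded (my own illustration; the feature strings and dictionary are invented), mapping active features like 'w0=dog' or 'p+2=NN' to sparse indices:

```python
def encode(active_features, feature_index):
    """Return the sorted indices of the active features; everything else is
    implicitly 0, so the vector stays extremely sparse."""
    return sorted(feature_index[f] for f in active_features if f in feature_index)

feature_index = {"w0=dog": 0, "w0=cat": 1, "p0=VB": 2, "p+2=NN": 3}
print(encode(["w0=dog", "p0=VB", "p+2=NN", "w-1=the"], feature_index))  # [0, 2, 3]
```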

  11. Multiclass • SVM is a binary classifier; in order to do 3-class classification, 3 classifiers are learned: B/I, B/O, I/O
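A toy sketch of how those three pairwise decisions can be combined by voting (my own illustration; the lambda classifiers are stand-ins for the trained SVMs):

```python
from collections import Counter

def ovo_predict(x, pairwise_classifiers):
    """Each pairwise classifier returns one of its two labels; the label with
    the most votes across all pairs wins."""
    votes = Counter(clf(x) for clf in pairwise_classifiers.values())
    return votes.most_common(1)[0][0]

classifiers = {("B", "I"): lambda x: "B",
               ("B", "O"): lambda x: "B",
               ("I", "O"): lambda x: "I"}
print(ovo_predict("some encoded token", classifiers))  # 'B'
```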

  12. SVM Model • An SVM model is the collection of support vectors and their corresponding weights. • Many weighted vectors – but what does it mean? • Enter “Model Tampering”

  13. Model Tampering • Artificially force selected features in the support vectors to 0. • Evaluate the tampered model on a test set, to learn the importance of these features for the classification. • Note: this is not the same as learning without these features.
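A rough sketch of what one tampering step could look like (my own illustration, using scikit-learn's SVC as a stand-in for the SVMLight/YAMCHA models): zero the chosen feature columns in the stored support vectors and re-score the test set, without retraining. The same helper covers the later NoPOS and Loc tamperings, since they only change which columns are zeroed.

```python
import numpy as np
from sklearn.svm import SVC

def tampered_scores(clf, X_test, zero_feature_ids, degree=2, coef0=1.0, gamma=1.0):
    """Decision values after forcing the chosen feature columns to 0 inside the
    support vectors. Assumes a binary SVC trained with kernel='poly' and the
    same gamma/coef0/degree, so the manual kernel below matches it."""
    sv = clf.support_vectors_.copy()
    sv[:, zero_feature_ids] = 0.0                      # the tampering step
    K = (gamma * X_test @ sv.T + coef0) ** degree      # kernel against tampered SVs
    return K @ clf.dual_coef_[0] + clf.intercept_[0]   # dual_coef_ holds alpha_i * y_i

# Usage sketch (hypothetical data and feature ids):
# clf = SVC(kernel='poly', degree=2, coef0=1.0, gamma=1.0).fit(X_train, y_train)
# accuracy = np.mean(np.sign(tampered_scores(clf, X_test, lexical_ids)) == y_test)
```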

  14. Our Datasets • HEBGold • NP chunks corpus derived from HebTreebank, ~5000 sentences, perfect POS tags, chunk definition as in (Goldberg et al., 2006). • HEBErr • Same as HEBGold, with ~8% POS tag errors. • ENG • Ramshaw and Marcus’s NP chunks data, ~11,000 sentences, ~4% POS tag errors.

  15. Tamperings (1/3) • TopN – For each lexical feature, count the number of SVs where it is active. Keep only the top N lexical features according to this rank.

  16. Tamperings (1/3) Near top performance with only 1000 lexical features • TopN – For each lexical feature, count the number of SVs where it is active. Keep only the top N lexical features according to this rank.

  17. Tamperings (1/3) Near top performance with only 1000 lexical features • TopN – For each lexical feature, count the number of SVs where it is active. Keep only the top N lexical features according to this rank. With perfect POS tags, even 500 is more than enough (some words may hurt us)

  18. Tamperings (1/3) Near top performance with only 1000 lexical features • TopN – For each lexical feature, count the number of SVs where it is active. Keep only the top N lexical features according to this rank. With perfect POS tags, even 500 is more than enough (some words may hurt us) For Hebrew, 10 lexical features are REALLY important
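A minimal sketch of the TopN ranking (my own illustration; `support_vectors` is the dense SV matrix and `lexical_feature_ids` lists the columns that encode lexical features, both hypothetical names):

```python
import numpy as np

def top_n_lexical_features(support_vectors, lexical_feature_ids, n):
    """Rank lexical features by the number of support vectors in which they are
    active (non-zero) and return the top n; all other lexical feature columns
    would then be zeroed out in the support vectors."""
    counts = (support_vectors[:, lexical_feature_ids] != 0).sum(axis=0)
    order = np.argsort(-counts)
    return [lexical_feature_ids[i] for i in order[:n]]
```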

  19. Tamperings (2/3) • NoPOS – remove all lexical features corresponding to a given part-of-speech.

  20. Tamperings (2/3) • NoPOS – remove all lexical features corresponding to a given part-of-speech. Prepositions and punctuation are the most important.

  21. Tamperings (2/3) • NoPOS – remove all lexical features corresponding to a given part-of-speech. Prepositions and punctuation are the most important. Closed-class words are more important than open-class words.

  22. Tamperings (2/3) • NoPOS – remove all lexical features corresponding to a given part-of-speech. Prepositions and punctuation are the most important. Closed-class words are more important than open-class words. Adverbs are hard for the POS tagger.

  23. ISCOL Bonus The 4 most important Hebrew nouns were: % (the percent sign), כלל (‘rule’), ש"ח (NIS), דרך (‘way’) Tamperings • NoPOS – remove all lexical features corresponding to a given part-of-speech. Prepositions and punctuation are the most important. Closed-class words are more important than open-class words. Adverbs are hard for the POS tagger.

  24. Top10 Lexical Features in Hebrew • Start of Sentence Marker • Comma • Quote • of / של • and / ו • the / ה • in / ב

  25. Top10 Lexical Features in Hebrew • Start of Sentence Marker • Comma • Quote • of / של • and / ו • the / ה • in / ב של/of behaves differently from the other prepositions in Hebrew with respect to chunk boundaries.

  26. Top10 Lexical Features in Hebrew • Start of Sentence Marker • Comma • Quote • of / של • and / ו • the / ה • in / ב Quotes and commas are important. We know they are somewhat inconsistent in the Treebank. Goldberg et al. 2006 → normalize punctuation before evaluation. This work → normalize punctuation before learning, which improves the F-score by ~0.8 (10-fold CV). של/of behaves differently from the other prepositions in Hebrew with respect to chunk boundaries.

  27. Tamperings (3/3) • Loc=i – keep only lexical features with index i.

  28. Tamperings (3/3) • Loc=i – keep only lexical features with index i. • Loc≠i – keep only lexical features with index ≠ i.

  29. Lexical features at position 0 (current word) are the most important. Tamperings (3/3) • Loc=i – keep only lexical features with index i. • Loc≠i – keep only lexical features with index ≠ i.

  30. Lexical features at position 0 (current word) are the most important. Top0 tampering (removing all lexical features) yields somewhat better results (90.1). Tamperings (3/3) • Loc=i – keep only lexical features with index i. • Loc≠i – keep only lexical features with index ≠ i.

  31. Lexical features at position 0 (current word) are the most important. Top0 tampering (removing all lexical features) yields somewhat better results (90.1). Tamperings (3/3) • Loc=i – keep only lexical features with index i. • Loc≠i – keep only lexical features with index ≠ i. This is better than with all the features (93.79), yet learning without them to begin with is worse.

  32. Intuitively… • The SVM learner uses rare, irrelevant features (e.g., the word at location -2 is X and the POS at location +2 is Y) to memorize hard cases. • This rote learning helps generalization performance by focusing the learner on the “easy” cases… • …but overfits on the hard events.

  33. Anchored Learning • Add a unique feature (ai – an anchor) to each training sample (as many anchor features as there are samples). • The data becomes linearly separable. • Anchors “remove the burden” from the “real” features. • Anchors with high weights correspond to the “hard to learn” cases. • “Hard to learn” cases are either corpus errors or genuinely hard (both are interesting).
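A rough sketch of the anchoring idea (my own illustration, again using scikit-learn's SVC as a stand-in; ranking hard cases by the magnitude of each support vector's dual coefficient is an approximation of "anchors with high weights", since only example i ever activates anchor i):

```python
import numpy as np
from sklearn.svm import SVC

def add_anchors(X):
    """Append one unique anchor feature per training example: anchor column i
    is 1 only for example i, which makes the augmented data linearly separable."""
    return np.hstack([X, np.eye(X.shape[0])])

def hard_example_indices(clf, top_k=50):
    """Training examples whose anchors carry the largest dual weight; these are
    the candidate corpus errors / genuinely hard cases."""
    alphas = np.abs(clf.dual_coef_[0])           # |alpha_i * y_i| per support vector
    ranked = clf.support_[np.argsort(-alphas)]   # original training indices, hardest first
    return ranked[:top_k]

# Usage sketch (X_train, y_train hypothetical):
# clf = SVC(kernel='poly', degree=2, coef0=1.0).fit(add_anchors(X_train), y_train)
# print(hard_example_indices(clf))
```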

  34. Anchors vs. Previous Work on Corpus Error Detection Come To Prague • Most relevant work: • Boosting (Abney et al., 1998): “hard to learn examples in an AdaBoost model are candidate corpus errors” • AdaBoost models are easy to interpret. • SVM and AdaBoost models are different.

  35. Anchors vs. Previous Work on Corpus Error Detection Come To Prague • Nakagawa and Matsumoto (2002): • Support vectors with high αi values are “exceptional cases” • Look for similar examples with a different label to extract contrastive pairs. • Our method: • Finds the errors directly. • Has better recall. • Converges in reasonable time even when there are many errors. • Allows learning without “important” features.

  36. Anchored Learning Results (1/2) • Identified corpus errors with high precision. • Some of the corpus errors found were actually errors in the process of deriving chunks from the Hebrew Treebank. • Identified problematic aspects of the NP chunk definition used in previous work, triggering a revision of the definition. • Identified some hard cases (multi-word expressions, adverbial usage, conjunctions).

  37. ISCOL Bonus: Problems with Definition of NP Chunks [גוונים חמים] כמו [אדומים], [כתומים] ו [חומים] (“[warm shades] like [reds], [oranges] and [browns]”) Are these really NP chunks? Where are the nouns?

  38. ISCOL Bonus: Problems with Definition of NP Chunks The accusative marker את was included in the chunks → [את הממשלה, הכנסת, בית המשפט והתקשורת] (“[ACC the government, the Knesset, the court and the media]”)

  39. ISCOL Bonus: Problems with Definition of NP Chunks Some determiners can be very complex: [ו אולי אף יותר פעמים] (“[and perhaps even more times]”)

  40. ISCOL Bonus: Problems with Definition of NP Chunks של was considered unambiguous, but: [נשיא בית הדין] ל [משמעת] של [המשטרה] (“[the president of the tribunal] for [discipline] of [the police]”) The ל preposition is also interesting.

  41. ISCOL Bonus: Problems with Definition of NP Chunks Construct state (smixut) + של + ל/מ poses very hard problems for the definition of simple NP expressions in Hebrew. There is no time to go over this here, but I would be very glad to discuss it with you afterwards! של was considered unambiguous, but: [נשיא בית הדין] ל [משמעת] של [המשטרה] The ל preposition is also interesting.

  42. ISCOL Bonus: What’s hard in NP Chunking The prepositions של and מ. Conjunctions: מערכת ה עבודה ה שכר ו ה איגוד ה מקצועי (“the system of labor, wages and the trade union”). Some adverbs/adjectives: ה[אבדה] ל[משפחה] גדולה (“the [loss] to the [family] is great”). Multiword expressions (and prepositions): פה אחד (“unanimously”), בכל מקרה (“in any case”), בבת אחת (“all at once”), כך או כך (“either way”), לכל היותר (“at most”)…

  43. Anchored Learning (2/2) • Current-Word lexical features are the most important. • What are the contextual lexical features used for?

  44. Anchored Learning (2) • What are the contextual lexical features used for? • Learn 3 models: • Mfull – all lexical features • Mnear – without the w-2/w+2 features • Mno-cont – with only the w0 lexical feature • Compare the hard cases in the models, to find the role of the w-1/w+1 and w-2/w+2 features. [word-window diagrams for Mfull, Mnear and Mno-cont]

  45. Anchored Learning (2) • What are the contextual lexical features used for? • Learn 3 models: • Mfull – all lexical features • Mnear – without the w-2/w+2 features • Mno-cont – with only the w0 lexical feature • Compare the hard cases in the models, to find the role of the w-1/w+1 and w-2/w+2 features. (Anchors guarantee convergence of the learning process in reasonable time.) [word-window diagrams for Mfull, Mnear and Mno-cont]

  46. Anchored Learning (2) Hard cases: H(Mfull) ⊂ H(Mnear) ⊂ H(Mno-cont); the differences between these sets are the hard cases solved by Mnear and the hard cases solved by Mfull. • What are the contextual lexical features used for? • Learn 3 models: • Mfull – all lexical features • Mnear – without the w-2/w+2 features • Mno-cont – with only the w0 lexical feature • Compare the hard cases in the models, to find the role of the w-1/w+1 and w-2/w+2 features. (Anchors guarantee convergence of the learning process in reasonable time.) [word-window diagrams for Mfull, Mnear and Mno-cont]
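A small sketch of that comparison (my own illustration; the three hard-case index sets would come from the anchored models described above, and the attribution of each set difference to a feature group follows my reading of the slide):

```python
def contextual_feature_roles(hard_full, hard_near, hard_nocont):
    """Given the hard-case indices found via anchors for M_full, M_near and
    M_no-cont, attribute the newly solved cases to each layer of context."""
    solved_by_near_context = set(hard_nocont) - set(hard_near)  # fixed by adding w-1/w+1
    solved_by_far_context = set(hard_near) - set(hard_full)     # fixed by adding w-2/w+2
    return solved_by_near_context, solved_by_far_context
```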

  47. Qualitative Results • Contextual lexical features contribute mostly to disambiguating: • Conjunctions • Appositions • Attachment of Adverbs and Adjectives • Some multi-word expressions

  48. Quantitative Results

  49. Quantitative Results w-1/w+1 solves about 5 times more hard cases than w-2/w+2

  50. Quantitative Results w-1/w+1 solves about 5 times more hard cases than w-2/w+2 Contextual lexical features are very important for learning back-to-back NPs.
