
Chunking




  1. Chunking Pierre Bourreau Cristina España i Bonet LSI-UPC PLN-PTM

  2. Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion

  3. Introduction • What is chunking? • Identifying groups of contiguous words. • Ex: • He is the person you read about. • [NP He] [VP is] [NP the person] [NP you] [VP read] [PP about]. • A first step toward full parsing

  4. Introduction • Chunking task in CoNLL-2000 • Based on a previous POS tagging • Chunks: B/I/O-Chunk tags over 11 chunk types; counts over the 106,978 chunks: • ADJP (adjective phrase): 2,060 • ADVP (adverb phrase): 4,227 • CONJP (conjunction phrase): 56 • INTJ (interjection): 31 • LST (list marker): 10 • NP (noun phrase): 55,081 • PP (prepositional phrase): 21,281 • PRT (particles): 556 • SBAR (subordinated clause): 2,207 • UCP (unlike coordinated phrase): 2 • VP (verb phrase): 21,467
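To make the B/I/O encoding concrete, here is a minimal Python sketch (not from the slides) that decodes a BIO tag sequence back into typed chunk spans; it handles both B- starts and a bare I- tag opening a new chunk:

```python
def bio_to_chunks(tags):
    """Decode a BIO tag sequence into (type, start, end) spans (end exclusive)."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and ctype != tag[2:]):
            # A B- tag (or an I- tag with a new type) opens a chunk.
            if ctype is not None:
                chunks.append((ctype, start, i))
            start, ctype = i, tag[2:]
        elif tag == "O":
            if ctype is not None:
                chunks.append((ctype, start, i))
            start, ctype = None, None
    if ctype is not None:
        chunks.append((ctype, start, len(tags)))
    return chunks

# "He is the person you read about." from slide 3:
print(bio_to_chunks(["B-NP", "B-VP", "B-NP", "I-NP", "B-NP", "B-VP", "B-PP"]))
# [('NP', 0, 1), ('VP', 1, 2), ('NP', 2, 4), ('NP', 4, 5), ('VP', 5, 6), ('PP', 6, 7)]
```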

  5. Introduction • Corpus • Wall Street Journal (WSJ) • Training set: • Four sections: 15-18 • 211,727 tokens • Test set: • One section: 20 • 47,377 tokens

  6. Evaluation • Output file format: Word POS Real-Chunk Processed-Chunk • Ex: • Boeing NNP B-NP I-NP • 's POS B-NP B-NP • 747 CD I-NP I-NP • jetliners NNS I-NP I-NP • . . O O • Evaluation script reports precision, recall, and F1 score • F_β = (1+β²)·precision·recall / (β²·precision + recall), with β=1
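As a worked example (not on the slides): with β = 1 the formula reduces to the harmonic mean of precision and recall. A minimal sketch:

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score: the weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# With beta = 1 this is the plain harmonic mean, e.g.:
print(f_score(0.93, 0.94))  # ~0.9350
```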

  7. Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion

  8. Hidden Markov Models (HMM) • A bit of theory... • Find the most probable tag sequence T for a sentence W, given a vocabulary and the set of possible tags. By Bayes' theorem: T* = argmax_T P(T|W) = argmax_T P(W|T)·P(T) • All the states?! In practice a Markov assumption factors the model: a first-order HMM (bigrams) conditions each tag on the previous one; a second-order HMM (trigrams) conditions on the previous two.
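As an illustration of the decoding step, a minimal first-order Viterbi sketch in Python (the probability tables are assumed given, not the slides' model; a trigram tagger such as TnT extends the state to tag pairs):

```python
def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most probable state sequence for a first-order HMM, in log space."""
    UNK = -1e9  # log-prob floor for unseen emissions
    # best[t][s]: score of the best path ending in state s at position t
    best = [{s: log_init[s] + log_emit[s].get(obs[0], UNK) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = (best[t - 1][prev] + log_trans[prev][s]
                          + log_emit[s].get(obs[t], UNK))
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    return path[::-1]
```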

  9. HMM: Chunking • Common setting • Input sentence: words • Input/output tags: POS • Chunking with a tagger • Input sentence: POS tags • Input/output tags: chunks • Problem! The vocabulary (the POS tag set) is very small

  10. HMM: Chunking • Solution: Specialization • Input sentence: POS • Input/output tags: POS + Chunks • Improvement • Input sentence: Special words + POS • Input/output tags: Special words + POS + Chunks

  11. HMM: Chunking • In practice: modification of the input data (WSJ train and test). Each token (word, POS, chunk) is rewritten as an input/output pair: for ordinary words the input is the POS alone and the output is POS·chunk (e.g. Chancellor NNP O → input NNP, output NNP·O); for special words the word is kept in both (e.g. of IN B-PP → input of·IN, output of·IN·B-PP; has VBZ B-VP → input has·VBZ, output has·VBZ·B-VP). Example sentence: Chancellor of the Exchequer Nigel Lawson 's restated commitment to a firm monetary policy has helped to prevent a freefall in sterling over the past week .
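A minimal sketch of this specialization step (the special-word set below is a stand-in for the 409-word list of F. Pla mentioned on the next slides, which is not reproduced in the transcript):

```python
SPECIAL = {"of", "the", "to", "a", "has", "helped", "in", "over", "past"}  # stand-in list

def specialize(word, pos, chunk):
    """Rewrite a (word, POS, chunk) token as an (input, output) pair for the tagger."""
    if word.lower() in SPECIAL:
        return f"{word}·{pos}", f"{word}·{pos}·{chunk}"
    return pos, f"{pos}·{chunk}"

print(specialize("Chancellor", "NNP", "O"))  # ('NNP', 'NNP·O')
print(specialize("of", "IN", "B-PP"))        # ('of·IN', 'of·IN·B-PP')
```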

  12. HMM: Results • Tool: • TnT Tagger (Thorsten Brants) • Implements the Viterbi algorithm for second-order Markov models • Allows evaluating unigram, bigram, and trigram MMs

  13. HMM: Results • Configuration 1: • No special words, no POS • Trigrams • Default parameters • Results: • Far from the best CoNLL scores (F1 ≈ 94%)

  14. HMM: Results Trying to improve… • Configuration 2: • Lexical specialization (409 words, F. Pla) • Trigrams • Configuration 3: • Lexical specialization (409 words, F. Pla) • Bigrams (does it make any difference?)

  15. HMM: Results (results table not preserved in the transcript)

  16. HMM: Results • Comments: • Adding specialization information improves the total F1 by 7 points. • That is much more than the improvement from using trigrams instead of bigrams (~1%). • As before, NP and PP are the best-determined chunks. • Impressive improvement for PRT and SBAR (but small counts).

  17. HMM: Results • Importance of the training-set size: • Test: • Divide the training set into 7 parts (~17,000 tokens/part). • Compute the results, adding one part each time. • Conclusion: • Performance improves with the set size (see plot). • Limit? • Molina & Pla obtained F1 = 93.26% with 18 sections of the WSJ as training set.
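A sketch of this incremental-training experiment (the helpers `train_tagger` and `evaluate_f1` are placeholders standing in for TnT training and the CoNLL evaluation script):

```python
def learning_curve(parts, test_set, train_tagger, evaluate_f1):
    """F1 on the test set as training parts are added one at a time."""
    scores, train = [], []
    for part in parts:                 # e.g. 7 slices of the WSJ training set
        train.extend(part)
        model = train_tagger(train)    # placeholder: train (e.g. with TnT)
        scores.append(evaluate_f1(model, test_set))  # placeholder: CoNLL eval
    return scores
```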

  18. HMM: Results (plot: F1 vs. training-set size; not preserved in the transcript)

  19. Support Vector Machines (SVM) • A bit of theory… • Objective: • Maximize the minimum margin • Allow misclassifications, controlled by the C parameter
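The soft-margin objective behind these two bullets, written out (the standard formulation, not reproduced on the slides):

```latex
\min_{w,\,b,\,\xi} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0
```

A large C penalizes margin violations harshly (fewer misclassifications, narrower margin); a small C tolerates them (wider margin).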

  20. SVM • Tool: • SVMTool (Jesús Giménez & Lluís Màrquez) • Uses SVMLight (Thorsten Joachims) for learning. • Sequential tagger → chunking • No need to change the input data • Binarizes the multiclass problem to apply SVMs

  21. SVM • Features (model 0): (feature table not preserved in the transcript)

  22. SVM • Results • Default parameters, varying C and/or the tagging direction (LR/LRL) • Only very small variations with this configuration

  23. SVM • Best results: • F1 > 90% for the three main chunks. • Modest values for the others. • The main difference with HMM is in PP.

  24. Conditional Random Fields (CRF) • A bit of theory… • The idea extends HMMs and maximum-entropy models. • We no longer consider a chain but a graph G=(V,E), conditioned on the observation sequence variable X • Each node v represents a value Y_v of the output label sequence Y

  25. Conditional Random Fields • P(y|x) (Lafferty et al., 2001), where y is a label sequence and x an observation sequence: P(y|x) = (1/Z(x)) · exp( Σ_j Σ_i λ_j·t_j(y_{i−1}, y_i, x, i) + Σ_k Σ_i μ_k·s_k(y_i, x, i) ) • t_j is a transition feature function (over the previous and current labels and the observation sequence). • s_k is a state feature function (over the current label and the observation sequence). • The weights λ_j, μ_k are set during training; Z(x) normalizes over all label sequences.

  26. Conditional Random Fields (CRF) • CRF++ 0.45 • Developed by Taku Kudo (2nd at CoNLL-2000 with an SVM combination) • Parameters: • Features used: words and POS tags • We proposed three alternatives: • binary combinations of word+POS within a frame of size 2 • the above plus 3-ary combinations of POS • POS only within a frame of size 3 • Unigrams or bigrams: the score is computed for the current tag alone or for the pair of adjacent tags (see the template sketch below)
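For reference, CRF++ feature templates use the documented `%x[row,col]` macro (relative row offset, column index in the training file); a sketch of what a window-2 word+POS template along the lines described above could look like (the exact templates used are not on the slides):

```
# Unigram features: word (col 0) and POS (col 1) in a +/-2 window
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-2,1]
U06:%x[-1,1]
U07:%x[0,1]
U08:%x[1,1]
U09:%x[2,1]
# Word+POS combination at the current position
U10:%x[0,0]/%x[0,1]

# Bigram output feature: condition on the pair of adjacent chunk tags
B
```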

  27. Conditional Random Fields (CRF) • Results (table not preserved in the transcript)

  28. Conditional Random Fields (CRF) • Analysis: • Bigrams with the maximum feature set → 93.81% global F1 • The global F1 score does not depend much on the feature window, but on the bigram/unigram selection: tagging pairs of tokens gives more power than single-tag tagging • Occurrences: LST → 0, INTJ → 1, CONJP → 9 ⇒ identical results across configurations • PRT is the only chunk type that depends strongly on the feature window, and it works better with size-2 frames: tagging particles (e.g., out, around, in, …) relies on the surrounding context • Much the same holds for SBAR (e.g., than, rather, …)

  29. Conditional Random Fields (CRF) • How to improve the results: • Molina & Pla's specialization method? It should improve performance on SBAR and CONJP • Mixing the different methods?

  30. Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion

  31. Global Analysis

  32. Global Analysis • CRF outperforms HMM and SVM • HMM performs better than SVM, thanks to its use of context; particularly evident for SBAR and PRT. • HMM even outperforms CRF for CONJP! • HMM uses trigrams → better for expressions like “as well as” or “rather than” • HMM improves with Pla's specialization method

  33. Global Analysis • The CRF results are close to the CoNLL-2000 best results: • A finer analysis, per chunk type, is needed

  34. Global Analysis • Combining the three methods (see the voting sketch below):
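A minimal sketch of combining the three outputs by per-token majority vote (voting is what the conclusion's "voting" remark suggests; the tie-breaking policy here, backing off to the CRF, is an assumption):

```python
from collections import Counter

def majority_vote(hmm_tags, svm_tags, crf_tags):
    """Combine three chunk-tag sequences token by token by majority vote."""
    combined = []
    for votes in zip(hmm_tags, svm_tags, crf_tags):
        tag, count = Counter(votes).most_common(1)[0]
        # With three voters a 2-1 or 3-0 majority always exists unless all
        # three disagree; in that case back off to the CRF (best single system).
        combined.append(tag if count > 1 else votes[2])
    return combined
```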

  35. Global Analysis • Combining does not help for PRT, where the difference between HMM and CRF was big! • It helps… just a bit, on SBAR • The global results are better for CRF alone: 93.81 > 93.57

  36. Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion

  37. Conclusion • Without lexicalization, SVM performs much better than HMM • With lexical specialization, HMM performs better than SVM… and is much faster! • Only 3 systems for voting: too few, and the taggers make mistakes on the same POS tags.

  38. Conclusion • At a certain stage, it becomes hard to improve the results. • CRF proves efficient without any task-specific modification → how can we improve it? CRF with trigrams… but probably very slow. • Some finer comparisons with the CoNLL results?
