Chunking. Pierre Bourreau Cristina España i Bonet LSI-UPC PLN-PTM. Plan . Introduction Methods HMM SVM CRF Global analysis Conclusion. Introduction. What is chunking? Identifying groups of contiguous words. Ex: He is the person you read about.
Chunking Pierre Bourreau Cristina España i Bonet LSI-UPC PLN-PTM
Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion
Introduction • What is chunking? • Identifying groups of contiguous words. • Ex: • He is the person you read about. • [NP He] [VP is] [NP the person] [NP you] [VP read] [PP about]. • First step to full parsing
Introduction • Chunking task in CoNLL • Based on a previous POS tagging • Chunks: • B/I/O-Chunk • Chunk types and number of occurrences (out of 106,978 chunks): • ADJP (adjective phrase): 2,060 • ADVP (adverb phrase): 4,227 • CONJP (conjunction phrase): 56 • INTJ (interjection): 31 • LST (list marker): 10 • NP (noun phrase): 55,081 • PP (prepositional phrase): 21,281 • PRT (particles): 556 • SBAR (subordinate clause): 2,207 • UCP (unlike coordinated phrase): 2 • VP (verb phrase): 21,467
Introduction • Corpus • Wall Street Journal (WSJ) • Training set: • Four sections: 15-18 • 211,727 tokens • Test set: • One section: 20 • 47,377 tokens
Evaluation • Output file format • Word POS Real-Chunk Processed-Chunk • Ex: • Boeing NNP B-NP I-NP • 's POS B-NP B-NP • 747 CD I-NP I-NP • jetliners NNS I-NP I-NP • . . O O • Evaluation script: precision, recall, F1 score • $F_\beta = \frac{(1+\beta^2)\cdot\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{precision}+\mathrm{recall}}$ where β=1
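As an illustration of the evaluation, here is a minimal Python sketch that extracts chunks from B/I/O tag sequences and computes precision, recall and the F-score at chunk level. It is not the CoNLL evaluation script itself: the function names are hypothetical, and treating an I- tag that does not continue a chunk as a chunk start is an assumption of this sketch.

```python
def extract_chunks(tags):
    """Return the set of (start, end, type) chunks encoded by a BIO tag sequence."""
    chunks = set()
    start, ctype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes the last chunk
        prefix, _, label = tag.partition("-")
        # A chunk ends at O, at any B-, or at an I- whose type differs.
        if start is not None and (prefix != "I" or label != ctype):
            chunks.add((start, i, ctype))
            start, ctype = None, None
        # A chunk starts at B-, or at an I- that does not continue a chunk (assumption).
        if prefix in ("B", "I") and start is None:
            start, ctype = i, label
    return chunks


def precision_recall_f(gold, pred, beta=1.0):
    """Chunk-level precision, recall and F-beta for one pair of tag sequences."""
    g, p = extract_chunks(gold), extract_chunks(pred)
    correct = len(g & p)  # same boundaries and same type
    precision = correct / len(p) if p else 0.0
    recall = correct / len(g) if g else 0.0
    denom = beta ** 2 * precision + recall
    f = (1 + beta ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Toy example (made-up tags, not taken from the corpus)
gold = ["B-NP", "I-NP", "B-VP", "B-NP", "O"]
pred = ["B-NP", "I-NP", "B-VP", "I-VP", "O"]
precision, recall, f1 = precision_recall_f(gold, pred)
```

In the toy example only the first NP chunk matches exactly, so one of two predicted chunks and one of three gold chunks are correct.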
Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion
Hidden Markov Models (HMM) • A bit of theory... • Find the most probable tags for a sentence (I), given a vocabulary and the set of possible tags • Bayes theorem • First order: bigrams • Second order: trigrams • All the states?!
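The equations of this slide did not survive extraction. As a hedged reconstruction, the standard HMM tagging formulation reads as follows, where $T = t_1 \dots t_n$ is the tag sequence and $w_i$ the i-th input symbol of the sentence $I$ (these symbol names are assumptions, not taken from the slide):

```latex
\begin{align*}
\hat{T} &= \arg\max_{T} P(T \mid I)
         = \arg\max_{T} \frac{P(I \mid T)\,P(T)}{P(I)}
         = \arg\max_{T} P(I \mid T)\,P(T)
         && \text{(Bayes theorem)} \\
P(T) &\approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
         && \text{(first order, bigrams)} \\
P(T) &\approx \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1})
         && \text{(second order, trigrams)} \\
P(I \mid T) &\approx \prod_{i=1}^{n} P(w_i \mid t_i)
\end{align*}
```

Enumerating all tag sequences for the argmax is exponential in the sentence length, which is why dynamic programming (the Viterbi algorithm) is used in practice.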
HMM: Chunking • Common setting • Input sentence: words • Input/output tags: POS • Chunking • Input sentence: POS • Input/output tags: Chunks • Problem! • Small vocabulary
HMM: Chunking • Solution: Specialization • Input sentence: POS • Input/output tags: POS + Chunks • Improvement • Input sentence: Special words + POS • Input/output tags: Special words + POS + Chunks
HMM: Chunking • In practice: Modification of the input data (WSJ train and test)
Original (word POS chunk) → specialized (input symbol, output tag):
Chancellor NNP O → NNP NNP·O
of IN B-PP → of·IN of·IN·B-PP
the DT B-NP → the·DT the·DT·B-NP
Exchequer NNP I-NP → NNP NNP·I-NP
Nigel NNP B-NP → NNP NNP·B-NP
Lawson NNP I-NP → NNP NNP·I-NP
's POS B-NP → POS POS·B-NP
restated VBN I-NP → VBN VBN·I-NP
commitment NN I-NP → NN NN·I-NP
to TO B-PP → to·TO to·TO·B-PP
a DT B-NP → a·DT a·DT·B-NP
firm NN I-NP → NN NN·I-NP
monetary JJ I-NP → JJ JJ·I-NP
policy NN I-NP → NN NN·I-NP
has VBZ B-VP → has·VBZ has·VBZ·B-VP
helped VBN I-VP → helped·VBN helped·VBN·I-VP
to TO I-VP → to·TO to·TO·I-VP
prevent VB I-VP → VB VB·I-VP
a DT B-NP → a·DT a·DT·B-NP
freefall NN I-NP → NN NN·I-NP
in IN B-PP → in·IN in·IN·B-PP
sterling NN B-NP → NN NN·B-NP
over IN B-PP → over·IN over·IN·B-PP
the DT B-NP → the·DT the·DT·B-NP
past JJ I-NP → past·JJ past·JJ·I-NP
week NN I-NP → NN NN·I-NP
. . O → . .·O
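The transformation of the input data can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the set of special words below is inferred from the example sentence, not the actual list used in the experiments.

```python
SEP = "·"  # separator used in the specialized symbols of the example

# Hypothetical set of special words, read off the example sentence only
SPECIAL_WORDS = {"of", "the", "to", "a", "has", "helped", "in", "over", "past"}


def specialize(word, pos, chunk, special_words=SPECIAL_WORDS):
    """Map one (word, POS, chunk) triple to an (input symbol, output tag) pair."""
    # Special words keep their lexical form; all other words are reduced to their POS.
    symbol = f"{word}{SEP}{pos}" if word in special_words else pos
    return symbol, f"{symbol}{SEP}{chunk}"


def despecialize(output_tag):
    """Recover the plain chunk label from a specialized output tag."""
    return output_tag.rsplit(SEP, 1)[-1]


pairs = [specialize(*t) for t in [("Chancellor", "NNP", "O"), ("of", "IN", "B-PP")]]
```

After tagging, `despecialize` maps the predicted tags back to plain chunk labels so that the output can be scored against the reference chunks.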
HMM: Results • Tool: • TnT Tagger (Thorsten Brants) • Implements the Viterbi algorithm for second-order Markov models • Allows evaluating unigram, bigram and trigram Markov models
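To make the decoding step concrete, here is a minimal Viterbi sketch in Python. It is not TnT's implementation: it uses a first-order model for brevity (TnT works with second-order models), it assumes a non-empty input, and the toy probabilities are invented for illustration.

```python
import math

NEG_INF = float("-inf")


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Most probable state sequence under a first-order HMM (log probabilities)."""
    # best[i][s]: score of the best path that ends in state s at position i
    best = [{s: log_start.get(s, NEG_INF) + log_emit[s].get(observations[0], NEG_INF)
             for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: best[-1][r] + log_trans[r].get(s, NEG_INF))
            scores[s] = (best[-1][prev] + log_trans[prev].get(s, NEG_INF)
                         + log_emit[s].get(obs, NEG_INF))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


def logs(probs):
    """Convert a dict of probabilities to log probabilities."""
    return {k: math.log(v) for k, v in probs.items()}


# Toy model with invented numbers: chunk tags as states, POS tags as observations
states = ["B-NP", "I-NP"]
log_start = logs({"B-NP": 0.9, "I-NP": 0.1})
log_trans = {"B-NP": logs({"B-NP": 0.3, "I-NP": 0.7}),
             "I-NP": logs({"B-NP": 0.5, "I-NP": 0.5})}
log_emit = {"B-NP": logs({"DT": 0.8, "NN": 0.2}),
            "I-NP": logs({"DT": 0.1, "NN": 0.9})}
path = viterbi(["DT", "NN"], states, log_start, log_trans, log_emit)
```

Working in log space avoids numerical underflow on long sentences; a second-order version would keep pairs of states in the dynamic programming table.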
HMM: Results • Configuration 1: • No special words, no POS • Trigrams • Default parameters • Results: • Far from the best scores (F1 ≈ 94%)
HMM: Results • Trying to improve… • Configuration 2: • Lexical specialization (409 words, F. Pla) • Trigrams • Configuration 3: • Lexical specialization (409 words, F. Pla) • Bigrams (does it make any difference?)
HMM: Results • Comments: • Adding specialization information improves the total F1 by 7 points. • That’s much more than the improvement from using trigrams instead of bigrams (~1%). • As before, NP and PP are the best determined chunks. • Impressive improvement for PRT and SBAR (but few occurrences).
HMM: Results • Importance of the training set size: • Test: • Divide the training set into 7 parts (~17000 tokens/part). • Calculate the results adding one part each time. • Conclusion: • Performance improves with the set size (see plot). • Limit? • Molina & Pla got an F1=93.26% with 18 sections of WSJ as training set.
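The incremental protocol above can be sketched as follows. This is a hypothetical helper, not the script used for the experiments, and it assumes the training data is held as a list of sentences split into near-equal parts.

```python
def cumulative_subsets(sentences, n_parts=7):
    """Yield training sets made of the first 1, 2, ..., n_parts parts."""
    size, extra = divmod(len(sentences), n_parts)
    end = 0
    for k in range(n_parts):
        # The first `extra` parts get one more sentence so that nothing is dropped.
        end += size + (1 if k < extra else 0)
        yield sentences[:end]


# Toy data: 15 dummy "sentences"
subsets = list(cumulative_subsets(list(range(15))))
sizes = [len(s) for s in subsets]
```

Each subset would then be used to train a model that is evaluated on the same fixed test set, giving one point of the learning curve per subset.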
Support Vector Machines (SVM) • A bit of theory… • Objective: • Maximize the minimum margin • Allow misclassifications • Controlled by the C parameter
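For reference, the standard soft-margin formulation behind these bullets can be written as below. The notation ($w$, $b$, slack variables $\xi_i$, training pairs $(x_i, y_i)$ with $y_i \in \{-1, +1\}$) is the usual textbook one and is not taken from the slides.

```latex
\begin{align*}
\min_{w,\,b,\,\xi}\quad & \tfrac{1}{2}\,\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i \\
\text{subject to}\quad & y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
\end{align*}
```

Minimizing $\lVert w \rVert$ maximizes the margin, the slack variables $\xi_i$ allow individual points to violate it, and a larger $C$ penalizes such violations more heavily.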
SVM • Tool: • SVMTool (Jesús Giménez & Lluís Màrquez) • Uses SVMLight (Thorsten Joachims) for learning. • Sequential tagger, used here for chunking • No need to change the input data • Binarizes the problem to apply SVMs
SVM • Features (model 0):
SVM • Results • Default parameters, vary C and/or direction (LR/LRL) • Very small variations with this configuration
SVM • Best results: • F1 > 90% for the three main chunks. • Modest values for the others. • Main difference with HMM in PP.
Conditional Random Fields (CRF) • A bit of theory… • Idea based on extension of HMM and Maximum-Entropy Models. • We don’t consider a chain but a graph G=(V,E) • Conditioned on X, observation sequence variable • Each node represents a value Yv of Y (output label)
Conditional Random Fields • P(y|x) (Lafferty et al.) where y is a label sequence and x an observation sequence • tj is a transition feature function (depending on the previous and current labels and the observation sequence). • sj is a state feature function (depending on the current label and the observation sequence). • The weighting factors are set during training.
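The formula itself is missing from the extracted slide. A hedged reconstruction, using the usual linear-chain form with the slide's feature functions $t_j$ and $s_j$, is given below; the weight symbols $\lambda_j$, $\mu_j$ and the normalizer $Z(x)$ are standard notation assumed here, not recovered from the slide.

```latex
\begin{align*}
P(y \mid x) &= \frac{1}{Z(x)} \exp\Big( \sum_{i} \sum_{j} \lambda_j\, t_j(y_{i-1}, y_i, x, i)
              + \sum_{i} \sum_{j} \mu_j\, s_j(y_i, x, i) \Big) \\
Z(x) &= \sum_{y'} \exp\Big( \sum_{i} \sum_{j} \lambda_j\, t_j(y'_{i-1}, y'_i, x, i)
              + \sum_{i} \sum_{j} \mu_j\, s_j(y'_i, x, i) \Big)
\end{align*}
```

Here $i$ runs over the positions of the sequence, and $\lambda_j$, $\mu_j$ play the role of the factors estimated during training.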
Conditional Random Fields (CRF) • CRF++ 0.45 • Developed by Taku Kudo (2nd at CoNLL-2000 with an SVM combination) • Parameters: • Features being used: • We can use words and POS tags • We proposed three alternatives: • Using binary combinations of words+POS on a window of size 2 • Using the above + a 3-ary combination of POS • Using only POS on a window of size 3 • Unigrams or bigrams • Scoring either the current word alone OR the pair of words.
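To give an idea of what window-based features look like, here is a small Python sketch. It does not use CRF++ and does not reproduce the exact templates of the experiments, which are not given in the slides: the feature names, the padding token and the choice of adjacent pairs as "binary combinations" are all assumptions.

```python
PAD = ("<PAD>", "<PAD>")  # hypothetical padding token for positions outside the sentence


def window_features(tokens, i, size=2):
    """Unigram and adjacent-pair features around position i.

    tokens is a list of (word, pos) pairs.
    """
    def at(k):
        return tokens[k] if 0 <= k < len(tokens) else PAD

    feats = []
    # Single words and POS tags inside the window
    for off in range(-size, size + 1):
        word, pos = at(i + off)
        feats.append(f"w[{off}]={word}")
        feats.append(f"p[{off}]={pos}")
    # Binary combinations of adjacent words and of adjacent POS tags (assumption)
    for off in range(-size, size):
        (w1, p1), (w2, p2) = at(i + off), at(i + off + 1)
        feats.append(f"w[{off}]/w[{off + 1}]={w1}/{w2}")
        feats.append(f"p[{off}]/p[{off + 1}]={p1}/{p2}")
    return feats


tokens = [("Boeing", "NNP"), ("'s", "POS"), ("747", "CD")]
feats = window_features(tokens, 1)
```

A 3-ary POS combination would add features built from three consecutive POS tags in the same way.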
Conditional Random Fields (CRF) • Results
Conditional Random Fields (CRF) • Analysis: • Bigrams with maximum features -> 93.81% global F1 • Global F1-score does not depend much on the feature window, but on the bigram/unigram selection: tagging pairs of tokens gives more power than tagging single tokens • Occurrences: LST->0, INTJ->1, CONJP->9 => identical results • PRT is the only chunk type which depends more on the feature window, and it works better for size 2 windows. Tagging prepositions relies on bigger windows (ex: out, around, in, …) • To a lesser extent, the same holds for SBAR. (ex: than, rather, …)
Conditional Random Fields (CRF) • How to improve results: • Molina & Pla’s method? Should improve efficiency in SBAR and CONJP • Mixing the different methods?
Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion
Global Analysis • CRF outperforms HMM and SVM • HMM performs better than SVM, thanks to context. Particularly evident for SBAR and PRT. • HMM outperforms CRF for CONJP! • HMM: uses 3-grams -> better for expressions like “as well as”, “rather than” • HMM improvement with Pla’s method
Global Analysis • CRF results are close to the CoNLL-2000 best results: • A finer analysis is needed, per POS
Global Analysis • Combining the three methods:
Global Analysis • Combining does not help for PRT, where the difference between HMM and CRF was big! • Helps… just a bit on SBAR • Global results are better for CRF alone: 93.81>93.57
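One simple way to combine three taggers is token-level majority voting, sketched below in Python. The slides do not say how the combination or its tie-breaking was done, so the `vote` function and the fallback to a designated tagger when all systems disagree are assumptions of this sketch.

```python
from collections import Counter


def vote(*tag_sequences, fallback=0):
    """Combine per-token tags from several taggers by majority vote."""
    combined = []
    for tags in zip(*tag_sequences):
        label, count = Counter(tags).most_common(1)[0]
        # Without a strict majority, trust the designated fallback tagger.
        combined.append(label if count > len(tags) / 2 else tags[fallback])
    return combined


# Toy outputs (invented) of three taggers on a four-token sentence
crf = ["B-NP", "I-NP", "B-PRT", "O"]
hmm = ["B-NP", "B-NP", "B-ADVP", "O"]
svm = ["B-NP", "I-NP", "B-PP", "O"]
combined = vote(crf, hmm, svm)
```

Voting token by token can produce inconsistent B/I sequences, and it cannot help when the taggers make the same mistake on the same tokens.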
Plan • Introduction • Methods • HMM • SVM • CRF • Global analysis • Conclusion
Conclusion • Without lexicalization, SVM performs a lot better than HMM • With lexical specialization, HMM performs better than SVM… and is a lot faster! • Only 3 experiments for voting: too few. Taggers make mistakes for the same POS tags.
Conclusion • At a certain stage, it is hard to improve results. • CRF proves to be efficient without any specific modification -> how can we improve it? => CRF with 3-grams… but probably really slow. • Some fine comparisons with CoNLL results?