Asma Naseer. Chunking: Shallow Parsing
Introduction • Shallow Parsing or Partial Parsing • First proposed by Steven Abney (1991) • Breaking text up into small pieces • Each piece is parsed separately [1]
Introduction (continue . . . ) • Words are not arranged flatly in a sentence but are grouped into smaller units called phrases • The girl was playing in the street • اس نے احمد کو کتاب دی (He gave the book to Ahmad)
Introduction (continue . . . ) • Chunks are non-recursive (a chunk does not contain a phrase of the same category as itself) • NP → D? AdjP? AdjP? N • The big red balloon • [NP [D The] [AdjP [Adj big]] [AdjP [Adj red]] [N balloon]] [1]
Introduction (continue . . . ) • Each phrase is dominated by a head h • A man proud of his son. / A proud man • The root of the chunk has h as its s-head (semantic head) • The head of a noun phrase is usually a noun or pronoun [1]
Chunk Tagging • IOB (Inside, Outside, Begin) tagging • Noun-phrase tags: B-NP, I-NP, O • Verb-phrase tags: B-VP, I-VP, O • قائد اعظم محمد علی جناح نے قوم سے خطاب کیا (Quaid-e-Azam Muhammad Ali Jinnah addressed the nation) • [قائد اعظم B-NP] [محمد I-NP] [علی I-NP] [جناح I-NP] [نے O] [قوم B-NP] [سے O] [خطاب B-NP] [کیا O]
Chunk Tagging (continue . . .) • Variants of the tagging scheme • IOBE • IOB • IO
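To make the IOB labelling concrete, here is a minimal Python sketch (not from the original slides); the sentence, the chunk boundaries and the to_iob helper are illustrative only.

```python
def to_iob(chunks):
    """Convert (chunk_type, words) groups into per-word IOB tags.
    chunk_type is None for words outside any chunk."""
    tagged = []
    for chunk_type, words in chunks:
        for i, word in enumerate(words):
            if chunk_type is None:
                tagged.append((word, "O"))
            elif i == 0:
                tagged.append((word, "B-" + chunk_type))
            else:
                tagged.append((word, "I-" + chunk_type))
    return tagged

# "The girl was playing in the street"
sentence = [("NP", ["The", "girl"]), ("VP", ["was", "playing"]),
            ("PP", ["in"]), ("NP", ["the", "street"])]
print(to_iob(sentence))
# [('The', 'B-NP'), ('girl', 'I-NP'), ('was', 'B-VP'), ('playing', 'I-VP'),
#  ('in', 'B-PP'), ('the', 'B-NP'), ('street', 'I-NP')]
```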
Research Work • Rule Based Vs Statistical Based Chunking [2] • Use of Support Vector Learning for Chunk Identification [5] • A Context Based Maximum Likelihood Approach to Chunking [6] • Chunking with Maximum Entropy Models [7] • Single-Classifier Memory-Based Phrase Chunking [8] • Hybrid Text Chunking [9] • Shallow Parsing as POS Tagging [3]
Rule Based Vs Statistical Based Chunking • Two techniques are used • Regular expression rules: a shallow parse based on hand-written regular expressions • N-gram statistical tagger (machine-based chunking): NLTK (Natural Language Toolkit), based on the TnT tagger (Trigrams'n'Tags) • Basic idea: reuse a POS tagger for chunking
Rule Based Vs Statistical Based Chunking (continue… ) • Regular expression rules • The regular expressions must be developed manually • N-gram statistical tagger • Can be trained on gold-standard chunked data
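As an illustration of the rule-based side, the sketch below uses NLTK's RegexpParser with a single hand-written NP rule mirroring the D? AdjP? AdjP? N pattern from the introduction; the grammar and the example sentence are toy stand-ins, not the rule set evaluated in [2].

```python
import nltk

# One manually written regular-expression rule for NP chunks:
# optional determiner, any number of adjectives, one or more nouns.
grammar = r"NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tagged = [("The", "DT"), ("big", "JJ"), ("red", "JJ"),
          ("balloon", "NN"), ("deflated", "VBD")]
print(chunker.parse(tagged))
# (S (NP The/DT big/JJ red/JJ balloon/NN) deflated/VBD)
```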
Rule Based Vs Statistical Based Chunking (continue… ) • The focus is on verb- and noun-phrase chunking • Noun phrases • A noun or pronoun is the head • May also contain • Determiners, i.e. articles, demonstratives, numerals, possessives and quantifiers • Adjectives • Complements (adpositional phrases, relative clauses) • Verb phrases • A verb is the head • Often one or two complements • Any number of adjuncts
Rule Based Vs Statistical Based Chunking (continue… ) • Training NLTK on chunk data • Starts with an empty rule set • 1. Define or refine a rule • 2. Run the chunker on the training data • 3. Compare the results with the previous run • Repeat steps 1-3 until performance no longer improves significantly (a rough sketch of this loop follows below) • Issue: the corpus contains 211,727 phrases in total; only a 1,000-phrase subset was used
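The refine-run-compare loop might look roughly like the following sketch; the sequence of grammar versions and the gold-standard sentences are supplied by the analyst, and the NLTK evaluation call is my assumption about how each run would be scored, not code from [2].

```python
import nltk

def refine_rules(grammar_versions, gold_sents, min_gain=0.001):
    """Run successively refined regular-expression grammars over
    gold-standard chunked sentences, keeping the best grammar and
    stopping once F-measure no longer improves significantly."""
    best_f, best_grammar = 0.0, None
    for grammar in grammar_versions:                     # step 1: a (re)defined rule set
        chunker = nltk.RegexpParser(grammar)
        f = chunker.evaluate(gold_sents).f_measure()     # steps 2-3: run and compare
        if f - best_f < min_gain:
            break                                        # performance no longer improves
        best_f, best_grammar = f, grammar
    return best_grammar, best_f
```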
Rule Based Vs Statistical Based Chunking (continue… ) • Training TnT on chunk data • Chunking is treated as statistical tagging • Two steps • Parameter generation: create model parameters from the training corpus • Tagging: tag each word with a chunk label
Rule Based Vs Statistical Based Chunking (continue… ) • Data set • WSJ: the Wall Street Journal, a New York newspaper covering US and international business and financial news • Training: sections 15-18 • Testing: section 20 • Both tagged with POS and IOB chunk tags • Special characters are treated like any other POS; punctuation is tagged as O
Rule Based Vs Statistical Based Chunking (continue… ) • Results • Precision: P = |reference ∩ test| / |test| • Recall: R = |reference ∩ test| / |reference| • F-measure: Fα = 1 / (α/P + (1−α)/R), with α = 0.5 • which gives F = (2 · P · R) / (P + R)
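The same measures in a short sketch; representing each chunk as a (start, end, type) triple is an assumption made for illustration.

```python
def chunk_scores(reference, test):
    """Precision, recall and F-measure over sets of chunks,
    mirroring the formulas above."""
    correct = len(reference & test)
    p = correct / len(test)
    r = correct / len(reference)
    f = 2 * p * r / (p + r)
    return p, r, f

ref = {(0, 1, "NP"), (2, 3, "VP"), (5, 6, "NP")}
hyp = {(0, 1, "NP"), (2, 3, "VP"), (4, 6, "NP")}
print(chunk_scores(ref, hyp))   # (0.667, 0.667, 0.667)
```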
Rule Based Vs Statistical Based Chunking (continue… ) • Results • NLTK • TnT
Use of Support Vector Learning for Chunk Identification • SVMs (large-margin classifiers) • Introduced by Vapnik (1995) • Two-class pattern recognition problem • Good generalization performance • High accuracy in text categorization without overfitting (Joachims, 1998; Taira and Haruno, 1999)
Use of Support Vector Learning for Chunk Identification ( continue… ) • Training data: (x1, y1), …, (xl, yl), with xi ∈ Rn and yi ∈ {+1, −1} • xi is the i-th sample, represented by an n-dimensional vector • yi is the class label (positive or negative) of the i-th sample • In SVM • Positive and negative examples are separated by a hyperplane • SVM finds the optimal hyperplane
Use of Support Vector Learning for Chunk Identification ( continue… ) • Two possible hyperplanes
Use of Support Vector Learning for Chunk Identification ( continue… ) • Chunks in the CoNLL-2000 shared task are IOB-tagged • Each chunk type has an I- and a B- variant, e.g. I-NP and B-NP • 22 such chunk tags occur in CoNLL-2000 • The chunking problem becomes classification over these 22 tags • SVM is a binary classifier, so it is extended to k classes • One class vs. all others, or • Pairwise classification: k · (k−1) / 2 classifiers, i.e. 22 · 21 / 2 = 231 classifiers • The majority vote decides the final class (a toy sketch follows below)
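A toy sketch of the pairwise setup using scikit-learn rather than the SVM implementation of [5]; the data below is synthetic and only shows the call pattern and the 231 pairwise decision scores.

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.default_rng(0).integers(0, 2, size=(220, 50)).astype(float)  # binary features
y = np.arange(220) % 22                       # 22 chunk-tag classes, 10 samples each

# SVC trains one binary SVM per pair of classes (one-vs-one) and lets
# the majority vote over those classifiers decide the final class.
clf = SVC(kernel="linear", decision_function_shape="ovo")
clf.fit(X, y)
print(clf.decision_function(X[:1]).shape)     # (1, 231): one score per classifier pair
print(clf.predict(X[:1]))                     # final class chosen by majority vote
```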
Use of Support Vector Learning for Chunk Identification ( continue… ) • The feature vector consists of • Words: w • POS tags: t • Chunk tags: c • To identify the chunk tag ci at the i-th word, use • wj, tj (j = i−2, i−1, i, i+1, i+2) • cj (j = i−2, i−1) • All features are expanded to binary values, either 0 or 1 • The total dimensionality of the feature vector becomes 92,837
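A sketch of this feature template; the dictionary-of-binary-indicators representation and the example sentence are illustrative, and the 92,837 dimensions only arise after expanding the template over the whole training set.

```python
def chunk_features(words, tags, chunks, i):
    """Binary features for position i: words and POS tags in a +/-2
    window plus the two previous chunk tags, as described above.
    Each distinct key becomes one 0/1 dimension of the feature vector."""
    feats = {}
    for j in range(i - 2, i + 3):
        if 0 <= j < len(words):
            feats["w[%+d]=%s" % (j - i, words[j])] = 1
            feats["t[%+d]=%s" % (j - i, tags[j])] = 1
    for j in (i - 2, i - 1):
        if j >= 0:
            feats["c[%+d]=%s" % (j - i, chunks[j])] = 1
    return feats

words = ["He", "reckons", "the", "current", "deficit"]
tags = ["PRP", "VBZ", "DT", "JJ", "NN"]
chunks = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP"]
print(chunk_features(words, tags, chunks, 3))
```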
Use of Support Vector Learning for Chunk Identification ( continue… ) Results • Training the 231 classifiers took about one day • PC/Linux, Celeron 500 MHz, 512 MB RAM • Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP • Precision = 93.45% • Recall = 93.51% • Fβ=1 = 93.48%
A Context Based Maximum Likelihood Approach to Chunking Training • Based on POS tags • Construct symmetric n-contexts from the training corpus • 1-context: the most common chunk label for each single tag [t0] • 3-context: the tag together with the tags before and after it [t−1, t0, t+1] • 5-context: [t−2, t−1, t0, t+1, t+2] • 7-context: [t−3, t−2, t−1, t0, t+1, t+2, t+3]
A Context Based Maximum Likelihood Approach to Chunking (continue . . .) Training • For each context, find the most frequent chunk label • CC → [O CC] • PRP CC RP → [B-NP CC] • To save storage space, an n-context is stored only if it differs from its nearest lower-order context
A Context Based Maximum Likelihood Approach to Chunking (continue . . .) Testing • Construct the maximum context for each tag • Look it up in the database of most likely patterns • If the largest context is not found, the context is reduced step by step • The only rule for chunk labelling is to look up [t−3, t−2, t−1, t0, t+1, t+2, t+3], …, [t0] until the context is found (see the sketch below)
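A toy reconstruction of the training and look-up steps; the '#' padding symbol and the omission of the storage-saving pruning step are my own simplifications of the approach in [6].

```python
from collections import Counter, defaultdict

def train_contexts(sents):
    """sents: list of [(pos_tag, chunk_label), ...].  For every symmetric
    POS context of width 1, 3, 5 and 7, record the most frequent label."""
    counts = defaultdict(Counter)
    for sent in sents:
        tags = [t for t, _ in sent]
        for i, (_, label) in enumerate(sent):
            for n in (1, 3, 5, 7):
                k = n // 2
                ctx = tuple(tags[j] if 0 <= j < len(tags) else "#"
                            for j in range(i - k, i + k + 1))
                counts[ctx][label] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def chunk_label(table, tags, i):
    """Back off from the 7-context down to the 1-context until a match is found."""
    for n in (7, 5, 3, 1):
        k = n // 2
        ctx = tuple(tags[j] if 0 <= j < len(tags) else "#"
                    for j in range(i - k, i + k + 1))
        if ctx in table:
            return table[ctx]
    return "O"   # fall-back for a completely unseen tag
```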
A Context Based Maximum Likelihood Approach to Chunking (continue . . .) Results • The best results are achieved for 5-context • ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP • Precision = 86.24% • Recall = 88.25% • Fβ=1 = 87.23%
Chunking with Maximum Entropy Models • Maximum entropy models are exponential models • Collect as much information as possible: frequencies of events relevant to the process • The MaxEnt model has the form P(w|h) = (1 / Z(h)) · exp(Σi λi fi(h, w)) • fi(h, w) is a binary-valued feature describing an event • λi describes how important fi is • Z(h) is a normalization factor
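A minimal numerical sketch of this model form; the feature names, weights and label set below are invented for illustration and do not come from the trained WSJ model in [7].

```python
import math

def maxent_prob(weights, active_features, labels):
    """P(w|h) = exp(sum_i lambda_i * f_i(h, w)) / Z(h) for binary features.
    active_features(label) returns the names of the features that fire
    for the event (h, label); weights maps feature name -> lambda_i."""
    scores = {w: math.exp(sum(weights.get(f, 0.0) for f in active_features(w)))
              for w in labels}
    Z = sum(scores.values())                   # normalization factor Z(h)
    return {w: s / Z for w, s in scores.items()}

weights = {"tag=DT&label=B-NP": 1.2, "tag=DT&label=O": -0.5}
labels = ["B-NP", "I-NP", "O"]
print(maxent_prob(weights, lambda w: ["tag=DT&label=" + w], labels))
# B-NP receives the largest share of the probability mass
```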
Chunking with Maximum Entropy Models (continue . . .) Attributes Used • Information in the WSJ corpus • Current word • POS tag of the current word • Surrounding words • POS tags of surrounding words • Context • Left context: 3 words • Right context: 2 words • Additional information • Chunk tags of the previous 2 words
Chunking with Maximum Entropy Models (continue . . .) Results • Tagging accuracy = (# of correctly tagged words) / (total # of words) = 95.5% • Recall = (# of correctly proposed base NPs) / (# of correct base NPs) = 91.86% • Precision = (# of correctly proposed base NPs) / (# of proposed base NPs) = 92.08% • Fβ=1 = ((β² + 1) · Precision · Recall) / (β² · Precision + Recall) = 91.97%
Hybrid Text Chunking • Context-based lexicon and HMM-based chunker • Statistics were first used for chunking by Church (1988) • Corpus frequencies were used • Non-recursive noun phrases were identified • Skut & Brants (1998) modified Church's approach and used a Viterbi tagger
Hybrid Text Chunking (continue . . .) • Error-driven HMM-based text chunker • Memory is reduced by keeping only positive lexical entries • HMM-based text chunker with a context-dependent lexicon • Given G1n = g1, g2, …, gn, find the optimal tag sequence T1n = t1, t2, …, tn • Maximize log P(T1n | G1n) • log P(T1n | G1n) = log P(T1n) + log [ P(T1n, G1n) / ( P(T1n) · P(G1n) ) ]
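Finding the tag sequence that maximizes log P(T1n | G1n) is a dynamic-programming search; the sketch below is a generic Viterbi decoder over chunk tags, not the error-driven, context-dependent-lexicon model of [9], and its score tables are assumed to be supplied by whatever model is used.

```python
def viterbi(obs, states, log_init, log_trans, log_emit):
    """Return the state (chunk-tag) sequence with the highest summed log score.
    log_init[s], log_trans[prev][s] and log_emit[s][o] are log-probabilities."""
    UNSEEN = -1e9                                    # score for an unseen emission
    V = [{s: log_init[s] + log_emit[s].get(obs[0], UNSEEN) for s in states}]
    back = []
    for o in obs[1:]:
        col, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p][s])
            col[s] = V[-1][prev] + log_trans[prev][s] + log_emit[s].get(o, UNSEEN)
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]     # best final state
    for ptr in reversed(back):                       # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))
```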
Shallow Parsing as POS Tagging • CoNLL-2000 data: used for training and testing • Ratnaparkhi's maximum-entropy-based POS tagger • No change in its internal operation • The information supplied for training is increased
Shallow Parsing as POS Tagging (continue . . .) Shallow Parsing vs. POS Tagging • Shallow parsing requires more of the surrounding POS/lexical syntactic environment • Training configurations • Words: w1 w2 w3 • POS tags: t1 t2 t3 • Chunk types: c1 c2 c3 • Suffixes or prefixes
Shallow Parsing as POS Tagging (continue . . .) • The amount of information is gradually increased • Word only: w1 • Tag only: t1 • Word, tag and chunk label: (w1 t1 c1) • The current chunk label is accessed through another model with configurations of words and tags (w1 t1) • To deal with sparseness • t1, t2 • c1 • c2 (last two letters) • w1 (first two letters)
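A rough illustration of reusing an off-the-shelf tagger for chunking, built on NLTK's bundled CoNLL-2000 data and simple n-gram taggers instead of Ratnaparkhi's maximum-entropy tagger; encoding each token as word_POS to approximate the (w1 t1) configuration is my own simplification, and the corpus must first be fetched with nltk.download('conll2000').

```python
import nltk
from nltk.corpus import conll2000

def to_tagger_format(chunked_sent):
    # Encode each token as "word_POS" and treat the chunk label as the
    # "tag" to be predicted, so an ordinary POS tagger can learn chunking.
    return [(word + "_" + pos, chunk)
            for word, pos, chunk in nltk.chunk.tree2conlltags(chunked_sent)]

train = [to_tagger_format(s) for s in conll2000.chunked_sents("train.txt")]
test = [to_tagger_format(s) for s in conll2000.chunked_sents("test.txt")]

# Bigram tagger backed off to a unigram tagger, then to a default "O" tag.
tagger = nltk.BigramTagger(
    train, backoff=nltk.UnigramTagger(train, backoff=nltk.DefaultTagger("O")))
print(tagger.accuracy(test))   # per-token chunk-tag accuracy ("evaluate" in older NLTK)
```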
Shallow Parsing as POS Tagging (continue . . .) • (w1 t1 c1)
Shallow Parsing as POS Tagging (continue . . .) • Sparseness Handling
Shallow Parsing as POS Tagging (continue . . .) • Overall Results
Shallow Parsing as POS Tagging (continue . . .) Error Analysis • Three groups of errors • Difficult syntactic constructs • Punctuation • Ditransitive vs. transitive VPs • Adjective vs. adverbial phrases • Mistakes made by the annotators in the training or testing data • Noise • POS errors • Odd annotation decisions • Errors peculiar to the approach • The exponential distribution assigns non-zero probability to all events • The tagger may assign illegal chunk labels (e.g. I-NP where the word is not inside an NP)
Shallow Parsing as POS Tagging (continue . . .) Comments • PPs are easy to identify • ADJPs and ADVPs are hard to identify correctly (more syntactic information is required) • Performance on NPs can be further improved • Performance using w1 alone or t1 alone is almost the same; using both features improves performance
References • [1] Philip Brooks, “A Simple Chunk Parser”, May 8, 2003. • [2] Igor Boehm, “Rule Based vs. Statistical Chunking of CoNLL Data Set”. • [3] Miles Osborne, “Shallow Parsing as POS Tagging”. • [4] Hans van Halteren, “Chunking with WPDV Models”. • [5] Taku Kudoh and Yuji Matsumoto, “Use of Support Vector Learning for Chunk Identification”, in Proceedings of CoNLL-2000 and LLL-2000, pages 142-144, Portugal, 2000. • [6] Christer Johansson, “A Context Sensitive Maximum Likelihood Approach to Chunking”. • [7] Rob Koeling, “Chunking with Maximum Entropy Models”. • [8] Jorn Veenstra and Antal van den Bosch, “Single-Classifier Memory-Based Phrase Chunking”. • [9] GuoDong Zhou, Jian Su and TongGuan Tey, “Hybrid Text Chunking”.