Automatic Grammar Induction and Parsing Free Text - Eric Brill

Automatic Grammar Induction and Parsing Free Text - Eric Brill
1998. 11. 12. (Thu.)
POSTECH Dept. of Computer Science
9425021 심 준 혁

Abstract
Eric Brill, Dept. of Computer and Information Science, University of Pennsylvania
• A transformation-based approach to phrase structure (PS) parsing through automatic induction of a natural language grammar.
• Learns a set of ordered transformations which reduce parsing error.
• Parses text into a binary-branching syntactic tree with unlabelled nonterminals.
• The same approach has been applied to:
  • 1) POS tagging, 2) PP attachment, 3) word classification
• Related research
  • Automatically Acquiring Phrase Structure Using Distributional Analysis
  • A Transformation-Based Approach to Prepositional Phrase Attachment
  • A Simple Rule-Based Part of Speech Tagger
CS730B Statistical NLP

Contents
• Introduction
• Transformation-Based Error-Driven Learning
• Learning Phrase Structure
• Experimental Results
• Conclusions

1. Introduction
• A new approach to the grammar induction problem
• Corpora referenced: Penn Treebank, WSJ, ATIS
• Merits
  • Simple system implementation
  • Efficient processing
  • A small set of transformation rules
  • A small training corpus
  • Relatively good accuracy
  • More robust to noise or unfamiliar input than a CFG-based approach
• Drawbacks
  • Time complexity grows in proportion to sentence length
  • Overtraining

2. Transformation-Based Error-Driven Learning
Phrase Structure Learning Algorithm
• Initial state: naively annotate the text.
  • POS tagging: most likely tag
  • PP attachment: low
  • Word classification: nouns
• Learning state: compare to the truth (a manually annotated corpus).
• Making the transformation: the rule is added to the list of transformations.
• Takes sentences tagged with parts of speech and returns a binary tree structure with unlabelled nonterminals.
[Diagram labels: Unannotated Text, Initial State, Annotated Text, Truth (corpus data), Learning, PS Rules]

3. Learning Phrase Structure
• Initial state of the parser
  • Right-branching bracketing.
  • Final punctuation is attached high.
  • [Ex]: ( ( The ( dog ( and ( old ( cat ate ) ) ) ) ) . )
• Structural transformations
  • Transformation types
    • (1-8): (Add/Delete) a (left/right) parenthesis to the (left/right) of POS tag “X”
    • (9-12): (Add/Delete) a (left/right) parenthesis between tags X and Y
  • Example: ( ( The ( dog barked ) ) . )
    • Delete a left parenthesis to the right of “X”
    • Add a right parenthesis to the right of “YY”
    • Add a right parenthesis to the right of “Noun”
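The initial-state bracketing above can be sketched in a few lines. This is a minimal illustration, not the original implementation: the function names `right_branching` and `show`, the nested-tuple tree representation, and the set of sentence-final punctuation marks are all assumptions made for the example.

```python
# Assumed set of sentence-final punctuation marks (not given on the slide).
FINAL_PUNCT = {".", "?", "!"}

def right_branching(words):
    """Bracket a non-empty word list right-linearly, attaching final punctuation high."""
    def chain(items):
        # Builds (w1 (w2 (... (wn-1 wn)))) as nested 2-tuples.
        if len(items) == 1:
            return items[0]
        return (items[0], chain(items[1:]))

    if len(words) > 1 and words[-1] in FINAL_PUNCT:
        # The punctuation mark becomes a sister of the whole right-linear chain.
        return (chain(words[:-1]), words[-1])
    return chain(words)

def show(tree):
    """Render a nested-tuple tree in the slide's parenthesis notation."""
    if isinstance(tree, str):
        return tree
    return "( " + " ".join(show(t) for t in tree) + " )"

print(show(right_branching("The dog and old cat ate .".split())))
# prints: ( ( The ( dog ( and ( old ( cat ate ) ) ) ) ) . )
```

Representing constituents as 2-tuples keeps every node binary by construction, which matches the binary-branching output described earlier.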

3.1. Examples
• “Delete the left parenthesis to the right of the determiner”
• Initial state: ( ( The ( dog barked ) ) . )
• (step 1) Delete the left paren to the right of the determiner
  ( ( The # dog barked ) ) . )
• (step 2) Delete the right paren that matches the just-deleted paren
  ( ( The dog barked # ) . )
• (step 3) Add a left paren to the left of the constituent immediately to the left of the deleted left paren
  ( ( ( The dog barked ) . )
• (step 4) Add a right paren to the right of the constituent immediately to the right of the deleted left paren
  ( ( ( The dog ) barked ) . )
• If there is no constituent immediately to the right, or none immediately to the left, then the transformation fails to apply (redundancy)
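On a binary tree, the four steps above amount to rewriting ( X ( B C ) ) as ( ( X B ) C ) when X is a word carrying the trigger tag. The sketch below illustrates that reading; the `Leaf` type, the function names, the Penn Treebank style tags (DT, NN, VBD), and the choice to apply the rule only at the first match in a left-to-right traversal are assumptions for the example, not details taken from the slides.

```python
from typing import NamedTuple

class Leaf(NamedTuple):
    """A tagged word; internal nodes are plain (left, right) tuples."""
    word: str
    tag: str

def show(tree):
    """Render the tree in the slide's parenthesis notation."""
    if isinstance(tree, Leaf):
        return tree.word
    left, right = tree
    return f"( {show(left)} {show(right)} )"

def delete_left_paren_right_of(tree, tag):
    """Apply 'delete a left paren to the right of <tag>' at the first match.

    Returns (new_tree, changed). A match is a node (X (B C)) where X is a
    leaf with the given tag; it is rewritten to ((X B) C).
    """
    if isinstance(tree, Leaf):
        return tree, False
    left, right = tree
    if isinstance(left, Leaf) and left.tag == tag and not isinstance(right, Leaf):
        b, c = right                     # steps 1-2: drop the parens around (B C)
        return ((left, b), c), True      # steps 3-4: bracket X together with B
    new_left, changed = delete_left_paren_right_of(left, tag)
    if changed:
        return (new_left, right), True
    new_right, changed = delete_left_paren_right_of(right, tag)
    return (left, new_right), changed

start = ((Leaf("The", "DT"), (Leaf("dog", "NN"), Leaf("barked", "VBD"))), Leaf(".", "."))
result, changed = delete_left_paren_right_of(start, "DT")
print(show(result))  # prints: ( ( ( The dog ) barked ) . )
```

When no word with the trigger tag is followed by a bracketed constituent, the function returns the tree unchanged with `changed` set to `False`, mirroring the case where the transformation fails to apply.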

3.2. Learning Transformation
• Process
  • Initialize with the naïve parser.
  • Apply the 12 transformation templates to the sentences.
  • Find the best transformation for the structures output by the parser in its current state (that is, find the general “transformation” that brings about the greatest change).
  • Apply that transformation to the output resulting from bracketing the corpus using the parser in its current state.
  • Add the transformation to the end of the ordered list of transformations.
  • Loop until no further transformation is found.
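The loop above can be sketched generically. This is a hedged illustration under stated assumptions: `learn_transformations`, its parameters, and the `min_gain` cut-off (modelled on the minimum-improvement threshold mentioned on the Results slide) are hypothetical names, and the toy task at the bottom uses character replacement on strings rather than bracketing transformations, purely to keep the example short.

```python
def learn_transformations(current, gold, candidates, apply, score, min_gain=1):
    """Greedily learn an ordered list of transformations.

    current    : annotations produced by the naive initial-state annotator
    gold       : the corresponding correct annotations
    candidates : transformations instantiated from the templates
    apply      : apply(transformation, annotation) -> new annotation
    score      : score(annotation, gold_annotation) -> number (higher is better)
    min_gain   : stop when the best transformation improves the score by less
    """
    current = list(current)
    learned = []
    while True:
        base = sum(score(c, g) for c, g in zip(current, gold))
        best, best_gain, best_out = None, 0, None
        for t in candidates:
            out = [apply(t, c) for c in current]
            gain = sum(score(o, g) for o, g in zip(out, gold)) - base
            if gain > best_gain:
                best, best_gain, best_out = t, gain, out
        if best is None or best_gain < min_gain:
            return learned            # no transformation helps enough
        learned.append(best)          # append to the end of the ordered list
        current = best_out            # the corpus is now in the new state

# Toy stand-in task: transformations are (old, new) character replacements.
toy_rules = learn_transformations(
    current=["aab", "ab"],
    gold=["ccb", "cb"],
    candidates=[("a", "c"), ("b", "d"), ("a", "b")],
    apply=lambda t, s: s.replace(t[0], t[1]),
    score=lambda s, g: sum(x == y for x, y in zip(s, g)),
)
print(toy_rules)  # prints: [('a', 'c')]
```

To parse fresh text with the result, one would run the same initial-state annotator and then apply the learned transformations in the order they were learned.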

(continued)
• Applying the learned transformations
  • Parsing fresh text
  • Naïve parsing → apply the list of best-scoring transformations
• Measure of success: Percentage of Constituents (PoC)
  • Computed by comparison with the correct PS description of the training corpus.
  • The percentage of constituents from sentences output by our system which do not cross any constituents in the Penn Treebank structural description of the sentence.
  • System output ( ( ( The big ) ( dog ate ) ) . ) vs. Penn Treebank ( ( ( The big dog ) ate ) . ) → PoC = 2/4
• Example
  • The 7 best-scoring transformations in the WSJ corpus
  • Mostly “noun phrase extraction” transformations
  • ( ( The ( cat meowed ) ) . ) → ( ( ( The cat ) meowed ) . )
  • ( ( We ( ran ( , ( and ( they walked ) ) ) ) ) . ) → ( ( ( We ran ) ( , ( and ( they walked ) ) ) ) . )
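A small sketch of the crossing check behind this measure follows. The function names and the counting convention are assumptions: every bracket pair is counted as one constituent, including the one spanning the whole sentence, and tokens must be separated by whitespace. Under that convention the example above gives 3 of 4 non-crossing brackets, of which 2 of 4 match a treebank bracket exactly, so the 2/4 quoted on the slide may follow a different counting convention.

```python
def spans(bracketing):
    """Return one (start, end) word span per bracket pair, in closing order."""
    stack, found, position = [], [], 0
    for token in bracketing.split():
        if token == "(":
            stack.append(position)
        elif token == ")":
            found.append((stack.pop(), position))
        else:
            position += 1            # an ordinary word
    return found

def crosses(a, b):
    """Two spans cross when they overlap but neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]

def non_crossing(output, gold):
    """Return (number of output spans crossing no gold span, total output spans)."""
    out, ref = spans(output), spans(gold)
    ok = [s for s in out if not any(crosses(s, g) for g in ref)]
    return len(ok), len(out)

def exact_matches(output, gold):
    """Return (number of output spans also present in gold, total output spans)."""
    out, ref = spans(output), set(spans(gold))
    return sum(s in ref for s in out), len(out)

system = "( ( ( The big ) ( dog ate ) ) . )"
treebank = "( ( ( The big dog ) ate ) . )"
print(non_crossing(system, treebank))   # prints: (3, 4)
print(exact_matches(system, treebank))  # prints: (2, 4)
```

Here only "dog ate" crosses a treebank constituent ("The big dog"); "The big" is nested inside it and therefore does not count as a crossing.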

4. Results
• ATIS corpus (Test Corpus 1)
  • Training corpus = 21% size / sentence length = 11.3 (words) → “compare with p. 222”
  • No crossing constituents = 60%
  • Fewer than two crossing constituents = 74%
  • Fewer than three crossing constituents = 85%
  • (Fig. 2) Percentage correct as a function of the number of transformations
  • Overtraining by specifically learned TR = small percent TS
  • Solution = set a threshold: specify the minimum level of improvement

(continued)
• Random binary branching structure initialization
  • Drops the initial right-linear assumption with final punctuation high
  • 147 transformations in total and 87.13% bracketing accuracy
• WSJ corpus (more complex corpus)
  • Table 2, Table 3, Table 4
  • Inside-Outside Algorithm
    • 90.2% in “1095” 1-15-sentence (11.3 words)
  • Sentence length range
    • Number of transformations, bracketing accuracy
  • Training corpus size
    • Number of transformations, bracketing accuracy
  • Random binary branching structure initialization (“250” 2-15-sentence)
    • 325 transformations in total and 84.72% bracketing accuracy
  • Sentence length distribution
    • Figure 3
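One simple way to produce a random binary-branching starting structure is to pick a split point at random and recurse on both halves. This is only a sketch under assumptions: the function names are hypothetical, the slides do not say how the random structures were drawn, and picking split points uniformly does not make every binary tree equally likely. How final punctuation is treated in this setting is not specified here, so the sketch leaves it out.

```python
import random

def random_binary_tree(words, rng):
    """Return a random binary bracketing of a non-empty word list as nested 2-tuples."""
    if len(words) == 1:
        return words[0]
    split = rng.randint(1, len(words) - 1)   # both halves stay non-empty
    return (random_binary_tree(words[:split], rng),
            random_binary_tree(words[split:], rng))

def leaves(tree):
    """Read the words back off the tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)

sentence = "The dog and old cat ate".split()
tree = random_binary_tree(sentence, random.Random(0))
assert leaves(tree) == sentence   # word order is preserved
```

Passing an explicit `random.Random` instance keeps the initial state reproducible, which makes learned transformation lists comparable across runs.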

5. Conclusion
• A new approach to learning a grammar to automatically parse text → “transformation templates & induced rules”
• The result is relatively high accuracy, and the method is effective (weakly statistical)
• Next project: an algorithm for automatically labelling nonterminals
• Experiments with more advanced transformation procedures
