Statistical Decision-Tree Models for Parsing NLP Lab, POSTECH 김 지 협
Contents • Abstract • Introduction • Decision-Tree Modeling • SPATTER Parsing • Statistical Parsing Models • Decision-Tree Growing & Smoothing • Decision-Tree Training • Experimental Results • Conclusion
Abstract • Syntactic NL parsers: not adequate for highly ambiguous, large-vocabulary text (e.g., the Wall Street Journal) • Premises for developing a new parser • grammars are too complex to develop manually for most domains • parsing models must rely heavily on contextual information • existing n-gram models are inadequate for parsing • SPATTER: a statistical parser based on decision-tree models • performs better than a grammar-based parser
Introduction • Parsing as making a sequence of disambiguation decisions • The probability of a complete parse tree (T) of a sentence (S) is built from these decisions (see the sketch below) • Automatically discovering the rules for disambiguation • Producing a parser without a complicated grammar • Long-distance lexical information is crucial for disambiguating interpretations accurately
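A sketch of the decomposition this slide refers to, following the summarized paper; d_1, ..., d_n denotes the sequence of disambiguation decisions that builds the tree:

```latex
P(T \mid S) \;=\; \prod_{i=1}^{n} P\!\left(d_i \mid d_1, d_2, \ldots, d_{i-1}, S\right)
```

Each factor is the probability of one disambiguation decision given all earlier decisions and the sentence; the decision-tree models described later estimate these conditional probabilities.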
Decision-Tree Modeling • Comparison • Grammarian: two crucial tasks for parsing • identifying the features relevant to each decision • deciding which choice to select based on the values of those features • Decision tree: the above two tasks plus a third • assigning a probability distribution to the possible choices, and providing a ranking system
Continued • What is a Statistical Decision Tree? • A decision-making device that assigns a probability to each of the possible choices based on the context of the decision • P ( f | h ), where f is an element of the future vocabulary and h is a history (the context of the decision) • The probability is determined by asking a sequence of questions; the i-th question is determined by the answers to the i−1 previous questions • Example: part-of-speech tagging problem (Figure 1)
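A minimal Python sketch of such a device (the node structure, the question about the previous tag, and the toy probabilities are our own illustration, not taken from SPATTER): each internal node asks a question about the history h, and each leaf stores a distribution over the future vocabulary, so P(f | h) is read off the leaf reached from the root.

```python
# Minimal sketch of a statistical decision tree: internal nodes ask
# yes/no questions about the history h; leaves hold P(f | h).

class Node:
    def __init__(self, question=None, yes=None, no=None, distribution=None):
        self.question = question          # callable h -> bool (internal nodes)
        self.yes, self.no = yes, no       # child nodes
        self.distribution = distribution  # dict f -> probability (leaves)

def probability(tree, future, history):
    """Return P(future | history) by walking from the root to a leaf."""
    node = tree
    while node.distribution is None:                  # not yet at a leaf
        node = node.yes if node.question(history) else node.no
    return node.distribution.get(future, 0.0)

# Toy part-of-speech example: one question about the previous tag.
leaf_after_det = Node(distribution={"NOUN": 0.8, "ADJ": 0.2})
leaf_other     = Node(distribution={"VERB": 0.5, "NOUN": 0.3, "ADJ": 0.2})
tree = Node(question=lambda h: h["prev_tag"] == "DET",
            yes=leaf_after_det, no=leaf_other)

print(probability(tree, "NOUN", {"prev_tag": "DET"}))  # 0.8
```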
Continued • Decision Trees vs. n-grams • Equivalent to an interpolated n-gram model in expressive power • Model parameterization • n-gram model: the prediction is conditioned on the previous n−1 events (see the sketch below) • an n-gram model can be represented by a decision-tree model that asks n−1 questions • Example: part-of-speech tagging
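A sketch of the parameterization in standard n-gram notation (the tag reading in the follow-up remark is our interpretation of the part-of-speech example):

```latex
P(w_i \mid w_1, \ldots, w_{i-1}) \;\approx\; P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
```

The equivalent decision tree asks the n−1 questions "What is w_{i-1}?", ..., "What is w_{i-n+1}?" and keeps one distribution for each distinct sequence of answers; for part-of-speech tagging, the w's are replaced by tags.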
Continued • Model Estimation • n-gram model
Continued • decision-tree model • a decision-tree model can be represented by an interpolated n-gram model
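A hedged sketch of the two estimation schemes being contrasted; the counts c(·) and interpolation weights λ are generic notation, not copied from the slides:

```latex
% n-gram estimation: relative frequencies, smoothed by interpolating
% the orders 1 through n
\hat{P}(w_i \mid w_{i-n+1}, \ldots, w_{i-1})
  \;=\; \sum_{m=1}^{n} \lambda_m\,
        \frac{c(w_{i-m+1}, \ldots, w_i)}{c(w_{i-m+1}, \ldots, w_{i-1})}

% decision-tree estimation: relative frequencies at a leaf u, smoothed
% by interpolating with the distribution of u's parent (recursively up
% to the root)
P_{u}(f) \;=\; \lambda_{u}\, \frac{c(f, u)}{c(u)}
         \;+\; (1 - \lambda_{u})\, P_{\mathrm{parent}(u)}(f)
```

Because the ancestors of a leaf play the same role as the lower-order n-grams, a smoothed decision tree behaves like an interpolated n-gram model, which is the equivalence the slide states.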
Continued • Why use decision trees? • As n grows, the parameter space of an n-gram model grows exponentially • On the other hand, the decision-tree learning algorithm increases the size of a model only as the training data allows • So it can take much more contextual information into account
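A rough illustration of the difference in growth; the numbers are our own, chosen only for scale:

```latex
\text{n-gram contexts} \;\sim\; V^{\,n-1},
\qquad \text{e.g. } V = 50,\; n = 5
\;\Rightarrow\; 50^{4} = 6{,}250{,}000 \text{ contexts}
```

A decision tree, by contrast, adds one question (one split) at a time and only where the growing data supports it, so its size is bounded by the training set rather than by the order n.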
SPATTER Parsing • SPATTER Representation • Parse: represented as a geometric pattern • 4 features per node: words, tags, labels, and extensions (Figure 3) • The Parsing Algorithm • Starts with the sentence's words as leaves (Figure 3) • Gradually tags, labels, and extends nodes • Constraints • Bottom-up, left-to-right • No new node is constructed until its children are completed • The number of active nodes is restricted using derivational window constraints (DWC) • A single-rooted, labeled tree is constructed
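A minimal Python sketch of the node representation this slide describes; only the four features come from the slide, while the field names, types, and example sentence are our own:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Each SPATTER node carries the four features named on the slide:
# a word (for leaves), a part-of-speech tag, a constituent label,
# and an extension describing how the node attaches to its parent.

@dataclass
class SpatterNode:
    word: Optional[str] = None        # lexical item (leaves only)
    tag: Optional[str] = None         # part-of-speech tag
    label: Optional[str] = None       # non-terminal / constituent label
    extension: Optional[str] = None   # how the node extends toward its parent
    children: List["SpatterNode"] = field(default_factory=list)

# Parsing starts from the sentence's words as leaves; tagging, labeling,
# and extending then proceed bottom-up, left to right.
sentence = ["List", "the", "files"]
leaves = [SpatterNode(word=w) for w in sentence]
print(leaves[0])
```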
Statistical Parsing Models • The Tagging Model • The Extension Model • The Label Model • The Derivation Model • The Parsing Model
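A hedged sketch of how these models fit together; the notation simplifies the conditioning contexts, but the overall shape follows the summarized paper: a derivation d is a sequence of tagging, labeling, and extension decisions, the derivation model governs which node is acted on next, and the parsing model sums over all derivations that yield the same tree:

```latex
P(T \mid S) \;=\; \sum_{d \in \mathcal{D}(T)} P(d \mid S)
            \;=\; \sum_{d \in \mathcal{D}(T)} \prod_{i=1}^{|d|}
                  P\!\left(d_i \mid d_1, \ldots, d_{i-1}, S\right)
```

When the derivational constraints allow only one order of decisions, the sum reduces to the single product shown after the Introduction slide.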
Decision-Tree Growing & Smoothing • 3 main models (tagging, extension, and label) • The training corpus is divided into 2 sets: 90% for growing and 10% for smoothing • Growing & Smoothing Algorithm • Figure 3.5
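A small Python sketch of how the interpolation weight of one leaf could be estimated on the held-out (smoothing) portion by EM; the toy distributions and held-out events are our own illustration, not the paper's algorithm:

```python
# Smooth a leaf distribution toward its parent's distribution:
#   P(f) = lam * p_leaf(f) + (1 - lam) * p_parent(f)
# and estimate lam on held-out data with EM.

p_leaf   = {"NOUN": 0.90, "VERB": 0.05, "ADJ": 0.05}   # grown on the 90% split
p_parent = {"NOUN": 0.50, "VERB": 0.30, "ADJ": 0.20}   # parent node's estimate

heldout = ["NOUN", "NOUN", "VERB", "NOUN", "ADJ", "NOUN", "VERB"]  # 10% split

lam = 0.5                                   # initial guess
for _ in range(50):                         # EM iterations
    posterior_sum = 0.0
    for f in heldout:
        num = lam * p_leaf[f]               # E-step: responsibility of the leaf term
        den = num + (1 - lam) * p_parent[f]
        posterior_sum += num / den
    lam = posterior_sum / len(heldout)      # M-step: updated weight

smoothed = {f: lam * p_leaf[f] + (1 - lam) * p_parent[f] for f in p_leaf}
print(round(lam, 3), smoothed)
```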
Decision-Tree Training • The parsing model cannot be estimated by direct frequency counts, because the model contains a hidden component: the derivation model • The corpus contains no information about the order of derivations • So the training process must discover which derivations assign higher probability to the correct parses • Forward-backward reestimation is used
Continued • Training Algorithm (sketched below)
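A hedged sketch of the reestimation step the slides refer to; the weighting shown is the standard EM treatment of a hidden derivation, not a transcription of the paper's pseudocode:

```latex
% E-step: each derivation d of a training parse T is weighted by its
% share of the parse probability
w(d) \;=\; \frac{P(d \mid S)}
               {\displaystyle \sum_{d' \in \mathcal{D}(T)} P(d' \mid S)}

% M-step: decision-tree counts are accumulated with these weights, and
% the tagging, label, extension, and derivation models are re-estimated
% from the weighted counts; the procedure is then iterated.
```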
Experimental Results • IBM Computer Manual domain • Annotated by the University of Lancaster • 195 part-of-speech tags and 19 non-terminal labels • Trained on 30,800 sentences and tested on 1,473 new sentences • 0-crossing-brackets score • IBM's rule-based, unification-style PCFG parser: 69% • SPATTER: 76%
Continued • Wall Street Journal • Tests the ability to parse a highly ambiguous, large-vocabulary domain accurately • Annotated in the Penn Treebank, version 2 • 46 part-of-speech tags and 27 non-terminal labels • Trained on 40,000 sentences and tested on 1,920 new sentences • Evaluated using PARSEVAL
Conclusion • Large amounts of contextual information can be incorporated into a statistical parsing model by applying decision-tree learning algorithms • The rules for disambiguation can be discovered automatically