A Novel Discourse Parser Based on Support Vector Machine Classification Source: ACL 2009 Authors: David A. duVerle and Helmut Prendinger Reporter: Yong-Xiang Chen
Research problem • Automated annotation of a text with hierarchically organized RST relations • Parse discourse within the framework of Rhetorical Structure Theory (RST) • Produce a tree-like structure • Based on SVM classification
Rhetorical Structure Theory (RST) • Mann and Thompson (1988) • A set of structural relations for composing units (‘spans’) of text • 110 distinct rhetorical relations • Relations can be of intentional, semantic, or textual nature • Two-step process (this study focuses on step 2) • 1. Segmentation of the input text into elementary discourse units (‘edus’) • 2. Generation of the rhetorical structure tree • the edus constituting its terminal nodes
Edus: • Nucleus • the relatively more important part of the text • Satellite • subordinate to the nucleus; represents supporting information • [Figure: out-going arrows point from each satellite to its nucleus]
Research restriction • A sequence of edus that have been segmented beforehand • Use the reduced set of 18 rhetorical relations • e.g.: PROBLEM-SOLUTION, QUESTION-ANSWER, STATEMENT-RESPONSE, TOPIC-COMMENT and COMMENT-TOPIC are all grouped under one TOPIC-COMMENT relation • Turn all n-ary rhetorical relations into nested binary relations (see the sketch below) • e.g.: the LIST relation • Only adjacent spans of text can be put in relation within an RST tree (‘Principle of sequentiality’, Marcu 2000)
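A minimal sketch of the binarization step, assuming a hypothetical RSTNode class (not the paper's data structure): an n-ary node such as LIST(a, b, c) is folded into nested, right-branching binary nodes, LIST(a, LIST(b, c)).

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class RSTNode:
    relation: Optional[str] = None       # e.g. "LIST"; None for a leaf edu
    children: List["RSTNode"] = field(default_factory=list)
    text: str = ""                       # set for leaf edus only

def binarize(node: RSTNode) -> RSTNode:
    """Turn an n-ary relation node into nested binary nodes."""
    if not node.children:                # leaf edu: nothing to do
        return node
    kids = [binarize(c) for c in node.children]
    while len(kids) > 2:                 # fold the rightmost pair first
        kids = kids[:-2] + [RSTNode(node.relation, kids[-2:])]
    return RSTNode(node.relation, kids)
```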
18 rhetorical relations • Attribution, Background, Cause, Comparison, Condition, Contrast, Elaboration, Enablement, Evaluation, Explanation, Joint, Manner-Means, Topic-Comment, Summary, Temporal, Topic-Change, Textual-Organization, Same-Unit
Classifier • Input: two consecutive spans (atomic edus or RST sub-trees) from the input text • Output: the likelihood of a direct structural relation, as well as probabilities for • a relation’s label • nuclearity • Gold standard: human cross-validation levels
Two separate classifiers are trained: • S: a binary classifier, for structure • decides the existence of a connecting node between the two input sub-trees • L: a multi-class classifier, for rhetorical relation and nuclearity labeling
Produce a valid tree • Using these classifiers and a straightforward bottom-up tree-building algorithm
Classes • 18 super-relations and 41 classes • Considering only valid nuclearity options • e.g., (ATTRIBUTION, N, S) and (ATTRIBUTION, S, N) are two classes of ATTRIBUTION • but not (ATTRIBUTION, N, N)
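A tiny illustration of how such a class inventory can be enumerated: each relation is paired only with its valid nuclearity patterns. The validity table below is illustrative, not the paper's exact inventory.

```python
# N = nucleus, S = satellite; which patterns are valid per relation is
# derived from the corpus (the assignments below are examples only)
VALID_NUCLEARITY = {
    "ATTRIBUTION": [("N", "S"), ("S", "N")],   # mono-nuclear only
    "CONTRAST":    [("N", "N"), ("N", "S"), ("S", "N")],
    "JOINT":       [("N", "N")],               # multi-nuclear only
    # ... one entry per relation, 18 in total
}

CLASSES = [(rel, left, right)
           for rel, patterns in VALID_NUCLEARITY.items()
           for (left, right) in patterns]      # 41 classes over the full set
```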
Reduce the multi-classification • Reduce the multi-classification problem through a set of binary classifiers, each trained either on a single class (“one vs. all”) or by pair (“one vs. one”)
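A sketch of both reduction schemes with scikit-learn over linear SVMs (an assumed toolkit; the paper does not name one):

```python
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

base = SVC(kernel="linear")

clf_ova = OneVsRestClassifier(base)   # "one vs. all": one binary SVM per class
clf_ovo = OneVsOneClassifier(base)    # "one vs. one": one SVM per class pair

# X: feature vectors for span pairs; y: (relation, nuclearity) class ids
# clf_ovo.fit(X_train, y_train); labels = clf_ovo.predict(X_test)
```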
Input data • Annotated documents taken from the RST-DT corpus • paired with lexicalized syntax trees (LS Trees) for each sentence • a separate test set is used for performance evaluation
Lexicalized syntax trees (LS Trees) • Taken directly from the Penn Treebank corpus then “lexicalized” using a set of canonical head-projection rules • tagged with lexical “heads” on each internal node of the syntactic tree
Algorithm • Repeatedly apply the two classifiers, following a naive bottom-up tree-construction method, to obtain a globally satisfying RST tree for the entire text • Start with a list of all atomic discourse sub-trees • made of single edus in their text order • Recursively select the best match between adjacent sub-trees • using binary classifier S • Label the newly created sub-tree (using multi-class classifier L) and update scoring for S, until only one sub-tree is left (sketched below)
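A compact sketch of this greedy bottom-up loop; score_S and label_L are assumed wrappers around the trained S and L classifiers.

```python
def build_tree(edus, score_S, label_L):
    """Greedy bottom-up RST tree construction.

    score_S(left, right) -> likelihood that the two spans attach
    label_L(left, right) -> (relation, nuclearity) for the new node
    """
    spans = list(edus)                       # atomic sub-trees in text order
    while len(spans) > 1:
        # score every adjacent pair and merge the best-scoring one
        scores = [score_S(spans[i], spans[i + 1])
                  for i in range(len(spans) - 1)]
        i = max(range(len(scores)), key=scores.__getitem__)
        relation, nuclearity = label_L(spans[i], spans[i + 1])
        merged = {"relation": relation, "nuclearity": nuclearity,
                  "children": [spans[i], spans[i + 1]]}
        spans[i:i + 2] = [merged]            # pairs get rescored next pass
    return spans[0]
```

Rescoring all adjacent pairs on every pass keeps the sketch simple; in practice only the pairs touching the new merge actually change.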
Features • ‘S[pan]’ are sub-tree-specific features • Symmetrically extracted from both left and right candidate spans • ‘F[ull]’ are a function of the two sub-trees considered as a pair
Textual Organization • S features: • Number of paragraph boundaries • Number of sentence boundaries • F features: • Belong to same sentence • Belong to same paragraph • Hypothesis: span length and positioning correlate with the rhetorical relation • e.g. the satellite in a CONTRAST relation will tend to be shorter than the nucleus • hence features for span size and positioning • using either tokens or edus as a distance unit • using relative values for positioning and distance (see the sketch below)
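An illustrative extraction of these features, assuming a hypothetical Span object carrying .tokens, .start, .sent_ids and .para_ids fields:

```python
def span_features(span, doc_len):
    """Per-span ('S') features for one candidate span."""
    return {
        "n_sent_bounds": len(set(span.sent_ids)) - 1,  # sentence boundaries crossed
        "n_para_bounds": len(set(span.para_ids)) - 1,  # paragraph boundaries crossed
        "rel_len": len(span.tokens) / doc_len,         # relative span size
        "rel_pos": span.start / doc_len,               # relative position in text
    }

def pair_features(left, right):
    """Pairwise ('F') features over both candidate spans."""
    return {
        "same_sentence":  bool(set(left.sent_ids) & set(right.sent_ids)),
        "same_paragraph": bool(set(left.para_ids) & set(right.para_ids)),
    }
```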
Lexical Clues and Punctuation • Discourse markers are good indications • Instead of a fixed marker list, use an empirical n-gram dictionary (for n ∈ {1, 2, 3}) built from the training corpus and culled by frequency • Reason: also takes into account non-lexical signals such as punctuation • N-gram occurrences are counted and encoded while considering only the first and last n tokens of each span • Classifier accuracy improved by more than 5%
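A sketch of this dictionary construction and encoding; the min_freq cutoff is an assumed parameter, as the paper only says the dictionary is culled by frequency.

```python
from collections import Counter

def edge_ngrams(tokens, n_max=3):
    """Yield the first-n and last-n token n-grams of a span, n = 1..n_max."""
    for n in range(1, n_max + 1):
        yield tuple(tokens[:n])       # span prefix (punctuation included)
        yield tuple(tokens[-n:])      # span suffix

def build_dictionary(training_spans, min_freq=5):
    counts = Counter(g for toks in training_spans for g in edge_ngrams(toks))
    return sorted(g for g, c in counts.items() if c >= min_freq)

def encode(tokens, dictionary):
    """Count dictionary n-grams occurring at the edges of one span."""
    counts = Counter(g for g in edge_ngrams(tokens) if g in dictionary)
    return [counts[g] for g in dictionary]
```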
Simple Syntactic Clues • For achieving better generalization • smaller dependency on lexical content • Add shallow syntactic clues by encoding part-of-speech (POS) tags for both prefix and suffix in each span • prefix/suffix lengths beyond n = 3 did not seem to improve performance
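A minimal version of this encoding using NLTK's off-the-shelf tagger (an assumed tool choice, not the paper's):

```python
import nltk  # requires: nltk.download('averaged_perceptron_tagger')

def pos_affixes(tokens, n=3):
    """POS tags of the first and last n tokens of a span."""
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return {"pos_prefix": tags[:n], "pos_suffix": tags[-n:]}
```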
Dominance Sets • Extracted from the syntax parse trees • Example: it can be difficult to identify the scope of the ATTRIBUTION relation in the sentence below:
Dominance: logical nesting order • In the example, the logical nesting order is 1A > 1B > 1C • This order allows us to favor the relation between 1B and 1C over a relation between 1A and 1B
Dominance Sets • S features: • Distance to root of the syntax tree • Distance to common ancestor in the syntax tree • Dominating node’s lexical head in span • Relative position of lexical head in sentence • F features: • Common ancestor’s POS tag • Common ancestor’s lexical head • Dominating node’s POS tag (diamonds in the figure) • Dominated node’s POS tag (circles in the figure) • Dominated node’s sibling’s POS tag (rectangles in the figure) • (a sketch of some of these features follows)
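A sketch computing a few of these features over a minimal lexicalized-tree node (a hypothetical class, not the paper's data structure); it assumes both spans' heads sit in the same sentence tree.

```python
class TreeNode:
    """Minimal lexicalized syntax-tree node: POS tag, lexical head, parent."""
    def __init__(self, pos, head, parent=None):
        self.pos, self.head, self.parent = pos, head, parent

def path_to_root(node):
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path                              # [node, ..., root]

def dominance_features(a, b):
    pa, pb = path_to_root(a), path_to_root(b)
    b_ancestors = {id(n) for n in pb}
    lca = next(n for n in pa if id(n) in b_ancestors)  # lowest common ancestor
    return {
        "dist_root_a": len(pa) - 1,          # distance to root of the syntax tree
        "dist_lca_a":  pa.index(lca),        # distance to common ancestor
        "lca_pos":     lca.pos,              # common ancestor's POS tag
        "lca_head":    lca.head,             # common ancestor's lexical head
    }
```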
Rhetorical Sub-structure • Structural features for large spans (higher-level relations) • Encoding each span’s rhetorical sub-tree into the feature vector
Evaluation • Raw performance of the SVM classifiers, and the entire tree-building task • Binary classifier S • trained on 52,683 instances • Positive: 1/3, Negative: 2/3 • tested on 8,558 instances • Classifier L • trained on 17,742 instances • labeled across 41 classes • tested on 2,887 instances
Baseline: Reitter’s 2003 result • A smaller set of training instances • 7,976 vs. 17,742 in this work • Fewer classes • 16 rhetorical relation labels with no nuclearity, vs. our 41 nuclearized relation classes
Full System Performance • Compare the structure and labeling of the produced RST tree to the manual annotation • on both perfectly-segmented input and SPADE segmenter output • blank tree structure (‘S’) • with nuclearity (‘N’) • with rhetorical relations (‘R’) • fully labeled structure (‘F’)
Background • Coherence relations reflect the author’s intent • Discourse: a hierarchically structured set of coherence relations • Focuses on a higher-level view of text than the sentence level
Due to small differences in the way they were tokenized and pre-treated, the rhetorical tree and the LS trees are rarely a perfect match: an optimal alignment is found by minimizing edit distances between word sequences (see the sketch below)
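One possible way to compute such an alignment, using Python's standard difflib (the paper does not prescribe a specific tool):

```python
import difflib

def align(rst_tokens, syn_tokens):
    """Map RST-side token indices to syntax-side token indices."""
    sm = difflib.SequenceMatcher(a=rst_tokens, b=syn_tokens, autojunk=False)
    mapping = {}
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping
```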
Features • Use n-fold validation on the S and L classifiers to assess the impact of each set of features on general performance and to eliminate redundant features • the ‘S[pan]’ and ‘F[ull]’ feature types are as defined above