Non-compositional Dependencies and Parse Error Detection
Colin Cherry, Chris Pinchak, Ajit Singh
Introduction • What is a non-compositional phrase (NCP)? • Why do we want to find them? • Why are they hard to detect? • Example • "Mary exploded a bombshell …" means "Mary astonished …", not that anything literally exploded
Non-compositional Phrase Detection • Corpus → Collocation Database • Collocation Database + Thesaurus → set of non-compositional dependency triples {(H R M)}
Problems with parse errors • Hell’s Angels (angel N:subj:N hell) • Honky Tonk Moon – Randy Travis (Moon V:obj:N Randy Travis) • Parse errors are often (incorrectly) labeled non-compositional • Adding them to an idiom list reinforces the errors • It also makes the idiom database unnecessarily large
Methods for Filtering Parse Errors • The Basic Idea • Simple Filter • Hand-tuned Nemesis Lists • Nemesis Classes
The Idea • Every triple contains a word pair and a relationship • Word pairs can participate in multiple relationships • Example: ‘level’ & ‘playing field’ • That really leveled the playing field • (level V:obj:N field) • Now we’re on a level playing field • (field N:mod:A level)
The Idea continued... • Dekang’s software allows us to list all possible ways in which two words can be connected • Some relationships are easily confused • Example: • Hell’s Angels parsed as (angel N:gen:N hell): the angels of hell • Hell’s Angels parsed as (angel N:subj:N hell): hell is angels • Can this help us automatically detect errors?
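To make the idea concrete, here is a minimal sketch in Python that tallies how often each relationship connects a given word pair. The triples and the tallying scheme are invented for illustration; this is not the actual format of Dekang Lin's collocation database.

```python
from collections import Counter, defaultdict

# Invented (head, relation, modifier) triples; a real list would come
# from Dekang Lin's collocation database.
triples = [
    ("level", "V:obj:N", "field"),
    ("level", "V:obj:N", "field"),
    ("field", "N:mod:A", "level"),
    ("angel", "N:gen:N", "hell"),
    ("angel", "N:subj:N", "hell"),
]

# Tally every relationship observed for each unordered word pair.
pair_relations = defaultdict(Counter)
for head, relation, modifier in triples:
    pair_relations[frozenset((head, modifier))][relation] += 1

print(pair_relations[frozenset(("level", "field"))])
# Counter({'V:obj:N': 2, 'N:mod:A': 1})
```

Tallying over unordered pairs is what lets ‘level’ & ‘playing field’ show up under both V:obj:N and N:mod:A.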
Simple Filter • Strong (wrong) assumption: • Any two words can be used in exactly one relationship • That relationship will occur most frequently • The filter: • If the relationship present in the dependency isn’t the most frequently used relationship for the two words, label it a parse error
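A minimal sketch of the simple filter, assuming per-pair relationship counts like those tallied above (all numbers invented):

```python
from collections import Counter

# Per-pair relationship counts (all numbers invented).
pair_relations = {
    frozenset(("level", "field")): Counter({"V:obj:N": 37, "N:mod:A": 12}),
    frozenset(("angel", "hell")): Counter({"N:gen:N": 25, "N:subj:N": 2}),
}

def simple_filter(word1, word2, relation):
    """Flag the triple as a parse error unless its relation is the
    most frequent one observed for this word pair."""
    rels = pair_relations.get(frozenset((word1, word2)))
    if not rels:
        return False  # no evidence either way: keep the triple
    most_frequent, _ = rels.most_common(1)[0]
    return relation != most_frequent

print(simple_filter("angel", "hell", "N:subj:N"))  # True: flagged as error
print(simple_filter("level", "field", "N:mod:A"))  # True: also flagged, showing the assumption's weakness
```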
Nemesis Lists • Much weaker assumption: • For any relationship, there is a closed list of other relationships it can be confused with • If something on this closed list occurs more frequently, label the triple a parse error • List for N:subj:N -> {N:gen:N} • If we see “Hell is angels”, it might really have been “Hell’s angels”
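The same sketch with the filtering condition restricted to a hand-tuned nemesis list; the list contents here are just the slide's N:subj:N example:

```python
from collections import Counter

pair_relations = {
    frozenset(("level", "field")): Counter({"V:obj:N": 37, "N:mod:A": 12}),
    frozenset(("angel", "hell")): Counter({"N:gen:N": 25, "N:subj:N": 2}),
}

# Hand-tuned nemesis list from the slide: N:subj:N is confusable with N:gen:N.
nemesis_lists = {"N:subj:N": {"N:gen:N"}}

def nemesis_filter(word1, word2, relation):
    """Flag a triple only when a nemesis of its relation occurs more
    often for this word pair than the relation itself."""
    rels = pair_relations.get(frozenset((word1, word2)), Counter())
    return any(rels[n] > rels[relation] for n in nemesis_lists.get(relation, ()))

print(nemesis_filter("angel", "hell", "N:subj:N"))  # True: probably "Hell's angels"
print(nemesis_filter("level", "field", "N:mod:A"))  # False: no nemesis list, kept
```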
Nemesis Classes • Problems with lists: • They take too long to build, since they are designed by hand! • Prone to human error / bias • Replace lists with broad classes, where all members of one class share the same nemeses (another class) • We used two classes, one of which had no nemesis (its members were never filtered)
Our Classes • Class 1 – “Normal Relationships” • Subj, obj, modifier, noun-noun • Class 2 – “Preposition-like Relationships” • Of, for, to, in, under, over (anything not in Class 1) • Class 1 nemeses = Class 1 • They interfere with each other • Class 2 nemeses = {} • Their triples pass automatically (see the sketch below)
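A sketch of the class-based variant. The exact relation-label inventory for Class 1 is not spelled out in the slides, so the set below is illustrative:

```python
from collections import Counter

pair_relations = {
    frozenset(("angel", "hell")): Counter({"N:gen:N": 25, "N:subj:N": 2}),
}

# Illustrative Class 1 ("normal") relation labels; anything outside the
# set is treated as Class 2 (preposition-like) and is never filtered.
CLASS1 = {"N:subj:N", "N:gen:N", "V:obj:N", "N:mod:A", "N:nn:N"}

def class_filter(word1, word2, relation):
    """Class 2 triples pass automatically; a Class 1 triple is flagged
    when another Class 1 relation is more frequent for the word pair."""
    if relation not in CLASS1:
        return False  # Class 2: never filtered
    rels = pair_relations.get(frozenset((word1, word2)), Counter())
    return any(count > rels[relation]
               for rel, count in rels.items()
               if rel in CLASS1 and rel != relation)

print(class_filter("angel", "hell", "N:subj:N"))  # True: flagged
print(class_filter("angel", "hell", "N:of:N"))    # False: Class 2 passes
```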
Experimental Design • 100 random idioms, as identified by Dekang’s method • Only one run, since 100 idioms take approx. 1.5 hours of manual parse examination • Selected after designing the filters • Metrics • Error Ratio = retained parse errors / total retained • Precision = filtered parse errors / total filtered • Recall = filtered parse errors / total parse errors
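For clarity, the three metrics as ratios of raw counts; the counts in the example call are invented, not the talk's results:

```python
def evaluate(retained_errors, retained_total,
             filtered_errors, filtered_total, total_errors):
    """The three slide metrics, as ratios of raw counts."""
    return {
        "error_ratio": retained_errors / retained_total,  # errors that slipped through
        "precision": filtered_errors / filtered_total,    # filtered triples that were real errors
        "recall": filtered_errors / total_errors,         # real errors caught by the filter
    }

# Invented counts for illustration only:
print(evaluate(retained_errors=9, retained_total=95,
               filtered_errors=4, filtered_total=5, total_errors=13))
# {'error_ratio': 0.094..., 'precision': 0.8, 'recall': 0.307...}
```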
Future Work • The whole approach seems very limited in terms of recall • The best we could do was 4/13 errors found! • Need more extensive rules: • Confirming relationships, i.e., evidence that a parse is correct • “Syria is Lebanon's main power broker, …” • “Syria, Lebanon's primary power broker, …” • These might allow us to make other rules tougher
More Future Work • Need a way to cut down on manual effort • Labeling is a pain, but manually selecting rules is worse • Machine learning methods: • Decision trees? • What about co-training?
Conclusions • Nemesis lists provide a high-precision, low-recall method • Classes provide better recall at the cost of precision • A user who needs to avoid parse errors can choose whichever method best suits his or her task