Toward Dependency Path based Entailment Rodney Nielsen, Wayne Ward, and James Martin
Dependency Path-based Entailment • DIRT (Lin and Pantel, 2001) • Unsupervised method to discover inference rules • “X is author of Y ≈ X wrote Y” • “X solved Y ≈ X found a solution to Y” • If two dependency paths tend to link the same sets of words, their meanings are hypothesized to be similar
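A minimal sketch of the DIRT intuition, assuming toy paths and filler pairs; the similarity function here is a simple set overlap for illustration, not Lin and Pantel's actual mutual-information measure.

```python
# Sketch only: two dependency paths are hypothesized to be similar when they
# tend to connect the same slot fillers.  Paths and fillers are invented toy data.

def jaccard(a, b):
    """Overlap of two filler sets (stand-in for DIRT's real similarity score)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

# path -> set of (X, Y) argument pairs observed in a corpus (toy data)
path_fillers = {
    "X wrote Y":        {("Tolstoy", "War and Peace"), ("Orwell", "1984")},
    "X is author of Y": {("Tolstoy", "War and Peace"), ("Austen", "Emma")},
    "X solved Y":       {("Wiles", "Fermat's Last Theorem")},
}

p1, p2 = "X wrote Y", "X is author of Y"
print(p1, "~", p2, ":", jaccard(path_fillers[p1], path_fillers[p2]))
```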
ML Classification Approach • Features derived from corpus statistics • Unigram co-occurrence • Surface form bigram co-occurrence • Dependency-derived bigram co-occurrence • Mixture of experts: • About 18 ML classifiers from the Weka toolkit • Classify by majority vote or average probability • [Diagram: Dependency Path Based Entailment positioned between Bag of Words and Graph Matching]
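A minimal mixture-of-experts sketch, assuming a few scikit-learn classifiers as stand-ins for the roughly 18 Weka classifiers the slides mention; the toy data, classifier choices, and hyperparameters are illustrative only.

```python
# Sketch only: combine several classifiers by majority vote or average probability,
# mirroring the mixture-of-experts idea (the original work used ~18 Weka classifiers).
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)  # toy data

experts = [("lr", LogisticRegression(max_iter=1000)),
           ("nb", GaussianNB()),
           ("dt", DecisionTreeClassifier(max_depth=3))]

majority = VotingClassifier(experts, voting="hard").fit(X, y)  # majority vote
avg_prob = VotingClassifier(experts, voting="soft").fit(X, y)  # average probability

print(majority.predict(X[:5]))
print(avg_prob.predict_proba(X[:5])[:, 1])
```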
Corpora • 7.4M articles, 2.5B words, 347 words/doc • Gigaword (Graff, 2003) – 77% of documents • Reuters Corpus (Lewis et al., 2004) • TIPSTER • Lucene IR engine • Two indices • Word surface form • Porter stem filter • Stop words = {a, an, the}
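A toy sketch of the two-index setup in plain Python (the slides use the Lucene IR engine); the Porter stemmer comes from NLTK, the documents are invented, and only the slide's stop words {a, an, the} are filtered.

```python
# Sketch only: one index over surface forms, one over Porter stems.
from collections import defaultdict
from nltk.stem import PorterStemmer

STOP = {"a", "an", "the"}          # stop words from the slide
stem = PorterStemmer().stem

def build_index(docs, use_stems=False):
    """Map each (optionally stemmed) token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for tok in text.lower().split():
            if tok in STOP:
                continue
            index[stem(tok) if use_stems else tok].add(doc_id)
    return index

docs = ["Newspapers choke on rising paper costs and falling revenue",
        "The cost of paper is rising"]
surface_index = build_index(docs)                  # word surface form
stem_index = build_index(docs, use_stems=True)     # Porter stem filter
print(stem_index["cost"])   # both documents match once "costs" is stemmed
```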
Dependency Features • Dependency bigram features • [Figure: dependency trees for text t “Newspapers choke on rising paper costs and falling revenues” and hypothesis h “The cost of paper is rising”]
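A hedged sketch of dependency-derived bigrams: word pairs linked by a dependency edge rather than by surface adjacency. The parse edges below are hand-coded for the slides' example text, and the relation names are illustrative rather than the authors' parser output.

```python
# (governor, relation, dependent) edges for
# "Newspapers choke on rising paper costs and falling revenues" (hand-coded parse)
parse_t = [
    ("choke", "subj", "Newspapers"),
    ("choke", "prep", "on"),
    ("on", "pcomp-n", "costs"),
    ("costs", "mod", "rising"),
    ("costs", "nn", "paper"),
    ("costs", "conj", "revenues"),
    ("revenues", "mod", "falling"),
]

def dependency_bigrams(edges):
    """Governor/dependent word pairs, ignoring the relation label."""
    return {(gov, dep) for gov, _, dep in edges}

print(dependency_bigrams(parse_t))
# e.g. ("costs", "rising") is a dependency bigram even though the two words
# are not adjacent in the surface string.
```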
Dependency Features • Descendent relation statistics • [Figure: dependency trees for text t and hypothesis h, as above]
Verb Dependency Features • Combined verb descendent relation features • Worst verb descendent relation features • [Figure: dependency trees for text t and hypothesis h, as above]
Subject Dependency Features • Combined and worst subject descendent relations • Combined and worst subject-to-verb paths • [Figure: dependency trees for text t and hypothesis h, as above]
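A sketch of a subject-to-verb path feature under the same hand-coded-parse assumption: the chain of dependency relations followed from a subject word up to its governing verb. The edges and relation labels for the hypothesis sentence are illustrative only.

```python
# (governor, relation, dependent) edges for "The cost of paper is rising" (hand-coded)
parse_h = [
    ("rising", "be", "is"),
    ("rising", "subj", "cost"),
    ("cost", "det", "The"),
    ("cost", "prep", "of"),
    ("of", "pcomp-n", "paper"),
]

def path_to_governor(edges, start, target):
    """Follow governor links from `start` up to `target`, collecting relation labels."""
    heads = {dep: (gov, rel) for gov, rel, dep in edges}
    path, node = [], start
    while node != target and node in heads:
        gov, rel = heads[node]
        path.append(rel)
        node = gov
    return path if node == target else None

print(path_to_governor(parse_h, "cost", "rising"))   # ['subj']
print(path_to_governor(parse_h, "paper", "rising"))  # ['pcomp-n', 'prep', 'subj']
```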
Other Dependency Features • Repeat these same features for: • Object • pcomp-n • Other descendent relations
Feature Analysis • All feature sets contribute, per cross validation on the training set • Most significant feature set: • Unigram stem-based word alignment • Most significant core repeated feature: • Average MLE
Questions • Dependency Path Based Entailment: mixture-of-experts classifier using corpus co-occurrence statistics • Moving in the direction of DIRT • Domain of interest: student response analysis in intelligent tutoring systems • [Figure: dependency trees for text t and hypothesis h; approach positioned between Bag of Words and Graph Matching]
Why Entailment? • Intelligent Tutoring Systems • Student Interaction Analysis • Are all aspects of the student’s answer entailed by the text and the gold-standard answer? • Are all aspects of the desired answer entailed by the student’s response?
Word Alignment Features • Unigram word alignment
Word Alignment Features • Bigram word alignment • Example: • <t>Newspapers choke on rising paper costs and falling revenue.</t> <h>The cost of paper is rising.</h> • MLE(cost, t) = n(“cost of”, “costs of”) / n(“costs of”) = 6086 / 35800 = 0.17
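A small sketch of that MLE, computed from document counts over a toy collection: the number of documents containing both bigrams divided by the number containing the conditioning (text-side) bigram. The mini-corpus is invented; the slide's real corpus counts (6086 and 35800) are shown at the end for comparison.

```python
# Sketch only: document-level co-occurrence MLE of a hypothesis bigram given a text bigram.

def contains_bigram(text, bigram):
    """True if the whitespace-tokenized text contains the bigram."""
    toks = text.lower().split()
    return any((a, b) == bigram for a, b in zip(toks, toks[1:]))

def bigram_mle(docs, h_bigram, t_bigram):
    """n(h_bigram, t_bigram) / n(t_bigram) over a document collection."""
    n_both = sum(contains_bigram(d, h_bigram) and contains_bigram(d, t_bigram) for d in docs)
    n_t = sum(contains_bigram(d, t_bigram) for d in docs)
    return n_both / n_t if n_t else 0.0

docs = ["the rising costs of paper hurt the cost of printing",
        "costs of newsprint keep climbing",
        "paper is cheap"]
print(bigram_mle(docs, ("cost", "of"), ("costs", "of")))  # 0.5 on this toy corpus
print(6086 / 35800)                                        # slide example, ≈ 0.17
```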
Word Alignment Features • Average unigram and bigram • Stem-based tokens
Corpora • 7.4M articles/docs, 2.5B words, 347 words/doc • Gigaword (Graff, 2003) • 5.7M articles, 2.1B words, 375 words/article • 77% of documents and 83% of indexed words • Reuters Corpus (Lewis et al., 2004) • 0.8M articles, 0.17B words, 213 words/article • TIPSTER • 0.9M articles, 0.26B words, 291 words/article