RTE @ Stanford Rajat Raina, Aria Haghighi, Christopher Cox, Jenny Finkel, Jeff Michels, Kristina Toutanova, Bill MacCartney, Marie-Catherine de Marneffe, Christopher D. Manning and Andrew Y. Ng PASCAL Challenges Workshop April 12, 2005
Our approach • Represent sentences using syntactic dependencies • But also use semantic annotations. • Try to handle language variability. • Perform semantic inference over this representation • Use linguistic knowledge sources. • Compute a “cost” for inferring the hypothesis from the text. Low cost ⇒ Hypothesis is entailed.
Outline of this talk • Representation of sentences • Syntax: Parsing and post-processing • Adding annotations on representation (e.g., semantic roles) • Inference by graph matching • Inference by abductive theorem proving • A combined system • Results and error analysis
Sentence processing • Parse with a standard PCFG parser. [Klein & Manning, 2003] • Normalize variant spellings before parsing, e.g., Al Qaeda: [Aa]l[ -]Qa’?[ie]da • Train on some extra sentences from recent news. • Used a high-performing Named Entity Recognizer (next slide) • Force the parse tree to be consistent with certain NE tags. Example: American Ministry of Foreign Affairs announced that Russia called the United States... (S (NP (NNP American_Ministry_of_Foreign_Affairs)) (VP (VBD announced) (…)))
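As a rough illustration of the spelling-normalization step (not the system's actual code), here is a minimal Python sketch that applies the Al Qaeda pattern from the slide before parsing; the table and function names are illustrative.

```python
import re

# Illustrative normalization table: each entry maps a variant-spelling pattern to a
# canonical token. The Al Qaeda pattern is the one shown on the slide (both straight
# and curly apostrophes are allowed here for robustness).
NORMALIZATIONS = [
    (re.compile(r"[Aa]l[ -]Qa['’]?[ie]da"), "Al_Qaeda"),
]

def normalize(sentence: str) -> str:
    """Rewrite known variant spellings to a single canonical form before parsing."""
    for pattern, canonical in NORMALIZATIONS:
        sentence = pattern.sub(canonical, sentence)
    return sentence

print(normalize("Al-Qaida claimed responsibility."))  # Al_Qaeda claimed responsibility.
```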
Named Entity Recognizer • Trained a robust conditional random field model. [Finkel et al., 2003] • Interpretation of numeric quantity statements Example: T: Kessler's team conducted 60,643 face-to-face interviews with adults in 14 countries. H: Kessler's team interviewed more than 60,000 adults in 14 countries. TRUE Annotate numerical values implied by: • “6.2 bn”, “more than 60000”, “around 10”, … • MONEY/DATE named entities
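The numeric-quantity interpretation might look something like the following sketch, which maps strings such as "6.2 bn" or "more than 60,000" to a comparison operator and a numeric value; the regular expressions and multiplier table are illustrative guesses, not the system's implementation.

```python
import re

# Illustrative multipliers for abbreviated magnitudes.
MULTIPLIERS = {"bn": 1e9, "billion": 1e9, "m": 1e6, "million": 1e6, "k": 1e3, "thousand": 1e3}

def parse_quantity(text: str):
    """Return (comparison, value) for strings like '6.2 bn' or 'more than 60,000'."""
    comparison = "="
    if re.search(r"\bmore than\b|\bover\b", text):
        comparison = ">"
    elif re.search(r"\bless than\b|\bunder\b", text):
        comparison = "<"
    elif re.search(r"\baround\b|\babout\b|\bapproximately\b", text):
        comparison = "~"
    m = re.search(r"([\d][\d.,]*)\s*(bn|billion|million|thousand|[mk])?\b", text)
    if not m:
        return None
    value = float(m.group(1).rstrip(".,").replace(",", ""))
    if m.group(2):
        value *= MULTIPLIERS[m.group(2).lower()]
    return comparison, value

print(parse_quantity("more than 60,000"))  # ('>', 60000.0)
print(parse_quantity("6.2 bn"))            # ('=', 6200000000.0)
```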
Parse tree post-processing • Recognize collocations using WordNet Example: Shrek 2 rang up $92 million. (S (NP (NNP Shrek) (CD 2)) (VP (VBD rang_up) (NP (QP ($ $) (CD 92) (CD million)))) (. .)) The money phrase is annotated as MONEY with value 92,000,000.
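A minimal sketch of collocation recognition with WordNet, assuming NLTK and its WordNet data are available; the greedy pairwise merging is a simplification, not necessarily what the parser post-processor actually does.

```python
from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def merge_collocations(tokens):
    """Greedily join adjacent token pairs whose joined lemma is a WordNet entry."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            head = lemmatizer.lemmatize(tokens[i].lower(), pos="v")
            candidate = f"{head}_{tokens[i + 1].lower()}"
            if wn.synsets(candidate):  # e.g. wn.synsets("ring_up") is non-empty
                out.append(f"{tokens[i]}_{tokens[i + 1]}")
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return out

print(merge_collocations(["Shrek", "2", "rang", "up", "$", "92", "million"]))
# ['Shrek', '2', 'rang_up', '$', '92', 'million']
```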
Basic representations (parse tree → dependencies) • Find syntactic dependencies: transform parse tree representations into typed syntactic dependencies, including a certain amount of collapsing and normalization. Example: Bill’s mother walked to the grocery store. subj(walked, mother) poss(mother, Bill) to(walked, store) nn(store, grocery) • Dependencies can also be written as a logical formula: mother(A) ∧ Bill(B) ∧ poss(B, A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)
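For concreteness, here is a small sketch of turning typed dependencies into a flat logical formula. It keeps each dependency as a binary atom rather than collapsing the verb's arguments into walked(E, A, C) as the slide does; the class and variable-naming scheme are illustrative.

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Atom:
    predicate: str
    args: tuple

    def __str__(self):
        return f"{self.predicate}({', '.join(self.args)})"

def dependencies_to_formula(content_words, dependencies):
    """Give each content word a fresh variable and each typed dependency a relation atom."""
    fresh = (chr(c) for c in count(ord("A")))   # A, B, C, ... (fine for short sentences)
    var = {word: next(fresh) for word in content_words}
    atoms = [Atom(word, (v,)) for word, v in var.items()]
    atoms += [Atom(rel, (var[head], var[dep])) for rel, head, dep in dependencies]
    return " ∧ ".join(str(a) for a in atoms)

print(dependencies_to_formula(
    ["mother", "Bill", "grocery", "store", "walked"],
    [("poss", "mother", "Bill"), ("nn", "store", "grocery"),
     ("subj", "walked", "mother"), ("to", "walked", "store")]))
```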
Representations [Figure: dependency graph for "Bill’s mother walked to the grocery store" with edges subj, poss, to, nn, plus POS (VBD), named-entity (PERSON) and semantic-role (ARGM-LOC) annotations; logical formula mother(A) ∧ Bill(B) ∧ poss(B, A) ∧ grocery(C) ∧ store(C) ∧ walked(E, A, C)] • Can make the representation richer: • “walked” is a verb (VBD). • “Bill” is a PERSON (named entity). • “store” is the location/destination (ARGM-LOC) of “walked”. • …
Annotations • Parts-of-speech, named entities • Already computed. • Semantic roles Example: T: C and D Technologies announced that it has closed the acquisition of Datel, Inc. H1: C and D Technologies acquired Datel Inc. TRUE H2: Datel acquired C and D Technologies. FALSE • Use a state-of-the-art semantic role classifier to label verb arguments. [Toutanova et al. 2005]
More annotations • Coreference Example: T: Since its formation in 1948, Israel … H: Israel was established in 1948. TRUE • Use a conditional random field model for coreference detection. Note: Appositive “references” were previously detected. T: Bush, the President of USA, went to Florida. H: Bush is the President of USA. TRUE • Other annotations • Word stems (very useful) • Word senses (no performance gain in our system)
Event nouns • Use a heuristic to find event nouns • Augment the text representation using WordNet derivational links. Example: T: … witnessed the murder of police commander ... H: Police officer killed. TRUE Text logical formula: murder(M) ∧ police_commander(P) ∧ of(M, P). Augmented via the NOUN → VERB derivational link: murder(E, M, P)
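A sketch of following WordNet derivational links with NLTK (assuming its WordNet data is installed); this is one way to recover verb forms of event nouns, not necessarily the heuristic used in the system.

```python
from nltk.corpus import wordnet as wn

def verb_forms(noun: str):
    """Collect verb lemmas reachable from a noun via WordNet derivational links."""
    verbs = set()
    for synset in wn.synsets(noun, pos=wn.NOUN):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == "v":
                    verbs.add(related.name())
    return verbs

print(verb_forms("murder"))     # expected to include the verb 'murder'
print(verb_forms("formation"))  # expected to include the verb 'form'
```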
Outline of this talk • Representation of sentences • Syntax: Parsing and post-processing • Adding annotations on representation (e.g., semantic roles) • Inference by graph matching • Inference by abductive theorem proving • A combined system • Results and error analysis
Graph Matching Approach [Figure: toy example graphs. T graph: bought, with ARG0(Agent) John (PERSON) and ARG1(Theme) BMW; H graph: purchased, with ARG0(Agent) John (PERSON) and ARG1(Theme) car.] • Why graph matching? • The dependency tree has a natural graphical interpretation • Successful in other domains: e.g., lossy image matching • Input: Hypothesis (H) and Text (T) graphs (toy example in the figure) • Vertices are words and phrases • Edges are labeled dependencies • Output: cost of matching H to T (next slide)
Graph Matching: Idea [Figure: matching from the hypothesis graph H (purchased, John, car) to the text graph T (bought, John, BMW)] • Idea: Align H to T so that vertices are similar and relations are preserved (as in machine translation) • A matching M is a mapping from vertices of H to vertices of T. Thus, for each vertex v in H, M(v) is a vertex in T.
Graph Matching: Costs • The cost of a matching, MatchCost(M), measures the “quality” of matching M • VertexCost(M) – Compare vertices in H with matched vertices in T • RelationCost(M) – Compare edges (relations) in H with corresponding edges (relations) in T • MatchCost(M) = (1 − β) · VertexCost(M) + β · RelationCost(M)
Graph Matching: Costs • VertexCost(M) For each vertex v in H, and vertex M(v) in T: • Do vertex heads share same stem and/or POS ? • Is T vertex head a hypernym of H vertex head? • Are vertex heads “similar” phrases? (next slide) • RelationCost(M) For each edge (v,v’) in H, and edge (M(v),M(v’)) in T • Are parent/child pairs in H parent/child in T ? • Are parent/child pairs in H ancestor/descendant in T ? • Do parent/child pairs in H share a common ancestor in T?
Digression: Phrase similarity • Measures based on WordNet (Resnik/Lesk). • Distributional similarity • Example: “run” and “marathon” are related. • Latent Semantic Analysis to discover words that are distributionally similar (i.e., have common neighbors). • Used a web-search based measure • Query google.com for all pages with: • “run” • “marathon” • Both “run” and “marathon” • Learning paraphrases. [Similar to DIRT: Lin and Pantel, 2001] • “World knowledge” (labor intensive) • CEO = Chief_Executive_Officer • Philippines → Filipino • [Can add common facts: “Paris is the capital of France”, …]
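One plausible way to combine the three page counts is a PMI-style score; the slide only says the counts were obtained from google.com, so the formula below and the hit_count callable are assumptions for illustration.

```python
import math

def web_similarity(w1: str, w2: str, hit_count, total_pages: float = 1e10) -> float:
    """PMI-style relatedness from page counts.

    `hit_count(query)` is a hypothetical callable returning the number of web pages
    matching the query (e.g. a search API or cached counts); `total_pages` is a rough
    estimate of the size of the indexed web.
    """
    both = hit_count(f'"{w1}" "{w2}"')   # pages containing both words
    n1 = hit_count(f'"{w1}"')            # pages containing w1
    n2 = hit_count(f'"{w2}"')            # pages containing w2
    if both == 0 or n1 == 0 or n2 == 0:
        return 0.0
    return math.log((both / total_pages) / ((n1 / total_pages) * (n2 / total_pages)))
```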
Graph Matching: Example [Figure: matching the hypothesis graph (purchased, John, car) to the text graph (bought, John, BMW); vertex substitution costs: exact match 0.0, synonym match 0.2, hypernym match 0.4] VertexCost: (0.0 + 0.2 + 0.4)/3 = 0.2 RelationCost: 0 (graphs isomorphic) With β = 0.45 (say): MatchCost = 0.55 × 0.2 + 0.45 × 0.0 = 0.11
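A minimal sketch of the matching cost on this toy example. The RelationCost here only checks direct parent/child preservation (a simplification of the criteria two slides back), and the substitution costs are the illustrative 0.0 / 0.2 / 0.4 values above.

```python
def vertex_cost(matching, sub_cost):
    """Average substitution cost over hypothesis vertices (0 = exact match)."""
    costs = [sub_cost[(h, t)] for h, t in matching.items()]
    return sum(costs) / len(costs)

def relation_cost(matching, h_edges, t_edges):
    """Fraction of hypothesis edges whose images are not parent/child edges in T."""
    missed = sum((matching[a], matching[b]) not in t_edges for a, b in h_edges)
    return missed / len(h_edges)

def match_cost(matching, sub_cost, h_edges, t_edges, beta=0.45):
    return ((1 - beta) * vertex_cost(matching, sub_cost)
            + beta * relation_cost(matching, h_edges, t_edges))

# Toy example: exact match 0.0, synonym 0.2, hypernym 0.4 (values from the figure).
matching = {"purchased": "bought", "John": "John", "car": "BMW"}
sub_cost = {("purchased", "bought"): 0.2, ("John", "John"): 0.0, ("car", "BMW"): 0.4}
h_edges = {("purchased", "John"), ("purchased", "car")}
t_edges = {("bought", "John"), ("bought", "BMW")}
print(match_cost(matching, sub_cost, h_edges, t_edges))  # ≈ 0.11
```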
Outline of this talk • Representation of sentences • Syntax: Parsing and post-processing • Adding annotations on representation (e.g., semantic roles) • Inference by graph matching • Inference by abductive theorem proving • A combined system • Results and error analysis
Abductive inference • Idea: • Represent the text and hypothesis as logical formulae. • The hypothesis can be inferred from the text if and only if the hypothesis logical formula can be proved from the text logical formula. • Toy example: prove the hypothesis formula from the text formula, allowing assumptions at various “costs”: BMW(t) + $2 ⇒ car(t), bought(p, q, r) + $1 ⇒ purchased(p, q, r)
Abductive assumptions • Assign costs to all assumptions of the form: P(p1, p2, …, pm) ⇒ Q(q1, q2, …, qn) • Build an assumption cost model
Abductive theorem proving • Each assumption provides a potential proof step. • Find the proof with the minimum total cost (uniform-cost search). • If there is a low-cost proof, the hypothesis is entailed. • Example: T: John(A) ∧ BMW(B) ∧ bought(E, A, B) H: John(x) ∧ car(y) ∧ purchased(z, x, y) Here is a possible proof by resolution refutation (with the earlier assumption costs):
$0: ¬John(x) ∨ ¬car(y) ∨ ¬purchased(z, x, y) [Given: negation of the hypothesis]
$0: ¬car(y) ∨ ¬purchased(z, A, y) [Unify with John(A)]
$2: ¬purchased(z, A, B) [Unify with car(B), derived from BMW(B) by the $2 assumption]
$3: NULL [Unify with purchased(E, A, B), derived from bought(E, A, B) by the $1 assumption]
Proof cost = 3
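To make the uniform-cost search concrete, here is a toy sketch that ignores variable unification and the resolution machinery entirely: states just track which hypothesis predicate symbols remain to be derived, either directly from the text or via a costed assumption. It reproduces the cost of 3 on the toy example, but it is an illustration, not the prover.

```python
import heapq
from itertools import count

# Illustrative toy knowledge; the assumption costs match the earlier slide.
TEXT_FACTS = {"John", "BMW", "bought"}                         # predicates asserted by T
ASSUMPTIONS = {("BMW", "car"): 2, ("bought", "purchased"): 1}  # P => Q at a dollar cost

def proof_cost(hypothesis_preds):
    """Uniform-cost search over sets of unproved hypothesis predicates."""
    tie = count()                                # tie-breaker so the heap never compares sets
    start = frozenset(hypothesis_preds)
    frontier = [(0, next(tie), start)]
    best = {start: 0}
    while frontier:
        cost, _, goals = heapq.heappop(frontier)
        if not goals:
            return cost                          # all goals resolved: proof found
        goal = next(iter(goals))                 # resolve one goal at a time
        steps = []
        if goal in TEXT_FACTS:
            steps.append(0)                      # matches the text directly, no assumption
        steps += [c for (p, q), c in ASSUMPTIONS.items()
                  if q == goal and p in TEXT_FACTS]
        for step_cost in steps:
            nxt = goals - {goal}
            if cost + step_cost < best.get(nxt, float("inf")):
                best[nxt] = cost + step_cost
                heapq.heappush(frontier, (cost + step_cost, next(tie), nxt))
    return float("inf")

print(proof_cost({"John", "car", "purchased"}))  # 3, as in the proof above
```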
Abductive theorem proving • Can automatically learn good assumption costs • Start from a labeled dataset (e.g.: the PASCAL development set) • Intuition: Find assumptions that are used in the proofs for TRUE examples, and lower their costs (by framing a log-linear model). Iterate. [Details: Raina et al., in submission]
Some interesting features • Examples of handling “complex” constructions in graph matching/abductive inference. • Antonyms/Negation: high cost for matching verbs if they are antonyms, or if one is negated and the other is not. • T: Stocks fell. H: Stocks rose. FALSE • T: Clinton’s book was not a hit. H: Clinton’s book was a hit. FALSE • Non-factive verbs: • T: John was charged for doing X. H: John did X. FALSE We can detect this because “doing” in the text has the non-factive verb “charged” as a parent, but “did” in the hypothesis has no such parent.
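A sketch of the antonym/negation check using WordNet's antonym links via NLTK; the penalty constant is an arbitrary illustrative value, not the cost used in the system.

```python
from nltk.corpus import wordnet as wn

def are_antonyms(verb1: str, verb2: str) -> bool:
    """True if any WordNet verb sense of verb1 lists verb2 as an antonym."""
    for synset in wn.synsets(verb1, pos=wn.VERB):
        for lemma in synset.lemmas():
            if any(ant.name() == verb2 for ant in lemma.antonyms()):
                return True
    return False

def verb_match_cost(verb1, negated1, verb2, negated2, base_cost=0.0, penalty=5.0):
    """Add a large penalty when the verbs are antonyms or exactly one side is negated.
    The penalty value here is an arbitrary illustrative constant."""
    if are_antonyms(verb1, verb2) or (negated1 != negated2):
        return base_cost + penalty
    return base_cost

print(are_antonyms("fall", "rise"))  # expected True in WordNet
```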
Some interesting features • “Superlative check” • T: This is the tallest tower in western Japan. H: This is the tallest tower in Japan. FALSE
Outline of this talk • Representation of sentences • Syntax: Parsing and post-processing • Adding annotations on representation (e.g., semantic roles) • Inference by graph matching • Inference by abductive theorem proving • A combined system • Results and error analysis
Results • Combine the inference methods • Each system produces a score. Separately normalize each system’s score variance; suppose the normalized scores are s1 and s2. • Final score S = w1·s1 + w2·s2 • Learn the classifier weights w1 and w2 on the development set using logistic regression. Two submissions: • Train one set of classifier weights for all RTE tasks. (General) • Train different classifier weights for each RTE task. (ByTask)
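A modern stand-in for this combination step using scikit-learn (which the 2005 system obviously did not use): variance-normalize each system's scores on the development set, then fit a logistic regression to obtain the weights w1 and w2.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def combine_systems(dev_scores, dev_labels, test_scores):
    """Variance-normalize each system's scores, then learn combination weights on dev data.

    dev_scores, test_scores: arrays of shape (n_examples, 2), one column per system;
    dev_labels: 1 for TRUE (entailed), 0 for FALSE.
    """
    dev_scores = np.asarray(dev_scores, dtype=float)
    test_scores = np.asarray(test_scores, dtype=float)
    std = dev_scores.std(axis=0)                    # per-system score normalization
    clf = LogisticRegression().fit(dev_scores / std, dev_labels)
    weights = clf.coef_[0]                          # w1, w2 (plus clf.intercept_)
    confidence = clf.predict_proba(test_scores / std)[:, 1]
    return confidence, weights
```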
Results • Best other results: Accuracy = 58.6%, CWS = 0.617 • Balanced predictions: our two submissions predicted TRUE on 55.4% and 51.2% of the test set.
Partial coverage results [Figure: coverage-CWS curves for the General and ByTask submissions; task-specific optimization (ByTask) seems better] • Can also draw coverage-CWS curves. For example: • at 50% coverage, CWS = 0.781 • at 25% coverage, CWS = 0.873
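For reference, a sketch of computing CWS over the most confident fraction of predictions, following the PASCAL RTE-1 scoring rule (running accuracy averaged down the confidence ranking); restricting to the top-ranked fraction is one reading of "coverage".

```python
def cws(correct, confidence, coverage=1.0):
    """Confidence-weighted score over the most confident `coverage` fraction of predictions.

    `correct` holds 1/0 per example; examples are ranked by decreasing confidence and the
    running accuracy at each rank is averaged.
    """
    ranked = [c for _, c in sorted(zip(confidence, correct), key=lambda x: -x[0])]
    n = max(1, int(round(coverage * len(ranked))))
    running, total = 0, 0.0
    for i, c in enumerate(ranked[:n], start=1):
        running += c
        total += running / i
    return total / n

# Points on a coverage-CWS curve, e.g.:
# cws(correct, confidence, 0.5)   # CWS at 50% coverage
# cws(correct, confidence, 0.25)  # CWS at 25% coverage
```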
Some interesting issues • Phrase similarity • away from the coast → farther inland • won victory in presidential election → became President • stocks get a lift → stocks rise • life threatening → fatal • Dictionary definitions • believe there is only one God → are monotheistic • “World knowledge” • K Club, venue of the Ryder Cup, … → K Club will host the Ryder Cup
Future directions • Need more NLP components in there: • Better treatment of frequent nominalizations, parenthesized material, etc. • Need much more ability to do inference • Fine distinctions between meanings, and fine similarities. e.g., “reach a higher level” and “rise” We need a high-recall, reasonable precision similarity measure! • Other resources (e.g., antonyms) are also very sparse. • More task-specific optimization.