
Crowdsourcing Inference-Rule Evaluation (Naomi Zeichner, Jonathan Berant, Ido Dagan)

Presentation Transcript


  1. Crowdsourcing Inference-Rule Evaluation. Naomi Zeichner, Jonathan Berant, Ido Dagan. Bar Ilan University @ ACL 2012.

  2. Outline: We address Inference-Rule Evaluation, by Crowdsourcing Rule Application Annotation, allowing us to Empirically Compare Different Resources.


  4-7. Inference Rules – an important component in semantic applications.
  QA example: Q: Where was Reagan raised? A: Reagan was brought up in Dixon. The answer is reached via the rule X brought up in Y → X raised in Y.
  IE example (Hiring Event): from "Bob worked as an analyst for Dell", the rule X work as Y → X hired as Y lets the extractor fill PERSON = Bob and ROLE = analyst.
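
To make the role of such rules concrete, here is a minimal sketch, in Python, of how a rule table can bridge the lexical gap between a question and a text. The rule table, the extraction format, and the function name are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: an inference rule bridges the wording of a text and a question.
# The rule table, extraction format, and names below are illustrative assumptions.

RULES = {
    "X brought up in Y": "X raised in Y",  # X brought up in Y -> X raised in Y
    "X work as Y": "X hired as Y",         # X work as Y -> X hired as Y
}

def infer(extraction):
    """Expand one (x, template, y) extraction with everything the rules entail."""
    x, template, y = extraction
    templates = {template}                 # a statement trivially entails itself
    if template in RULES:
        templates.add(RULES[template])
    return {(x, t, y) for t in templates}

# Text: "Reagan was brought up in Dixon."  ->  ("Reagan", "X brought up in Y", "Dixon")
# Question: "Where was Reagan raised?"     ->  looking for ("Reagan", "X raised in Y", ?)
facts = infer(("Reagan", "X brought up in Y", "Dixon"))
assert ("Reagan", "X raised in Y", "Dixon") in facts   # answer: Dixon
```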

  8-14. Evaluation - What are the options?
  1. Impact on an end task (QA, IE, RTE). Pro: this is what interests an inference-system developer. Con: end-to-end systems have many components and address multiple phenomena, so it is hard to assess the effect of a single resource.
  2. Judge rule correctness directly, e.g. X reside in Y → X live in Y, X reside in Y → X born in Y, X criticize Y → X attack Y. Pro: theoretically the most intuitive option. Con: hard to do in practice, and often results in low inter-annotator agreement.
  3. Instance-based evaluation (Szpektor et al., 2007; Bhagat et al., 2007). Pro: simulates the utility of rules in an application and yields high inter-annotator agreement.

  15-24. Instance Based Evaluation – Decisions. Target: judge whether a rule application is valid or not.
  Rule: X teach Y → X explain to Y. LHS: Steve teaches kids. RHS: Steve explains to kids.
  Rule: X resides in Y → X born in Y. LHS: He resides in Paris. RHS: He born in Paris.
  Rule: X turn in Y → X bring in Y. LHS: humans turn in bed. RHS: humans bring in bed.
  Our goal: an evaluation that is robust and replicable.
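
Each item in an instance-based evaluation is a rule application: the rule's two sides instantiated with the arguments of a corpus instance that matched its left-hand side. A minimal sketch, assuming a simple X/Y template format (the function names are illustrative, not the authors' code):

```python
# Minimal sketch: building the LHS/RHS pair that annotators judge.
# The X/Y template format and function names are illustrative assumptions.

def instantiate(template, x, y):
    """Fill the X and Y slots of a template such as 'X teach Y'."""
    return template.replace("X", x).replace("Y", y)

def rule_application(rule_lhs, rule_rhs, x, y):
    """A corpus instance (x, y) matched rule_lhs; build the pair to be judged."""
    return {
        "lhs": instantiate(rule_lhs, x, y),  # what the text asserts
        "rhs": instantiate(rule_rhs, x, y),  # what the rule infers
    }

pair = rule_application("X teach Y", "X explain to Y", "Steve", "kids")
# pair == {"lhs": "Steve teach kids", "rhs": "Steve explain to kids"}
# (the slides show inflected surface forms, e.g. "Steve teaches kids")
```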

  25-28. Crowdsourcing. There is a recent trend of using crowdsourcing for annotation tasks. Previous works (Snow et al., 2008; Wang and Callison-Burch, 2010; Mehdad et al., 2010; Negri et al., 2011) focused on RTE text-hypothesis pairs and did not address annotation and evaluation of rules. Challenges: (1) simplify the task, (2) communicate the entailment decision to the annotators.

  29. Outline recap – next: Crowdsourcing Rule Application Annotation.

  30-38. Simplify Process: break the annotation into simple tasks.
  Task 1 – Is a phrase meaningful? Each side of a rule application is judged as a separate phrase: Steve teaches kids; Steve explains to kids; He resides in Paris; He born in Paris; humans turn in bed; humans bring in bed.
  Task 2 – Judge if one phrase is true given another, e.g. given "they observe holidays", is "they celebrate holidays" true? Given "He resides in Paris", is "He born in Paris" true? Given "Steve teaches kids", is "Steve explains to kids" true?
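
A minimal sketch of how the two simple tasks combine into one validity label per rule application. Labeling a meaningless right-hand side as non-entailment matches the output counts later in the talk; treating a meaningless left-hand side as a discard is my assumption, and the label names are illustrative.

```python
# Minimal sketch of the two-task flow. Label names and the handling of a
# meaningless LHS are illustrative assumptions.

def label_application(lhs, rhs, is_meaningful, is_entailed):
    """is_meaningful(phrase) comes from Task 1, is_entailed(lhs, rhs) from Task 2."""
    if not is_meaningful(lhs):
        return "discard"          # the text asserts nothing judgeable
    if not is_meaningful(rhs):
        return "non-entailment"   # the rule produced a meaningless phrase
    # Both sides are meaningful: Task 2 asks whether RHS is true given LHS.
    return "entailment" if is_entailed(lhs, rhs) else "non-entailment"

# Toy usage with hard-coded "annotators":
meaningful = {"He resides in Paris", "He born in Paris"}.__contains__
entailed = lambda lhs, rhs: False   # "born in" is not entailed by "resides in"
print(label_application("He resides in Paris", "He born in Paris",
                        meaningful, entailed))   # -> non-entailment
```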

  39-42. Communicate Entailment – via a Gold Standard.
  1. Educating: "confusing" examples are used as gold, with feedback given when Turkers get them wrong.
  2. Enforcing: unanimous examples are used as gold to estimate Turker reliability.
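
A minimal sketch of the "enforcing" step: each Turker is scored against the unanimously-judged gold items, and only sufficiently reliable Turkers are kept. The 0.8 threshold and the data structures are assumptions for illustration.

```python
# Minimal sketch: estimating Turker reliability from unanimous gold items.
# The 0.8 threshold and the data structures are illustrative assumptions.

def reliability(answers, gold):
    """answers, gold: dicts mapping item id -> label."""
    scored = [item for item in answers if item in gold]
    if not scored:
        return 0.0
    correct = sum(answers[item] == gold[item] for item in scored)
    return correct / len(scored)

def reliable_turkers(all_answers, gold, threshold=0.8):
    """all_answers: dict mapping Turker id -> that Turker's answers."""
    return {turker for turker, answers in all_answers.items()
            if reliability(answers, gold) >= threshold}
```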

  43-45. Communicate - Effect of Communication: 63% of annotations were judged unanimously, both among the annotators and with our own annotation.

  46. Outline recap – next: Empirically Compare Different Resources.

  47. Case Study – Data Set
  • Executed four entailment-rule learning methods on a set of 1B extractions produced by ReVerb (Fader et al., 2011)
  • Applied the learned rules to randomly sampled extractions to obtain 20,000 rule applications
  • Annotated each rule application using our framework

  48. Case Study – Algorithm Comparison [results chart]

  49. Case Study – Output
  • Task 1: 1,012 rule applications had a meaningful LHS but a meaningless RHS (labeled non-entailment); 8,264 had both sides judged meaningful and were passed on to Task 2.
  • Task 2: 2,447 positive entailment; 3,108 negative entailment.
  • Overall: 6,567 rule applications annotated, for $1,000 and about a week of annotation time.
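
The overall figure is simply the items that ended up with a final label: the Task 1 non-entailments plus the Task 2 judgments. A quick arithmetic check (my reconstruction, not from the slides):

```python
# Sanity check of the reported counts (my own arithmetic reconstruction).
task1_meaningless_rhs = 1012   # labeled non-entailment directly in Task 1
task2_positive = 2447
task2_negative = 3108

total_labeled = task1_meaningless_rhs + task2_positive + task2_negative
assert total_labeled == 6567   # matches the "Overall" figure on the slide
# Of the 8,264 pairs sent to Task 2, 2,447 + 3,108 = 5,555 received a final
# label; the remaining 2,709 are not part of the 6,567 annotated applications.
```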

  50. Summary
  A framework for crowdsourcing inference-rule evaluation:
  • Simplifies instance-based evaluation
  • Communicates the entailment decision to Turkers
  The proposed framework can be beneficial for resource developers and for inference-system developers.
  Crowdsourcing forms and annotated extractions can be found at BIU NLP downloads: http://www.cs.biu.ac.il/~nlp/downloads
