
Experiments with the Negotiated Boolean Queries of the TREC 2007 Legal Discovery Track

Stephen Tomlinson, Open Text Corporation, 2007 Nov 8



  1. Experiments with the Negotiated Boolean Queries of the TREC 2007 Legal Discovery Track Stephen Tomlinson Open Text Corporation 2007 Nov 8

  2. Overview • who won the boolean query “negotiations”? • can dropping the boolean operators improve on the boolean run’s Recall@B? • did the boolean keywords (synonyms) improve on the natural language request text? • can just relaxing the proximity constraints improve Recall@B? • can blind feedback improve Recall@B? • can a fusion of vector and boolean approaches improve Recall@B?

  3. 3 Boolean Queries • Defendant • initial boolean query proposed by the defendant • Plaintiff • rejoinder boolean query from the plaintiff • Final • final negotiated boolean query

  4. Topic 74: “All scientific studies expressly referencing health effects tied to indoor air quality.” Defendant: "health effect!" w/10 "air quality" Plaintiff: (scien! OR stud! OR research) AND ("air quality" OR health) Final: (scien! OR stud! OR research) AND ("air quality" w/15 health)
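The queries on this slide rely on two operators that the slides do not define. The sketch below assumes the usual legal-search reading: a trailing `!` truncates a term (so `stud!` matches "study" and "studies"), a quoted string is a phrase, and `A w/N B` requires a match of A within N words of a match of B. The tokenizer, the function names, and the start-to-start distance rule are illustrative assumptions, not the track's or the author's implementation.

```python
import re

def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple tokenizer (assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_positions(tokens, term):
    """Start positions where `term` matches.

    A trailing '!' on a word is treated as truncation (prefix match);
    a multi-word term is treated as a phrase of adjacent tokens.
    """
    words = term.lower().split()
    hits = []
    for i in range(len(tokens) - len(words) + 1):
        ok = True
        for j, w in enumerate(words):
            tok = tokens[i + j]
            ok = tok.startswith(w[:-1]) if w.endswith("!") else tok == w
            if not ok:
                break
        if ok:
            hits.append(i)
    return hits

def within(tokens, left, right, n):
    """True if some match of `left` starts within n tokens of some match of `right`."""
    lp = term_positions(tokens, left)
    rp = term_positions(tokens, right)
    return any(abs(a - b) <= n for a in lp for b in rp)

doc = tokenize("This study examines the health effects of poor indoor air quality")
print(within(doc, "health effect!", "air quality", 10))
```

Real systems may measure distance differently (for example between phrase ends rather than starts), so the boundary cases of this sketch should not be taken as authoritative.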

  5. Topic 74 Boolean Results Defendant: "health effect!" w/10 "air quality" • 2691 matches, 82% precision, 3% recall Plaintiff: (scien! OR stud! OR research) AND ("air quality" OR health) • 858,700 matches, 64% precision@25000 (ranked), 25% recall@25000 (ranked) Final: (scien! OR stud! OR research) AND ("air quality" w/15 health) • 20,516 matches, 77% precision, 22% recall
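As a rough consistency check on these figures, precision and recall can be combined to back out the number of relevant documents each result implies. This is only a sketch: the percentages on the slide are rounded (and, in the track, estimated from sampled judgments), so the implied totals are approximate. The function names are hypothetical.

```python
def precision(relevant_retrieved, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0

def recall(relevant_retrieved, total_relevant):
    """Fraction of all relevant documents that were retrieved."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0

def implied_total_relevant(matches, prec, rec):
    """Relevant-document total implied by a (matches, precision, recall) triple.

    relevant_retrieved = matches * prec, and recall = relevant_retrieved / total,
    so total = matches * prec / rec.
    """
    return matches * prec / rec

# Defendant and Final queries for Topic 74, using the rounded slide values.
print(round(implied_total_relevant(2691, 0.82, 0.03)))
print(round(implied_total_relevant(20516, 0.77, 0.22)))
```

Both queries imply a relevant set of the same order of magnitude, which is what one would expect if the slide's numbers share a single per-topic estimate of relevant documents.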

  6. Topic 74: Missed Relevant Documents Final Boolean: (scien! OR stud! OR research) AND ("air quality" w/15 health) Passages in Missed Relevant Documents: • “… Lowrey A.H. (1980). Indoor air pollution …” • “assessment … entitled “Respiratory Health Effects of Passive Smoking …” • “study … funded by the Center for Indoor Air Research”

  7. Defendant vs. Final Boolean: Precision • Def. Boolean won 20 • Boolean won 22 • (1 tied) Mean in (-0.09, 0.15) Topic 63: 1.00 vs. 0.02 (sugar contract) Topic 69: 0.00 vs. 0.97 (indoor smoke ventilation)
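The "won / tied / Mean in (…)" summaries on this and the following comparison slides can be reproduced in spirit from per-topic score pairs. The sketch below assumes the interval is an approximate 95% confidence interval for the mean per-topic difference, computed with a normal approximation; the slides do not state the method, so treat that as an assumption. The function name and sample scores are hypothetical.

```python
import math
from statistics import mean, stdev

def compare_runs(a, b, z=1.96):
    """Per-topic comparison of run `a` against run `b`.

    Returns (wins for a, wins for b, ties, (low, high)), where the interval
    is mean(a - b) +/- z * standard error (normal approximation, assumed).
    """
    diffs = [x - y for x, y in zip(a, b)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    ties = sum(d == 0 for d in diffs)
    m = mean(diffs)
    half_width = z * stdev(diffs) / math.sqrt(len(diffs))
    return wins, losses, ties, (m - half_width, m + half_width)

# Made-up per-topic precision scores for two runs.
run_a = [1.0, 0.0, 0.5, 0.5]
run_b = [0.02, 0.97, 0.5, 0.25]
print(compare_runs(run_a, run_b))
```

An interval that straddles zero, as on this slide, means the mean difference is not distinguishable from zero under this approximation, even when individual topics differ sharply.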

  8. Defendant vs. Final Boolean: Recall • Def. Boolean won 0 • Boolean won 42 • (1 tied) Mean in (-0.27, -0.11) Topic 77: 0.00 vs. 0.00 (smoke NOT tobacco) Topic 52: 0.00 vs. 0.98 (boosting crop yields)

  9. Plaintiff vs. Final Boolean: Recall@25000 • Pl. Boolean won 35 • Boolean won 6 • (2 tied) Mean in (0.03, 0.19) Topic 59: 0.76 vs. 0.01 (limestone treatment) Topic 58: 0.24 vs. 0.94 (phosphates and health)

  10. Plaintiff vs. Final Boolean: Recall@B • Pl. Boolean won 15 • Boolean won 27 • (1 tied) Mean in (-0.09, 0.04) Topic 63: 0.73 vs. 0.27 (sugar contract) Topic 58: 0.18 vs. 0.94 (phosphates and health)
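Recall@B is recall measured at a cutoff depth B. The sketch below assumes, as in the TREC Legal Track setup, that B for a topic is the number of documents matched by the final negotiated boolean query, so a ranked run is compared with the boolean run at equal result-set size. It uses exhaustive relevance judgments for simplicity, whereas the track's values were estimated from samples; the function name is hypothetical.

```python
def recall_at(ranked_ids, relevant_ids, depth):
    """Recall of the top `depth` documents of a ranked list."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    found = sum(1 for d in ranked_ids[:depth] if d in relevant)
    return found / len(relevant)

ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "d", "z"}
print(recall_at(ranked, relevant, depth=2))
print(recall_at(ranked, relevant, depth=4))
```

For the unranked final boolean run itself, Recall@B is simply its recall, since it returns exactly B documents.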

  11. Vector vs. Boolean (Example) Boolean: (scien! OR stud! OR research) AND ("air quality" w/15 health) Vector: scien! OR stud! OR research OR air OR quality OR health
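The transformation in this example can be sketched as: drop the proximity operators, parentheses, quotes, and boolean keywords, then OR together whatever terms remain. The helper below is a hypothetical illustration that reproduces the slide's example; how the actual vector run treated `NOT` clauses is not stated, and this sketch simply keeps the negated term as an ordinary term.

```python
import re

def boolean_to_vector(query):
    """Flatten a boolean query into a flat OR of its terms."""
    # Remove proximity operators such as w/15.
    q = re.sub(r"\bw/\d+\b", " ", query)
    # Remove grouping and phrase punctuation, splitting phrases into words.
    q = re.sub(r'[()"]', " ", q)
    terms = []
    for tok in q.split():
        if tok in {"AND", "OR", "NOT"}:
            continue  # drop the boolean operators themselves
        if tok not in terms:
            terms.append(tok)  # keep first occurrence, preserve order
    return " OR ".join(terms)

print(boolean_to_vector('(scien! OR stud! OR research) AND ("air quality" w/15 health)'))
```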

  12. Relevance Ranking • term frequency dampening (BM25) • wildcard variants treated as same term • for boolean proximity constraints, only count term occurrences satisfying proximity • metadata + ocr included in document length • inverse document frequency (log) • based on most common variant for wildcards
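A minimal sketch of how these ranking ingredients might fit together, assuming a textbook BM25 term weight. The exact formula and parameters of the system are not given on the slide: `k1=1.2` and `b=0.75` are conventional defaults, the idf form is one common variant, and both function names are hypothetical. The wildcard handling follows the two bullets about variants: occurrences of all variants are pooled for term frequency, and the document frequency of the most common variant is used for idf.

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, df, n_docs, k1=1.2, b=0.75):
    """Textbook BM25 weight for one query term in one document."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)  # log-based idf
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)       # length normalization
    return idf * tf * (k1 + 1.0) / (tf + norm)              # dampened tf

def wildcard_stats(variant_tfs, variant_dfs):
    """Collapse the variants matched by a wildcard into one term.

    tf: total occurrences of any variant in the document.
    df: document frequency of the most common variant.
    """
    return sum(variant_tfs.values()), max(variant_dfs.values())

# "stud!" matching two variants in one document (made-up counts).
tf, df = wildcard_stats({"study": 2, "studies": 1}, {"study": 500, "studies": 300})
print(bm25_term_score(tf, doc_len=1000, avg_doc_len=1000, df=df, n_docs=100000))
```

Dampening means each additional occurrence adds less than the previous one, so a long document that repeats a term many times does not dominate the ranking. Per the slide, `doc_len` would count metadata plus OCR text, and for proximity queries `tf` would count only occurrences that satisfy the proximity constraint.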

  13. Vector vs. Boolean: Recall@B • Vector won 16 • Boolean won 26 • (1 tied) Mean in (-0.13, 0.02) Topic 63: 0.79 vs. 0.27 (sugar contract) Topic 58: 0.08 vs. 0.94 (phosphates and health)

  14. Topic 58: “… health problems caused by HPF …” Vector R@B=0.08, Boolean R@B=0.94 • (B=8183, estRel = 1151) Phosphat! w/75 (caus! OR relat! OR assoc! OR derive! OR correlat!) w/75 (health OR disorder! OR toxic! OR "chronic fatigue" OR dysfunction! OR irregular OR memor! OR immun! OR myopath! OR liver! OR kidney! OR heart! OR depress! OR loss OR lost) • vector matches often didn’t mention “Phosphat!”

  15. Topic 72: “… chemical process(es) which result in onions … making persons cry” Vector R@B=0.03, Boolean R@B=0.78 • (B=119, estRel = 98) ((scien! OR research! OR chemical) w/25 onion!) AND (cries OR cry! OR tear!) • proximity clause found some long documents with just one reference to onions’ effects

  16. Topic 63: “… exclusivity clause in a sugar contract …” Vector R@B=0.79, Boolean R@B=0.27 • (B=294, estRel = 18) (Sugar w/20 (contract! OR agreement! OR deal!)) AND exclusiv! • boolean missed “U.S. sugar quota law”

  17. Request vs. Vector: R@25000 • Req. Vector won 21 • Vector won 22 • (0 tied) Mean in (0.00, 0.13) Topic 87: 1.00 vs. 0.13 (SEC reporting) Topic 84: 0.64 vs. 0.91 (1960s films)

  18. Impact of Doubling Proximity Distances: Recall@B • 2x-Prox Boolean won 14 • Boolean won 8 • (21 tied) Mean in (-0.03, 0.02) Topic 61: 0.49 vs. 0.44 (waste treatment) Topic 72: 0.39 vs. 0.78 (onions effect)
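Doubling the proximity distances is a purely syntactic rewrite of the query. A minimal sketch, assuming proximity operators always appear in the `w/N` form used on these slides (the function name is hypothetical):

```python
import re

def scale_proximity(query, factor=2):
    """Multiply every w/N proximity distance in a query by `factor`."""
    return re.sub(
        r"\bw/(\d+)\b",
        lambda m: f"w/{int(m.group(1)) * factor}",
        query,
    )

print(scale_proximity("((scien! OR research! OR chemical) w/25 onion!) AND (cries OR cry! OR tear!)"))
```

A query with no `w/N` operator is returned unchanged by this rewrite.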

  19. Impact of Blind Feedback: Recall@B • Boolean+BF won 16 • Boolean won 21 • (6 tied) Mean in (-0.12, 0.03) Topic 90: 0.64 vs. 0.10 (sales in England) Topic 58: 0.01 vs. 0.94 (phosphates and health)
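Blind (pseudo-relevance) feedback assumes the top-ranked documents are relevant and expands the query with terms drawn from them. The slide does not describe the specific method used, so the following is only a generic sketch with hypothetical names and parameters: it picks the terms that occur in the most top-ranked documents and are not already in the query.

```python
from collections import Counter

def blind_feedback_terms(ranked_docs, query_terms, top_docs=5, new_terms=3):
    """Pick expansion terms from the top-ranked documents.

    ranked_docs: list of token lists, best document first.
    Returns up to `new_terms` terms ordered by how many of the top
    documents contain them (ties broken alphabetically).
    """
    counts = Counter()
    for doc in ranked_docs[:top_docs]:
        counts.update(set(doc))  # count each term once per document
    for t in query_terms:
        counts.pop(t, None)      # do not re-add existing query terms
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ordered[:new_terms]]

docs = [
    ["phosphate", "fertilizer", "health"],
    ["phosphate", "fertilizer", "crop"],
    ["fertilizer", "crop", "yield"],
]
print(blind_feedback_terms(docs, ["phosphate", "health"], top_docs=3, new_terms=2))
```

Because the expansion terms come from whatever ranks highly, feedback can pull a query away from its original focus when the initial top documents are off-topic.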

  20. Fusion of Boolean, Request and Vector: Recall@B • Fusion won 20 • Boolean won 20 • (3 tied) Mean in (-0.08, 0.03) Topic 65: 0.88 vs. 0.67 (candy packaging) Topic 58: 0.10 vs. 0.94 (phosphates and health)
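The slide does not say how the three runs were combined. As one common illustration of rank fusion, and explicitly not the author's method, the sketch below uses reciprocal rank fusion: each document receives the sum of `1 / (k + rank)` over the input rankings, with `k=60` as a conventional constant. The function name and document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Documents ranked highly in several lists accumulate the largest
    scores; ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

boolean_run = ["d1", "d2", "d3"]
request_run = ["d2", "d1", "d4"]
vector_run = ["d2", "d5", "d1"]
print(reciprocal_rank_fusion([boolean_run, request_run, vector_run]))
```

To compare a fused ranking with the boolean run at Recall@B, it would be truncated to the top B documents.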

  21. Conclusions • final negotiated boolean query often had substantially lower recall than the plaintiff boolean query • boolean operators (AND, proximity) often have value • blind feedback and fusion did not improve the boolean run’s Recall@B (on average)
