Experiments with the Negotiated Boolean Queries of the TREC 2007 Legal Discovery Track
Stephen Tomlinson
Open Text Corporation
2007 Nov 8

Overview
• who won the boolean query “negotiations”?
• can dropping the boolean operators improve on the boolean run’s Recall@B?
• did the boolean keywords (synonyms) improve on the natural language request text?
• can just relaxing the proximity constraints improve Recall@B?
• can blind feedback improve Recall@B?
• can a fusion of vector and boolean approaches improve Recall@B?

3 Boolean Queries
• Defendant: initial boolean query proposed by the defendant
• Plaintiff: rejoinder boolean query from the plaintiff
• Final: final negotiated boolean query

Topic 74: “All scientific studies expressly referencing health effects tied to indoor air quality.”
Defendant: "health effect!" w/10 "air quality"
Plaintiff: (scien! OR stud! OR research) AND ("air quality" OR health)
Final: (scien! OR stud! OR research) AND ("air quality" w/15 health)

Topic 74 Boolean Results
Defendant: "health effect!" w/10 "air quality"
• 2,691 matches, 82% precision, 3% recall
Plaintiff: (scien! OR stud! OR research) AND ("air quality" OR health)
• 858,700 matches, 64% precision@25000 (ranked), 25% recall@25000 (ranked)
Final: (scien! OR stud! OR research) AND ("air quality" w/15 health)
• 20,516 matches, 77% precision, 22% recall

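The precision and recall figures above follow the usual set-based definitions, and Recall@B in this deck appears to be recall measured at depth B, the number of documents matched by the final boolean query. A minimal Python sketch of that arithmetic follows. The function names are hypothetical, and the track's actual estimates come from sampled judgments, which this sketch does not model.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def recall_at(ranked, relevant, depth):
    """Recall over the first `depth` documents of a ranked list.

    With depth = B (the final boolean query's match count) this is the
    Recall@B measure as assumed in this sketch.
    """
    return precision_recall(ranked[:depth], relevant)[1]


def implied_total_relevant(matches, precision, recall):
    """Back out the total number of relevant documents implied by a result.

    relevant retrieved = matches * precision; total = that / recall.
    The slide's percentages are rounded, so the result is approximate.
    """
    return matches * precision / recall


# Topic 74, using the rounded figures from the slide.
print(round(implied_total_relevant(20516, 0.77, 0.22)))  # Final query
print(round(implied_total_relevant(2691, 0.82, 0.03)))   # Defendant query
```

Both calls land in the range of roughly 72,000 to 74,000 relevant documents, which suggests the Defendant and Final figures are mutually consistent once rounding is allowed for.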
Topic 74: Missed Relevant Documents
Final Boolean: (scien! OR stud! OR research) AND ("air quality" w/15 health)
Passages in Missed Relevant Documents:
• “… Lowrey A.H. (1980). Indoor air pollution …”
• “assessment … entitled “Respiratory Health Effects of Passive Smoking …”
• “study … funded by the Center for Indoor Air Research”

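To see why such passages fail the final query, it can help to evaluate the query mechanically. The sketch below is a toy matcher, not the system used in the experiments. It assumes that a trailing "!" is a prefix wildcard, that a quoted phrase means consecutive words, and that "w/N" means the two operands start within N word positions of each other. The exact distance semantics are an assumption.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_matches(pattern, token):
    """A trailing '!' is treated as a prefix wildcard: 'stud!' matches 'study'."""
    if pattern.endswith("!"):
        return token.startswith(pattern[:-1])
    return token == pattern


def positions(expr, tokens):
    """Start positions at which a single term or a multi-word phrase occurs."""
    words = expr.lower().split()
    return [
        i
        for i in range(len(tokens) - len(words) + 1)
        if all(term_matches(w, tokens[i + j]) for j, w in enumerate(words))
    ]


def within(expr_a, expr_b, n, tokens):
    """True if some occurrence of expr_a starts within n words of expr_b."""
    return any(
        abs(i - j) <= n
        for i in positions(expr_a, tokens)
        for j in positions(expr_b, tokens)
    )


def final_topic74(text):
    """(scien! OR stud! OR research) AND ("air quality" w/15 health)"""
    tokens = tokenize(text)
    has_study = any(positions(p, tokens) for p in ("scien!", "stud!", "research"))
    return has_study and within("air quality", "health", 15, tokens)


missed = "study … funded by the Center for Indoor Air Research"
print(final_topic74(missed))
```

The passage satisfies the first clause ("study", "Research") but never contains the phrase "air quality", so the proximity clause fails and the document is missed.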
Defendant vs. Final Boolean: Precision
• Def. Boolean won 20
• Final Boolean won 22
• (1 tied)
Mean difference in (-0.09, 0.15)
Topic 63: 1.00 vs. 0.02 (sugar contract)
Topic 69: 0.00 vs. 0.97 (indoor smoke ventilation)

Defendant vs. Final Boolean: Recall
• Def. Boolean won 0
• Final Boolean won 42
• (1 tied)
Mean difference in (-0.27, -0.11)
Topic 77: 0.00 vs. 0.00 (smoke NOT tobacco)
Topic 52: 0.00 vs. 0.98 (boosting crop yields)

Plaintiff vs. Final Boolean: Recall@25000
• Pl. Boolean won 35
• Final Boolean won 6
• (2 tied)
Mean difference in (0.03, 0.19)
Topic 59: 0.76 vs. 0.01 (limestone treatment)
Topic 58: 0.24 vs. 0.94 (phosphates and health)

Plaintiff vs. Final Boolean: Recall@B
• Pl. Boolean won 15
• Final Boolean won 27
• (1 tied)
Mean difference in (-0.09, 0.04)
Topic 63: 0.73 vs. 0.27 (sugar contract)
Topic 58: 0.18 vs. 0.94 (phosphates and health)

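The comparison slides all report the same kind of summary: per-topic wins, ties, and a mean difference taken as first run minus second run, which matches the signs shown on the slides. A minimal sketch of that tally is below. `compare_runs` is a hypothetical helper, and the slides do not say how the reported interval around the mean was computed, so the sketch reports only the point estimate.

```python
from statistics import mean


def compare_runs(scores_a, scores_b):
    """Tally per-topic wins and ties and the mean difference (a minus b).

    scores_a and scores_b map a topic id to that run's score on the topic.
    Only topics present in both runs are compared.
    """
    topics = sorted(scores_a.keys() & scores_b.keys())
    diffs = [scores_a[t] - scores_b[t] for t in topics]
    return {
        "a_won": sum(d > 0 for d in diffs),
        "b_won": sum(d < 0 for d in diffs),
        "tied": sum(d == 0 for d in diffs),
        "mean_diff": mean(diffs),
    }


# Only the two Recall@B topics quoted on the slide, not the full topic set.
plaintiff = {"63": 0.73, "58": 0.18}
final = {"63": 0.27, "58": 0.94}
print(compare_runs(plaintiff, final))
```

With only these two topics each run wins once, and the large loss on Topic 58 pulls the mean difference below zero.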
Vector vs. Boolean (Example)
Boolean: (scien! OR stud! OR research) AND ("air quality" w/15 health)
Vector: scien! OR stud! OR research OR air OR quality OR health

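The example above suggests a simple rewrite: drop the grouping, phrase quotes, AND, and proximity operators, and OR together the remaining terms. The following sketch reproduces that rewrite for this example. `to_vector_query` is a hypothetical name, and how the original experiments treated NOT clauses is not stated on the slides, so negation is left outside this sketch.

```python
import re


def to_vector_query(boolean_query):
    """Flatten a boolean query into an OR of its distinct terms.

    Parentheses and phrase quotes are removed, AND/OR and w/N operators
    are dropped, and wildcard terms such as 'stud!' are kept unchanged.
    NOT clauses are not handled here.
    """
    text = re.sub(r'[()"]', " ", boolean_query)
    terms = []
    for tok in text.split():
        if tok.upper() in {"AND", "OR"} or re.fullmatch(r"w/\d+", tok):
            continue
        if tok not in terms:  # keep first occurrence, preserve order
            terms.append(tok)
    return " OR ".join(terms)


final = '(scien! OR stud! OR research) AND ("air quality" w/15 health)'
print(to_vector_query(final))
# scien! OR stud! OR research OR air OR quality OR health
```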
Relevance Ranking
• term frequency dampening (BM25)
  • wildcard variants treated as same term
  • for boolean proximity constraints, only count term occurrences satisfying proximity
  • metadata + ocr included in document length
• inverse document frequency (log)
  • based on most common variant for wildcards

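The ranking ingredients listed above can be illustrated with the textbook BM25 term weight. This is a sketch under stated assumptions, not the actual ranking formula of the system used: the constants k1 = 1.2 and b = 0.75 and the plain log(N/df) form of the inverse document frequency are conventional choices assumed here, and the helper names are hypothetical.

```python
import math


def wildcard_tf(prefix, doc_term_counts):
    """Term frequency for a wildcard: all matching variants count as one term."""
    return sum(c for term, c in doc_term_counts.items() if term.startswith(prefix))


def wildcard_df(prefix, doc_freqs):
    """Document frequency for a wildcard, taken from its most common variant."""
    return max(
        (df for term, df in doc_freqs.items() if term.startswith(prefix)),
        default=0,
    )


def bm25_term_score(tf, df, num_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 weight for one query term in one document.

    tf is dampened, so repeated occurrences add progressively less.
    doc_len would include metadata and OCR text, per the slide.
    k1, b and the log(N/df) idf are assumptions of this sketch.
    """
    if tf == 0 or df == 0:
        return 0.0
    idf = math.log(num_docs / df)
    length_norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1) / (tf + length_norm)


counts = {"study": 2, "studies": 3, "air": 1}
doc_freqs = {"study": 40, "studies": 25, "air": 90}
tf = wildcard_tf("stud", counts)     # variants of stud! summed
df = wildcard_df("stud", doc_freqs)  # most common variant's df
print(tf, df, bm25_term_score(tf, df, num_docs=1000, doc_len=100, avg_doc_len=100))
```

For the proximity rule on the slide, `tf` would count only the occurrences that satisfy the proximity constraint before being passed to `bm25_term_score`.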
Vector vs. Boolean: Recall@B
• Vector won 16
• Boolean won 26
• (1 tied)
Mean difference in (-0.13, 0.02)
Topic 63: 0.79 vs. 0.27 (sugar contract)
Topic 58: 0.08 vs. 0.94 (phosphates and health)

Topic 58: “… health problems caused by HPF …”
Vector R@B=0.08, Boolean R@B=0.94
• (B=8183, estRel = 1151)
Phosphat! w/75 (caus! OR relat! OR assoc! OR derive! OR correlat!) w/75 (health OR disorder! OR toxic! OR "chronic fatigue" OR dysfunction! OR irregular OR memor! OR immun! OR myopath! OR liver! OR kidney! OR heart! OR depress! OR loss OR lost)
• vector matches often didn’t mention “Phosphat!”

Topic 72: “… chemical process(es) which result in onions … making persons cry”
Vector R@B=0.03, Boolean R@B=0.78
• (B=119, estRel = 98)
((scien! OR research! OR chemical) w/25 onion!) AND (cries OR cry! OR tear!)
• proximity clause found some long documents with just one reference to onions’ effects

Topic 63: “… exclusivity clause in a sugar contract …”
Vector R@B=0.79, Boolean R@B=0.27
• (B=294, estRel = 18)
(Sugar w/20 (contract! OR agreement! OR deal!)) AND exclusiv!
• boolean missed “U.S. sugar quota law”

Request vs. Vector: R@25000
• Req. Vector won 21
• Vector won 22
• (0 tied)
Mean difference in (0.00, 0.13)
Topic 87: 1.00 vs. 0.13 (SEC reporting)
Topic 84: 0.64 vs. 0.91 (1960s films)

Impact of Doubling Proximity Distances: Recall@B
• 2x-Prox Boolean won 14
• Boolean won 8
• (21 tied)
Mean difference in (-0.03, 0.02)
Topic 61: 0.49 vs. 0.44 (waste treatment)
Topic 72: 0.39 vs. 0.78 (onions effect)

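Doubling the proximity distances amounts to rewriting every w/N operator as w/2N while leaving the rest of the query untouched. A minimal sketch of that rewrite follows. `scale_proximity` is a hypothetical helper, and it assumes the operator always appears in the lowercase `w/N` form used on these slides.

```python
import re


def scale_proximity(query, factor=2):
    """Multiply the distance of every w/N proximity operator by `factor`."""
    return re.sub(
        r"\bw/(\d+)",
        lambda m: f"w/{int(m.group(1)) * factor}",
        query,
    )


topic72 = "((scien! OR research! OR chemical) w/25 onion!) AND (cries OR cry! OR tear!)"
print(scale_proximity(topic72))
# ((scien! OR research! OR chemical) w/50 onion!) AND (cries OR cry! OR tear!)
```

Queries with no proximity operator are returned unchanged, which is one plausible reason so many topics tie in this comparison.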
Impact of Blind Feedback: Recall@B
• Boolean+BF won 16
• Boolean won 21
• (6 tied)
Mean difference in (-0.12, 0.03)
Topic 90: 0.64 vs. 0.10 (sales in England)
Topic 58: 0.01 vs. 0.94 (phosphates and health)

Fusion of Boolean, Request and Vector: Recall@B
• Fusion won 20
• Boolean won 20
• (3 tied)
Mean difference in (-0.08, 0.03)
Topic 65: 0.88 vs. 0.67 (candy packaging)
Topic 58: 0.10 vs. 0.94 (phosphates and health)

Conclusions
• final negotiated boolean query often had substantially lower recall than the plaintiff boolean query
• boolean operators (AND, proximity) often have value
• blind feedback and fusion did not improve the boolean run’s Recall@B (on average)