Planning for the TREC 2008 Legal Track. Douglas Oard Stephen Tomlinson Jason Baron. Agenda. Track goals Deciding on a document collection “Beating Boolean” Handling nasty OCR Making the best use of the metadata Ad hoc task design Interactive task design Relevance feedback task design
Planning for the TREC 2008 Legal Track Douglas Oard Stephen Tomlinson Jason Baron
Agenda • Track goals • Deciding on a document collection • “Beating Boolean” • Handling nasty OCR • Making the best use of the metadata • Ad hoc task design • Interactive task design • Relevance feedback task design • Other issues
Track Goals • Develop a reusable test collection • Documents, topics, evaluation measures • Foster formation of a research community • Establish baseline results
Choosing a Collection • FERC Enron (w/attachments, full headers) • Somewhat larger than CMU • Email is the real killer app for E-discovery • IIT CDIP version 1.0 (same as 2006/07) • We have 83 topics. Do we need more? • State Department Cables • Task model would be FOIA, not E-Discovery
TREC Topic Number: 1 • Title: Marketers or Traders of Electricity on the Financial Market • Description: Identify Enron employees who bought and sold electricity on California’s financial (long-term sales) energy market, solely for the purpose of re-buying/re-selling this energy later for a profit. • Narrative: A relevant document must at a minimum identify the name and email address of the marketer, as well as the Enron subsidiary to which he/she belonged. The marketer’s phone number would be helpful as well, to help analysis of the corresponding Enron voice dataset. • Hint: Enron Power Marketing, Inc. (EPMI), Enron Energy Services, Inc. and Enron Energy Marketing Corporation all appear to have conducted long-term marketing services for Enron. This observation is based on the fact that Enron submitted information for all three of these subsidiaries in its reply to FERC’s data request 2 (DR2). (DR2 asked Enron to submit information about its short-term and long-term sales. Enron replied with data from these three subsidiaries.) (38, pp. 1-2, plus personal analysis.) It would be good, however, to know for sure which entities or persons did marketing at Enron. • Query Possibilities: • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) • (marketer or marketers or “Enron Power Marketing” or EPMI or “Enron Energy Services” or “Enron Energy Marketing Corporation”) and (MW or KW or watt* or MwH or KwH) ◦ This is to target electricity sales rather than natural gas sales. All the subsequent electricity queries can be similarly modified. • (marketer or marketers or EPMI) and (short or long) ◦ As in have a long or short position in sales/purchases.
• (marketer or marketers or EPMI) and (NYMEX or CBOT or “Mid-Columbia” or COB or “California-Oregon Border” or “Four Corners” or “Palo Verde” or EOL) ◦ The electricity futures hubs were Mid-Columbia, COB, Four Corners, and Palo Verde, as best the author can tell. (85) NYMEX and CBOT ran these. (89; 15, p. 78) ◦ EOL was the forward market trading place. (36, p. 3)
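A minimal sketch of how the second query possibility in the topic above might be evaluated against a message body. It assumes case-insensitive matching, quoted phrases treated as exact word sequences, and `watt*` treated as a prefix wildcard. The helper names `has_term` and `matches_query` are hypothetical, and a real e-discovery search engine would apply its own query semantics.

```python
import re

# First group of the query: marketer terms and subsidiary names.
MARKETER_TERMS = [
    "marketer", "marketers", "Enron Power Marketing", "EPMI",
    "Enron Energy Services", "Enron Energy Marketing Corporation",
]
# Second group: electricity units; a trailing * marks a prefix wildcard.
UNIT_TERMS = ["MW", "KW", "watt*", "MwH", "KwH"]


def has_term(text: str, term: str) -> bool:
    """Case-insensitive whole-word (or whole-phrase) match.

    A term ending in '*' matches any word that starts with the prefix.
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_query(text: str) -> bool:
    """Evaluate (marketer terms) AND (unit terms) on one document."""
    return (any(has_term(text, t) for t in MARKETER_TERMS)
            and any(has_term(text, t) for t in UNIT_TERMS))
```

For example, `matches_query("EPMI sold 50 MW at Palo Verde")` is true, while a message that mentions a marketer but no electricity unit is rejected, which is the stated purpose of the second clause.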
Identity Modeling in Enron • 82,084 addr-name • 3,151 addr-nickname • 19,708 addr-addr • 66,715 models • [Figure: example identity model linking the name variants susan m scott, susan scott, m scott, susan, sue, suebob, sscott, and sscott5 to the addresses m..scott@enron.com, sscott5@enron.com, and scott.susan@enron.com]
Enron Identity Test Collections • Test collections: Enron-all, Enron-subset, Sager, Shapiro
Example Document • Scanned: [page image] • OCR: Philip Moxx's. U.S.A. x.dr~am~c. cvrrespoaa.aa Benffrts Departmext Rieh>pwna, Yfe&ia Ta: Dishlbutfon Data aday 90,1997. From: Lisa Fislla Sabj.csr CIGNA WeWedng Newsbttsr -Yntsre StratsU During our last CIGNA Aatfoa Plan meadng, tlu iasuo of wLetSae to i0op per'Irw+ng artieles aod discontinue mndia6 CIGNA Well-Being aawslener to om employees was a msiter of disanision . I Imvm done somme reaearc>>, and wanted to pruedt you with my Sadings and pcdiminary recwmmeadatioa for PM's atratezy Ieprding l4aas aewelattee* . I believe .vayone'a input is valusble, and would epproolate hoarlng fmaa aaeh of you on whetlne you concur with my reeommendatioa … • Metadata: • Title: CIGNA WELL-BEING NEWSLETTER - FUTURE STRATEGY • Organization Authors: PMUSA, PHILIP MORRIS USA • Person Authors: HALLE, L • Document Date: 19970530 • Document Type: MEMO, MEMORANDUM • Bates Number: 2078039376/9377 • Page Count: 2 • Collection: Philip Morris
State Department Cables • 791,857 records, 550,983 of which are full text
Handling Nasty OCR • Index pruning • Error estimation • Character n-grams • Duplicate detection • Expansion using a cleaner collection
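As one illustration of the character n-gram idea, the sketch below compares tokens by their overlapping character trigrams, so that an OCR-damaged token can still partially match its clean form. The function names, the choice of n = 3, the underscore padding, and the Dice coefficient are all assumptions made for this example, not something the slides specify.

```python
def char_ngrams(token: str, n: int = 3) -> set[str]:
    """Return the set of character n-grams of a token, padded with '_'."""
    padded = f"_{token.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_similarity(a: str, b: str, n: int = 3) -> float:
    """Dice coefficient over the two tokens' character n-gram sets."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))
```

With this measure, the clean word "newsletter" still shares some trigrams with the OCR output "aawslener" from the sample document, while an exact word match between the two would fail.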
How to “Beat Boolean” • Work from reference Boolean? • Swap out low-ranked-in for high-ranked-out • Relax Boolean somehow? • Cover density, proximity perturbation, …
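The "swap out low-ranked-in for high-ranked-out" idea could be sketched as follows, assuming a ranked retrieval run that scores every document, including all matches of the reference Boolean query. The function name `swap_boolean` and the parameter `k` (how many documents to exchange) are hypothetical; the slides do not say how many swaps to make.

```python
def swap_boolean(boolean_set: set[str], ranked: list[str], k: int) -> list[str]:
    """Replace the k lowest-ranked Boolean matches with the k
    highest-ranked documents that the Boolean query missed.

    `ranked` is ordered best-first and is assumed to contain every
    document in `boolean_set`. The result has the same size as the
    ranked Boolean result set.
    """
    ranked_in = [d for d in ranked if d in boolean_set]
    ranked_out = [d for d in ranked if d not in boolean_set]
    k = max(0, min(k, len(ranked_in), len(ranked_out)))
    kept = ranked_in[:len(ranked_in) - k]
    return kept + ranked_out[:k]
```

Keeping the result the same size as the Boolean set means the two can be compared at the same cutoff.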
Using Metadata • Title (term match) • Author (social network) • Bates number (sequence)
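To illustrate how the Bates number sequence might be used, the sketch below expands a Bates range such as the sample document's "2078039376/9377" into page numbers and checks whether two documents are adjacent in the sequence. It assumes that the part after the slash replaces the trailing digits of the starting number. That reading is consistent with the sample's page count of 2 but is not confirmed by the slides, and the function names are hypothetical.

```python
def expand_bates(bates: str) -> list[str]:
    """Expand 'START/SUFFIX' into the list of Bates page numbers.

    Assumption: SUFFIX overwrites the last len(SUFFIX) digits of START
    to give the final page number.
    """
    start, sep, suffix = bates.partition("/")
    if not sep:
        return [start]
    end = start[:len(start) - len(suffix)] + suffix
    first, last = int(start), int(end)
    if last < first:
        raise ValueError(f"end precedes start in Bates range: {bates}")
    return [str(n).zfill(len(start)) for n in range(first, last + 1)]


def are_adjacent(a: str, b: str) -> bool:
    """True if document b starts on the page right after document a ends."""
    return int(expand_bates(a)[-1]) + 1 == int(expand_bates(b)[0])
```

Adjacent Bates numbers could then serve as a signal that two documents were filed together, for example a memo and its attachment.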
Ad Hoc Task Design • Evaluation measures • R@B?, P@R?, Index size? • Error bars / Statistical significance testing • Limits on post-hoc use of the collection? • What are “meaningful” differences? • Topic design • Negotiation transcript? • Inter-annotator agreement
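The candidate measures could be computed as in the sketch below, assuming that B is the number of documents returned by the reference Boolean query and that R is the number of known relevant documents for the topic. The sketch also assumes complete relevance judgments; estimates from sampled judgments would need additional weighting. Function names are hypothetical.

```python
def recall_at(ranked: list[str], relevant: set[str], cutoff: int) -> float:
    """Fraction of relevant documents found in the top `cutoff` results.
    With cutoff = B this is R@B."""
    if not relevant or cutoff <= 0:
        return 0.0
    return len(set(ranked[:cutoff]) & relevant) / len(relevant)


def precision_at(ranked: list[str], relevant: set[str], cutoff: int) -> float:
    """Fraction of the top `cutoff` results that are relevant.
    With cutoff = R = len(relevant) this is P@R."""
    if cutoff <= 0:
        return 0.0
    return len(set(ranked[:cutoff]) & relevant) / cutoff
```

Evaluating a ranked run at cutoff B puts it on the same footing as the Boolean result set, which returns exactly B documents.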
Interactive Task Design • Evaluation measure • Precision-oriented? • Recall-oriented? • Effect of assessor disagreement
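One common way to quantify assessor disagreement is Cohen's kappa over two assessors' binary relevance judgments on the same documents. The slides do not name a specific agreement statistic, so the choice of kappa here is an assumption, and the sketch is only meant to show the arithmetic.

```python
def cohens_kappa(labels_a: list[bool], labels_b: list[bool]) -> float:
    """Cohen's kappa for two assessors' binary relevance labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: fraction of documents labeled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each assessor's rate of 'relevant' labels.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both assessors gave every document the same single label
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates agreement well beyond chance, and a value near 0 indicates agreement no better than chance.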
Relevance Feedback Task • Evaluation measure • Residual recall at B_Residual? • Two-stage feedback?
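A residual measure could be sketched as below: documents whose judgments were given to the system as feedback are removed from both the ranking and the relevant set before recall is computed. The cutoff is passed in as a parameter, and how B_Residual itself would be defined (for example, the number of Boolean matches left after removing the feedback documents) is an assumption the slides leave open. The function name is hypothetical.

```python
def residual_recall(ranked: list[str], relevant: set[str],
                    feedback_docs: set[str], cutoff: int) -> float:
    """Recall at `cutoff` after removing feedback documents from
    both the ranked list and the set of relevant documents."""
    residual_ranked = [d for d in ranked if d not in feedback_docs]
    residual_relevant = relevant - feedback_docs
    if not residual_relevant or cutoff <= 0:
        return 0.0
    hits = set(residual_ranked[:cutoff]) & residual_relevant
    return len(hits) / len(residual_relevant)
```

Removing the feedback documents keeps a system from getting credit for simply returning documents it was already told are relevant.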
Some Open Questions • Test collection reusability • Unbiased estimates? Tight error bars? • Why can’t we beat Boolean??? • Different strategies? Detailed failure analysis? • Can we improve topic formulation? • Structured relevance feedback? • Is OCR masking effects we need to see? • Is it time for a new collection? • Must it be de-duped? Is metadata needed? • Does Δscope invalidate the interactive task?