
STIR:

STIR: Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval. Robert S. Bauer, Teresa Jade (www.H5technologies.com) & Mitchell P. Marcus (www.cis.upenn.edu/~mitch/). June 7, 2007. The e-Discovery IDEAL: High P with High R.


Presentation Transcript


  1. STIR: Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval. Robert S. Bauer, Teresa Jade (www.H5technologies.com) & Mitchell P. Marcus (www.cis.upenn.edu/~mitch/). June 7, 2007

  2. The e-Discovery IDEAL: High P with High R • Find every relevant document, and only those documents that are relevant • Desired: P = 0.8 (or better) @ R = 0.8 (or better) • Acceptable: P = 2/3 (or better) @ R = 2/3 (or better)
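
The targets on this slide can be checked mechanically once precision and recall are computed from a retrieved set and a relevant set. The following is a minimal sketch, not anything from the presentation itself: the function names, the constants `DESIRED` and `ACCEPTABLE`, and the toy document IDs are all hypothetical and chosen only for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def meets_target(precision, recall, threshold):
    """True when both precision and recall reach the threshold."""
    return precision >= threshold and recall >= threshold


# Thresholds taken from the slide; the names are assumptions.
DESIRED = 0.8
ACCEPTABLE = 2 / 3

# Hypothetical toy collection: 10 relevant docs, 10 retrieved, 7 in common.
relevant = {f"doc{i}" for i in range(10)}      # doc0 .. doc9
retrieved = {f"doc{i}" for i in range(3, 13)}  # doc3 .. doc12

p, r = precision_recall(retrieved, relevant)
print(f"P={p:.2f} R={r:.2f}")
print("acceptable:", meets_target(p, r, ACCEPTABLE))
print("desired:", meets_target(p, r, DESIRED))
```

In this toy case P and R are both 0.7, which clears the Acceptable threshold of 2/3 but falls short of the Desired threshold of 0.8.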

  3. The e-Discovery REALITY • High P & Low R = RISK (important docs not retrieved) • Text REtrieval Conference (TREC) • Low P & High R = COST (many more documents must be reviewed)

  4. Agenda • Results: TREC ad hoc (= typical); Queries typifying Communities of Practice (CoPs) • e-Discovery Approaches: 5 Dimensions; Linguistics of CoPs • Research Issues: TREC; AI; Linguists; Lawyers

  5. Typical Results – ad hoc queries • 22 Topics Average • Desired is Rare • Acceptable < 10% (from Chapter 3, “Retrieval System Evaluation” by Chris Buckley and Ellen M. Voorhees, in TREC: Experiment and Evaluation in Information Retrieval, Voorhees & Harman, eds., MIT Press, 2005, p. 62, Fig. 3.1)

  6. Accuracy Metrics compared with STIR topical avg in 4 cases (I-IV) encompassing 42 topics • F1 = 2·(P·R)/(P+R) • [Figure labels: Ideal, Acceptable; TREC avg, I, II, III, IV] • Most accurate TREC results for 20 of 22 topics in one test case
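
The F1 formula on this slide is the harmonic mean of precision and recall. The short sketch below evaluates it at the Desired (P = R = 0.8) and Acceptable (P = R = 2/3) levels from slide 2, plus one imbalanced pair. The function name `f1_score` and the imbalanced example values are assumptions for illustration, not figures from the presentation.

```python
def f1_score(precision, recall):
    """F1 = 2 * (P * R) / (P + R); defined as 0.0 when P + R is zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Levels from slide 2: when P equals R, F1 equals that same value.
f1_desired = f1_score(0.8, 0.8)
f1_acceptable = f1_score(2 / 3, 2 / 3)

# Hypothetical imbalanced result (high P, low R): F1 is pulled toward the lower value.
f1_imbalanced = f1_score(0.9, 0.2)

print(f"desired:    {f1_desired:.3f}")
print(f"acceptable: {f1_acceptable:.3f}")
print(f"imbalanced: {f1_imbalanced:.3f}")
```

Because the harmonic mean is dominated by the smaller of the two inputs, a system with high precision but low recall (or the reverse) scores well below either threshold, which is why a single F1 value is a convenient summary of "high P with high R".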

  7. Average P & R for each case: STIR compared with TREC IR • [Figure labels: STIR, TREC; Precision, Recall] • Topical P & R results for one TREC and 4 STIR cases

  8. STIR training provides substantial Recall improvement with acceptable Precision reduction • Retrieval Acceptable to lowest limit of statistical uncertainty • [Figure labels: Recall Improvement; Precision, Recall] • Sampled Corpus Tests for 12 Topics in case I during STIR Training

  9. Agenda • Results: TREC ad hoc (= typical); Queries typifying Communities of Practice (CoPs) • e-Discovery Approaches: 5 Dimensions; Linguistics of CoPs • Research Issues: TREC; AI; Linguists; Lawyers

  10. Dimensions of e-Discovery • Documents • Community • Linguistics • Subject Matter • Legal Case

  11. Dimensions of e-Discovery: Document Review • Dimensions shown: Documents, Legal Case • Example Systems: manual (human) review conducted by attorneys; basic keyword searches targeted to legal issues; supervised learning with relevance feedback

  12. Dimensions of e-Discovery: Expert Search • Dimensions shown: Documents, Subject Matter, Legal Case • Example Systems: subject matter experts review results under legal team direction; domain-specific lexicons used

  13. Dimensions of e-Discovery: Model Meaning • Dimensions shown: Documents, Linguistics, Subject Matter, Legal Case • Example Systems: supervised learning with relevance feedback and semantic analysis; semantic search

  14. Dimensions of e-Discovery: Model Communities • Dimensions shown: Documents, Community, Linguistics, Subject Matter, Legal Case • Example System: Socio-Technical-IR

  15. Dimensions of e-Discovery: Socio-Technical-IR • Dimensions shown: Community, Linguistics • Non-computational Linguistic Disciplines: Pragmatics; Socio-Linguistics; Ethno-Methodology; Discourse Analysis • A community of practice is a diverse group of people, engaged in real work, over a significant period of time, developing their own tools, language, and processes, during which they build things, solve problems, learn and invent, evolving a practice that is highly skilled and highly creative

  16. Agenda • Results: TREC ad hoc (= typical); Queries typifying Communities of Practice (CoPs) • e-Discovery Approaches: 5 Dimensions; Linguistics of CoPs • Research Issues: TREC; AI; Linguists; Lawyers

  17. Research Issues • TREC: nature of the relatively rare high P with high R queries; measuring both recall and precision effectively • AI: Knowledge-Based (Expert) Systems that codify linguistic expertise; characterize practice communities of subject matter experts; investigate combination systems applied to different types of topics • Linguists: identify and characterize different types of topics and map to system types; language patterns in communities as well as subject matter fields; defining categories in concrete terms • Lawyers: defining categories in concrete terms; integration of technology and processes

  18. Back-Up

  19. STIR Analysis: CoPs’ Enunciatory language • [Diagram labels: Object, Relevant Document Text, Process, State of Affairs, Event, Action, Fact]
