STIR: Simultaneous Achievement of High Precision and High Recall through Socio-Technical Information Retrieval
Robert S. Bauer, Teresa Jade (www.H5technologies.com) & Mitchell P. Marcus (www.cis.upenn.edu/~mitch/)
June 7, 2007
The e-Discovery IDEAL: High P with High R
• Find every relevant document, and only those documents that are relevant
• Desired: P = 0.8 (or better) @ R = 0.8 (or better)
• Acceptable: P = 2/3 (or better) @ R = 2/3 (or better)
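The thresholds above can be made concrete with a small sketch. It assumes the standard set-based definitions of precision (share of retrieved documents that are relevant) and recall (share of relevant documents that are retrieved); the function names, the document ids, and the convention of returning 0 for an empty set are illustrative assumptions, not part of the original slides.

```python
from fractions import Fraction

# Thresholds taken from the slide; exact fractions avoid float rounding at 2/3.
DESIRED = Fraction(4, 5)
ACCEPTABLE = Fraction(2, 3)


def precision_recall(retrieved, relevant):
    """Return (precision, recall) for collections of document ids.

    Assumption: an empty retrieved or relevant set yields 0 rather than
    an undefined value.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = Fraction(hits, len(retrieved)) if retrieved else Fraction(0)
    recall = Fraction(hits, len(relevant)) if relevant else Fraction(0)
    return precision, recall


def rate(precision, recall):
    """Classify a result by the weaker of its two scores."""
    weakest = min(precision, recall)
    if weakest >= DESIRED:
        return "desired"
    if weakest >= ACCEPTABLE:
        return "acceptable"
    return "below acceptable"


if __name__ == "__main__":
    # Hypothetical corpus: documents 1..9 are relevant; 9 documents retrieved.
    relevant = range(1, 10)
    retrieved = [1, 2, 3, 4, 5, 6, 10, 11, 12]
    p, r = precision_recall(retrieved, relevant)
    print(p, r, rate(p, r))
```

Rating by the weaker score reflects the slide's requirement that both P and R meet the threshold at the same time.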
The e-Discovery REALITY
• High P & Low R = RISK (important docs not retrieved)
• Low P & High R = COST (many more documents must be reviewed)
[Figure label: Text REtrieval Conference (TREC)]
Agenda
• Results
  • TREC ad hoc (= typical)
  • Queries typifying Communities of Practice (CoPs)
• e-Discovery Approaches
  • 5 Dimensions
  • Linguistics of CoPs
• Research Issues
  • TREC
  • AI
  • Linguists
  • Lawyers
Typical Results: ad hoc queries
[Figure: 22 topics, average]
• Desired is rare
• Acceptable < 10%
(from Chapter 3, “Retrieval System Evaluation” by Chris Buckley and Ellen M. Voorhees, in TREC: Experiment and Evaluation in Information Retrieval, Voorhees & Harman, eds., MIT Press, 2005, p. 62, Fig. 3.1)
Accuracy Metrics: TREC average compared with STIR topical average in 4 cases (I-IV) encompassing 42 topics
F1 = 2 × (P × R) / (P + R)
[Figure: F1 for TREC avg and STIR cases I, II, III, IV, with reference levels marked Ideal and Acceptable]
Most accurate TREC results for 20 of 22 topics in one test case
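As a worked illustration of the F1 formula on this slide, the sketch below evaluates it at the Desired (P = R = 0.8) and Acceptable (P = R = 2/3) thresholds and at one unbalanced point. The function name `f1` and the unbalanced example values are assumptions chosen for illustration, not figures from the presentation.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # When P equals R, F1 equals that common value.
    print("desired:   ", f1(0.8, 0.8))
    print("acceptable:", f1(2 / 3, 2 / 3))
    # Hypothetical high-precision, low-recall result: F1 is pulled
    # toward the weaker of the two scores.
    print("unbalanced:", f1(0.9, 0.1))
```

Because F1 is a harmonic mean, a single number near the Desired level can only be reached when precision and recall are both high, which is why it suits the "high P with high R" goal.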
Average P & R for each case: STIR compared with TREC IR
[Figure: Precision vs. Recall; series: STIR, TREC]
Topical P & R results for one TREC and 4 STIR cases
● STIR training provides substantial Recall improvement with acceptable Precision reduction
● Retrieval is Acceptable to the lowest limit of statistical uncertainty
[Figure: Precision vs. Recall, with the Recall improvement marked]
Sampled Corpus Tests for 12 Topics in case I during STIR Training
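The slide does not say how the "lowest limit of statistical uncertainty" was computed. One common way to bound a proportion estimated from a sample is the Wilson score interval, sketched below purely as an assumption; the sample size, the hit count, and the 95% confidence level (z = 1.96) are hypothetical values, not data from the STIR tests.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    Assumption: this stands in for whatever interval the authors used;
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


if __name__ == "__main__":
    # Hypothetical sample: 100 relevant documents drawn from the corpus,
    # of which the system retrieved 80 (estimated recall 0.8).
    lower, upper = wilson_interval(80, 100)
    print(f"recall in [{lower:.3f}, {upper:.3f}]")
    # "Acceptable to the lowest limit" would mean the lower bound clears 2/3.
    print("acceptable at lower limit:", lower >= 2 / 3)
```

Comparing the lower bound, rather than the point estimate, against the 2/3 threshold is the more conservative reading of the slide's claim.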
Dimensions of e-Discovery
[Diagram labels: Documents, Legal Case, Subject Matter, Linguistics, Community]
Dimensions of e-Discovery: Document Review
[Diagram labels: Documents, Legal Case]
Example Systems:
• Manual (human) review conducted by attorneys
• Basic keyword searches targeted to legal issues
• Supervised learning with relevance feedback
Dimensions of e-Discovery: Expert Search
[Diagram labels: Documents, Legal Case, Subject Matter]
Example Systems:
• Subject matter experts review results under legal team direction
• Domain-specific lexicons used
Dimensions of e-Discovery: Model Meaning
[Diagram labels: Documents, Legal Case, Subject Matter, Linguistics]
Example Systems:
• Supervised learning with
  • relevance feedback
  • semantic analysis
• Semantic search
Dimensions of e-Discovery: Model Communities
[Diagram labels: Documents, Legal Case, Subject Matter, Linguistics, Community]
Example System:
• Socio-Technical-IR
Dimensions of e-Discovery: Socio-Technical-IR
[Diagram labels: Linguistics, Community]
• Non-computational Linguistic Disciplines
  • Pragmatics
  • Socio-Linguistics
  • Ethno-Methodology
  • Discourse Analysis
• A community of practice is
  • a diverse group of people
  • engaged in real work
  • over a significant period of time
  • developing their own tools, language, and processes
  • during which they build things, solve problems, learn and invent
  • evolving a practice that is highly skilled and highly creative
Research Issues
• TREC
  • Nature of the relatively rare high P with high R queries
  • Measuring both recall and precision effectively
• AI
  • Knowledge-Based (Expert) Systems that codify linguistic expertise
  • Characterize practice communities of subject matter experts
  • Investigate combination systems applied to different types of topics
• Linguists
  • Identify and characterize different types of topics and map to system types
  • Language patterns in communities as well as subject matter fields
  • Defining categories in concrete terms
• Lawyers
  • Defining categories in concrete terms
  • Integration of technology and processes
STIR Analysis: CoPs’ Enunciatory Language
[Diagram labels: Relevant Document Text; Object, Process, State of Affairs, Event, Action, Fact]