Explore the implementation and research goals of an expert search system utilizing the W3C corpus for richer candidate representation and email discussion list modeling. Discover innovative approaches for forming associations and conducting evaluations.
Expert Search Project group 3
Introduction • Expert Search Task • Enterprise Track at TREC • W3C corpus (300.000) • Goals • Implement an expert search system • Conduct research • Richer candidate representation • Effect of e-mail discussion lists
Modeling expert search • Approach • Build language model for each document • Find a number of relevant documents (Topicality) • Different levels of associations • Motivation • How a user normally searches for an expert • Google search • Find relevant documents • Check names in documents
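The approach on this slide can be sketched roughly as follows. This is a minimal illustration only: the slides do not give the scoring formula, so the sketch assumes a document-centric score of the form sum over documents of p(q|d) times an association weight, with Jelinek-Mercer smoothing. The names `rank_candidates`, `lam`, `top_n` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def doc_log_likelihood(query, doc_tokens, coll_counts, coll_len, lam=0.5):
    """log p(q | d) under a smoothed unigram language model (assumed smoothing)."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    score = 0.0
    for term in query:
        p_doc = counts[term] / n if n else 0.0
        p_coll = coll_counts[term] / coll_len
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank_candidates(query, docs, associations, lam=0.5, top_n=10):
    """docs: {doc_id: [tokens]}, associations: {doc_id: {candidate: weight}}."""
    coll_counts = Counter(t for toks in docs.values() for t in toks)
    coll_len = sum(coll_counts.values())
    if coll_len == 0:
        return []
    # Step 1: one language model per document, scored against the query.
    doc_scores = {
        d: doc_log_likelihood(query, toks, coll_counts, coll_len, lam)
        for d, toks in docs.items()
    }
    # Step 2: keep only the most topical documents.
    top_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)[:top_n]
    # Step 3: pass document scores on to the associated candidates.
    cand_scores = defaultdict(float)
    for d in top_docs:
        p_q_d = math.exp(doc_scores[d])
        for cand, weight in associations.get(d, {}).items():
            cand_scores[cand] += p_q_d * weight
    return sorted(cand_scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data, invented for illustration.
docs = {
    "d1": ["semantic", "web", "ontology"],
    "d2": ["css", "layout"],
    "d3": ["semantic", "web", "web"],
}
associations = {
    "d1": {"alice": 1.0},
    "d2": {"bob": 1.0},
    "d3": {"alice": 0.5, "carol": 0.5},
}
ranking = rank_candidates(["semantic", "web"], docs, associations)
print(ranking)
```

This mirrors the manual routine in the motivation: find relevant documents first, then look at the names attached to them.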
Initial search system • Minimal system • 2005 TREC entry • Formal models paper • Model 2
Initial associations • Forming associations: M0..M3 • M0: EXACT_MATCH • M1: NAME_MATCH • M2: LASTNAME_MATCH • M3: E-MAIL_MATCH
Example text: Expert search is part of the larger field of enterprise search (David Hawking, 2004). According to D. Hawking, enterprise search includes searching through all possible sources of textual information within an organization……….Hawking performed experiments which show that anchor text proved to be a good starting point for enterprise search. For more information e-mail to: Hawking@wc3.org...
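The four match levels could be checked along the following lines. The slide names the levels but does not define them, so the readings in the comments (full name, initial plus last name, last name only, e-mail address) are assumptions, and `match_levels` and the lower-cased address are hypothetical.

```python
import re


def match_levels(text, first, last, email):
    """Return the set of association levels that fire for one candidate."""
    f, l = re.escape(first), re.escape(last)
    initial = re.escape(first[0])
    levels = set()
    # M0 (assumed): the full name appears exactly as given.
    if re.search(rf"\b{f}\s+{l}\b", text):
        levels.add("M0")
    # M1 (assumed): last name preceded by the first name or its initial.
    if re.search(rf"\b(?:{f}|{initial}\.?)\s+{l}\b", text):
        levels.add("M1")
    # M2 (assumed): the last name on its own.
    if re.search(rf"\b{l}\b", text):
        levels.add("M2")
    # M3 (assumed): the candidate's e-mail address, compared case-insensitively.
    if email.lower() in text.lower():
        levels.add("M3")
    return levels


snippet = "According to D. Hawking, anchor text helps. E-mail to: Hawking@wc3.org"
print(sorted(match_levels(snippet, "David", "Hawking", "hawking@wc3.org")))
```

Under these assumed definitions an M0 match always implies M1 and M2, so a system would normally record only the strictest level that fires.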
Added associations • New associations (M4..M7) • M4: FIRST_LAST_MATCH • M5: MAIL_HEADER_FROM • M6: MAIL_HEADER_TO • M7: MAIL_HEADER_CC
Example e-mail:
To: Dan Connolly
Cc: Maria Fernandez
From: David Hawking
Subject: Conferences
Date: 21 Oct 2003 12:07
I’d like to send Smith to ADC2004. She’s entitled under section whatever on p.27 of the corporate manual. Jones wants to go but she already went on that junket to Maui.
David L. Hawking
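The header-based associations M5..M7 might be extracted with Python's standard `email` package, as in the sketch below. It assumes headers of the form "Name <address>"; the example.org addresses are invented, since the slide's example shows display names only, and `header_associations` is a hypothetical helper.

```python
from email import message_from_string
from email.utils import getaddresses

# Mapping from header name to association level, as listed on the slide.
HEADER_LEVELS = {"From": "M5", "To": "M6", "Cc": "M7"}


def header_associations(raw_message):
    """Return (level, display_name, address) triples found in the mail headers."""
    msg = message_from_string(raw_message)
    found = []
    for header, level in HEADER_LEVELS.items():
        values = msg.get_all(header, [])
        for name, address in getaddresses(values):
            found.append((level, name, address))
    return found


raw = (
    "To: Dan Connolly <dan@example.org>\n"
    "Cc: Maria Fernandez <maria@example.org>\n"
    "From: David Hawking <david@example.org>\n"
    "Subject: Conferences\n"
    "\n"
    "I'd like to send Smith to ADC2004.\n"
)
for triple in header_associations(raw):
    print(triple)
```

Keeping the level with each hit makes it possible to weight senders differently from recipients later on.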
System additions • User Interface (GUI) • TREC evaluation framework • Special treatment of discussion lists • Cleaning the W3C Corpus • Signature detection • Quotation detection • Forming associations ( M4 .. M7 ) • Richer candidate representation • Detect multiple names and e-mail addresses • Entry page detection • Statistics on e-mail usage • Signatures
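Quotation and signature detection could start from simple line heuristics such as the ones below. The slides do not say how the detection was done, so this sketch assumes two common conventions: quoted lines begin with ">" and a signature follows a line consisting of "-- ". The function name is hypothetical.

```python
def strip_quotes_and_signature(body):
    """Split a message body into (own_text, signature), dropping quoted lines."""
    own, signature = [], []
    in_signature = False
    for line in body.splitlines():
        if in_signature:
            signature.append(line)
        elif line == "-- " or line.strip() == "--":
            in_signature = True  # assumed signature delimiter
        elif line.lstrip().startswith(">"):
            continue  # assumed quotation marker
        else:
            own.append(line)
    return "\n".join(own).strip(), "\n".join(signature).strip()


body = (
    "I agree with the proposal.\n"
    "> Should we adopt RDF?\n"
    "> -- \n"
    "> old signature\n"
    "-- \n"
    "David L. Hawking\n"
    "hawking@example.org"
)
text, sig = strip_quotes_and_signature(body)
print(repr(text))
print(repr(sig))
```

Separating the three parts lets the author's own text feed the document model while the signature feeds the candidate representation, and quoted text is not credited to the wrong person.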
Evaluation • Improvement overview on P5
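Reading P5 as precision at rank 5 (an assumption; the slide does not expand the abbreviation), the measure can be computed as in this small sketch. The candidate identifiers and relevance judgements are invented.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked candidates that are judged relevant."""
    top = ranked[:k]
    return sum(1 for cand in top if cand in relevant) / k


def mean_precision_at_k(runs, k=5):
    """Average P@k over topics; runs is a list of (ranked, relevant) pairs."""
    return sum(precision_at_k(r, rel, k) for r, rel in runs) / len(runs)


ranked = ["c1", "c2", "c3", "c4", "c5", "c6"]
relevant = {"c1", "c4", "c6"}
print(precision_at_k(ranked, relevant))
```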
Evaluation (2) • Effect of topicality
Evaluation (3) • Signature detection • Only relevant signatures • Entry page • Remove non-relevant (antivirus, disclaimers, yahoo) • Statistics • extracted: 28000 signatures • distinct: 3788 signatures • candidates associated: 1179 candidates • distinct candidates: 208 candidates
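Filtering non-relevant signatures and counting distinct ones could look like the sketch below. The marker list only echoes the three examples on the slide, and the whitespace and case normalisation is an assumption; the real filter is not described.

```python
# Hypothetical markers, taken from the slide's examples of non-relevant signatures.
BOILERPLATE_MARKERS = ("antivirus", "disclaimer", "yahoo")


def normalise(signature):
    """Lower-case and collapse whitespace so trivial variants compare equal."""
    return " ".join(signature.lower().split())


def relevant_distinct_signatures(signatures):
    """Drop boilerplate signatures and exact duplicates, keeping first occurrences."""
    seen = set()
    kept = []
    for sig in signatures:
        key = normalise(sig)
        if not key or any(marker in key for marker in BOILERPLATE_MARKERS):
            continue
        if key not in seen:
            seen.add(key)
            kept.append(sig)
    return kept


extracted = [
    "David L. Hawking\nCSIRO",
    "david l. hawking   csiro",
    "Checked by AntiVirus software",
    "Do you Yahoo!?",
]
kept = relevant_distinct_signatures(extracted)
print(len(extracted), "extracted,", len(kept), "distinct and relevant")
```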
System demonstration • 2 searches on basic interface • TREC interface
For further research • Reranking: explore the options for successful reranking and try to implement it in a fast and effective way (this is a hard task) • Structure e-mail discussion lists • Extract more from signatures
Questions • Are there any questions?