Information Retrieval and Extraction 2009 Term Project – Modern Web Search
Advisor: 陳信希
TA: 蔡銘峰 & 許名宏
Overview (in English)
• The goal
  • Use advanced approaches to improve the performance of basic IR models
• Group
  • 1 to 3 persons per group; email the name list to the TA
• Approach
  • No limitations; any resource on the Web may be used
• Date of system demo and report submission
  • Thursday, 6/18 (provisional)
• Criteria for the grade
  • Originality and reasonableness of your approach
  • Implementation effort per person
  • Retrieval performance (training and testing)
  • Completeness of the report, division of the work, and analysis of the retrieval results
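Overview (translated from Chinese)
• Project goal
  • Use advanced IR techniques to improve the performance of basic retrieval models
• Grouping
  • 1 to 3 persons per group; the group leader should email the member list (student IDs and names) to the TA
• Approach
  • No restrictions; any toolkit or resource on the Web may be used
• Demo and report submission
  • Thursday, 6/18 (provisional)
• Grading criteria
  • Creativity and reasonableness of the chosen approach
  • Implementation effort per person
  • Retrieval performance (training and testing)
  • Completeness of the report, division of the work, and analysis of the retrieval results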
Content in the Report
• Detailed description of your approach
• Parameter settings (if parametric)
• System performance on the training topics
  • The baseline performance
  • The performance of your approach
• Division of the work
• What you have learned
• Others (optional)
Basic IR Models
• Vector space model
  • Lucene
• Probabilistic model
  • Okapi BM25
• Language model
  • Indri (Lemur toolkit)
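To make the probabilistic baseline concrete, the following is a minimal sketch of Okapi BM25 scoring over pre-tokenized documents. It is an illustration only, not the implementation used by any of the toolkits above. The function name `bm25_scores`, the toy documents, and the parameter values `k1=1.2` and `b=0.75` are assumptions (the values are common defaults, not settings prescribed by the project), and the IDF variant shown is one of several in use.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Non-negative IDF variant (assumed; toolkits differ here).
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[t] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical toy collection.
docs = [
    ["jennifer", "aniston", "movie", "list"],
    ["television", "programs", "schedule"],
    ["jennifer", "aniston", "aniston", "television"],
]
scores = bm25_scores(["jennifer", "aniston"], docs)
```

In this toy collection the second document contains no query term and scores zero, while the third outranks the first because it has the same length but a higher term frequency.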
Possible Approaches
• Pseudo relevance feedback (PRF)
  • Supported by the Lemur API
  • Simple and effective, but not original
• Query expansion
  • Using external resources, e.g., WordNet, Wikipedia, query logs
• Word sense disambiguation in documents/queries
• Combining results from two or more IR systems
• Learning to rank
  • What are the useful features?
• Others
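As one possible way to combine results from two or more IR systems, the sketch below uses reciprocal rank fusion (RRF). RRF is a technique chosen here for illustration; the slides do not prescribe a particular fusion method. The function name, the constant `k=60` (a conventional choice), and the two sample rankings are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, docno in enumerate(ranking, start=1):
            # Each system contributes 1 / (k + rank) for a document it retrieved.
            fused[docno] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by document ID.
    return sorted(fused, key=lambda d: (-fused[d], d))

# Hypothetical rankings from two systems for the same topic.
run_a = ["WTX017-B13-74", "WTX017-B38-11", "WTX017-B38-41"]
run_b = ["WTX017-B38-11", "WTX018-B44-359", "WTX017-B13-74"]
fused = reciprocal_rank_fusion([run_a, run_b])
```

Because RRF uses only ranks, it avoids having to normalize scores that come from different retrieval models.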
Experimental Dataset
• A partial collection of TREC WT10g
• Link information is provided
• 30 topics for system development
• Another 30 topics for the demo
Topic Example
<top>
<num> Number: 476
<title> Jennifer Aniston
<desc> Description:
Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
<narr> Narrative:
Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
</top>
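A topic file in this layout can be read with a few regular expressions. The sketch below assumes that every topic in “training_topics.txt” follows the structure of the example above; the helper names `_field` and `parse_topics` are hypothetical.

```python
import re

def _field(block, pattern):
    """Return the first captured group with whitespace collapsed, or ''."""
    m = re.search(pattern, block, flags=re.S)
    return " ".join(m.group(1).split()) if m else ""

def parse_topics(text):
    """Return a list of dicts with the keys num, title, desc, and narr."""
    topics = []
    for block in re.findall(r"<top>(.*?)</top>", text, flags=re.S):
        topics.append({
            "num": _field(block, r"<num>\s*Number:\s*(\d+)"),
            "title": _field(block, r"<title>(.*?)<desc>"),
            "desc": _field(block, r"<desc>\s*Description:(.*?)<narr>"),
            "narr": _field(block, r"<narr>\s*Narrative:(.*)"),
        })
    return topics

sample = """<top>
<num> Number: 476
<title> Jennifer Aniston
<desc> Description:
Find documents that identify movies and/or television programs that
Jennifer Aniston has appeared in.
<narr> Narrative:
Relevant documents include movies and/or television programs that
Jennifer Aniston has appeared in.
</top>
"""
topics = parse_topics(sample)
```

The title field is the natural starting query; the description and narrative can serve as sources of expansion terms.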
Document Example
<DOC>
<DOCNO>WTX010-B01-2</DOCNO>
<DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
<DOCHDR>
http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
HTTP/1.0 200 OK
Date: Sunday, 16-Feb-97 18:19:32 GMT
Server: NCSA/SMI-1.0
MIME-version: 1.0
Content-type: text/html
Last-modified: Friday, 02-Feb-96 19:51:15 GMT
Content-length: 82
</DOCHDR>
<sup>1</sup> Mr. Delleney did not participate in deliberation of this candidate.
</DOC>
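Before indexing, each raw document has to be split into its identifier, header, and body. The sketch below is one way to do this, assuming all documents follow the layout of the example above. The function name `parse_docs` is hypothetical, and the regex-based tag stripping is a crude stand-in for a real HTML parser.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.S)

def parse_docs(text):
    """Yield (docno, url, body_text) for every <DOC> block in `text`."""
    for block in DOC_RE.findall(text):
        docno = re.search(r"<DOCNO>(.*?)</DOCNO>", block, re.S).group(1).strip()
        hdr = re.search(r"<DOCHDR>(.*?)</DOCHDR>", block, re.S)
        header = hdr.group(1).strip() if hdr else ""
        # The first token of the header is assumed to be the page URL.
        url = header.split()[0] if header else ""
        # Everything after the header is treated as page content.
        body = block[hdr.end():] if hdr else block
        body = re.sub(r"<[^>]+>", " ", body)  # crude HTML tag removal
        yield docno, url, " ".join(body.split())

sample = """<DOC>
<DOCNO>WTX010-B01-2</DOCNO>
<DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
<DOCHDR>
http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
HTTP/1.0 200 OK
Content-type: text/html
Content-length: 82
</DOCHDR>
<sup>1</sup> Mr. Delleney did not participate in deliberation of this candidate.
</DOC>
"""
docs = list(parse_docs(sample))
```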
Link Information
• In-links
  • “A B C”: B and C contain links to A
  • e.g., WTX010-B01-118 WTX010-B01-114 WTX010-B01-121
• Out-links
  • “A B C”: A contains links pointing to B and C
  • e.g., WTX010-B01-127 WTX010-B01-89 WTX010-B01-119
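Both link files share the same line layout, so a single reader can serve for either. The sketch below assumes one whitespace-separated record per line, as in the examples above; `parse_links` is a hypothetical helper name.

```python
def parse_links(lines):
    """Map the first document ID on each line to the list of IDs that follow it."""
    links = {}
    for line in lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        links.setdefault(ids[0], []).extend(ids[1:])
    return links

# For in-links the values are the pages that link TO the key;
# for out-links they are the pages the key links to.
in_links = parse_links(["WTX010-B01-118 WTX010-B01-114 WTX010-B01-121"])
out_links = parse_links(["WTX010-B01-127 WTX010-B01-89 WTX010-B01-119"])

# In-degree is a simple link-based feature, e.g. for learning to rank.
in_degree = {doc: len(sources) for doc, sources in in_links.items()}
```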
Evaluation
• Evaluate the top 100 retrieved documents
• Evaluation metrics
  • Mean average precision (MAP)
  • NDCG@15
• Use the program “ireval” to evaluate system performance
• Usage of ireval
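For quick checks during development, the two metrics can be approximated in a few lines. The sketch below uses the textbook definitions of average precision and NDCG with a log2 rank discount; the exact gain and discount used by “ireval” are not stated in the slides and may differ, so official numbers should come from that program. The function names and toy data are hypothetical.

```python
import math

def average_precision(ranked, relevant, cutoff=100):
    """Average precision over the top `cutoff` results for one topic."""
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked[:cutoff], start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, gains, k=15):
    """NDCG@k, where `gains` maps docno to a relevance grade."""
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(d, 0) for d in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Toy example: d1 and d3 are retrieved, d9 is relevant but missed.
ranked = ["d1", "d2", "d3", "d4"]
ap = average_precision(ranked, {"d1", "d3", "d9"})
ndcg = ndcg_at_k(ranked, {"d1": 1, "d3": 1, "d9": 1})
```

MAP is then the mean of `average_precision` over all topics.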
Example Result for Evaluation
465 Q0 WTX017-B13-74 1 5 test
465 Q0 WTX017-B38-11 2 4.5 test
465 Q0 WTX017-B38-41 3 4.3 test
465 Q0 WTX017-B38-42 4 4.2 test
465 Q0 WTX017-B40-46 5 4.1 test
465 Q0 WTX018-B44-359 6 3.5 test
465 Q0 WTX018-B44-300 7 3 test
465 Q0 WTX012-B01-121 8 2.5 test
465 Q0 WTX019-B37-27 9 2 test
465 Q0 WTX019-B37-31 10 1.9 test
474 Q0 WTX012-B01-151 1 9 test
474 Q0 WTX017-B38-46 2 8 test
474 Q0 WTX018-B44-35 3 7 test
474 Q0 WTX013-B03-335 4 6 test
474 Q0 WTX018-B44-30 5 5 test
474 Q0 WTX015-B25-285 6 4 test
474 Q0 WTX019-B37-27 7 3 test
474 Q0 WTX014-B39-281 8 2 test
474 Q0 WTX018-B14-294 9 1.5 test
474 Q0 WTX018-B20-109 10 1 test
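The result lines above follow the standard TREC run layout: topic number, the literal `Q0`, document ID, rank, score, and run tag. A system's scored output can be written in that layout with a small helper such as the sketch below. The name `format_run` and its defaults are assumptions; the limit of 100 mirrors the evaluation depth stated on the Evaluation slide.

```python
def format_run(topic_id, scored_docs, tag="test", limit=100):
    """Return result lines: topic, the literal Q0, docno, rank, score, run tag."""
    # Highest score first, truncated to the evaluated depth.
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)[:limit]
    return [
        f"{topic_id} Q0 {docno} {rank} {score:g} {tag}"
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]

lines = format_run(465, [("WTX017-B38-11", 4.5), ("WTX017-B13-74", 5.0)])
```

Here `lines` reproduces the first two rows of the example for topic 465.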
Dataset Description (1/2)
• “training_topics.txt” (file)
  • 30 topics for system development
• “qrels_training_topics.txt” (file)
  • Relevance judgments for the training topics
• “documents” (directory)
  • Contains 10 .rar files of raw documents
• “in_links.txt” (file)
  • In-link information
• “out_links.txt” (file)
  • Out-link information
Dataset Description (2/2)
• “ireval.jar” (file)
  • A Java program for evaluation
• “irevalGUI.jar” (file)
  • GUI for ireval.jar