Information Retrieval and Extraction 2010 Term Project – Modern Web Search Advisor: 陳信希 TA: 許名宏 & 王界人

Overview (in English)
• Goal
  • Use advanced approaches to improve on Okapi-BM25
• Group
  • 1 to 3 people per group; email the list of members to the TA
• Approach
  • No limitations; any resource on the Web may be used
• Date of system demo and report submission
  • 6/24, Thursday (provisional)
• Grading criteria
  • Originality and reasonableness of your approach
  • Implementation effort per person
  • Retrieval performance (training and testing)
  • Completeness of the report (division of work, analysis of results)

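Overview (translated from Chinese)
• Project goal
  • Improve the effectiveness of Okapi-BM25 with advanced IR techniques
• Groups
  • 1 to 3 people per group; the group leader should e-mail the member list (student ID and name) to the TA
• Approach
  • No restrictions; any toolkit or resource on the Web may be used
• Demo and report submission
  • 6/25, Friday
• Grading criteria
  • Originality and reasonableness of the chosen approach
  • Implementation effort per person
  • Retrieval performance (training and testing)
  • Completeness of the report, division of work, and analysis of retrieval results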

Content of Report
• Detailed description of your approach
  • Parameter settings (if parametric)
• System performance on the training topics
  • The baseline (Okapi-BM25) performance
  • The performance of your approach
• Division of the work (who did what)
• What you have learned (reflections)
• Others (optional)

Baseline Implementation: Okapi-BM25
• Parametric probabilistic model
• Parameter setting
  • k1 = 1.2, k2 = 0, k3 = 0, b = 0.75, R = r = 0 (initial guess)
• Stemming: Porter’s stemmer
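
With the settings above, the general Okapi formula simplifies: R = r = 0 reduces the relevance weight to log((N - n + 0.5) / (n + 0.5)), k3 = 0 makes the query-term-frequency factor equal to 1, and k2 = 0 removes the document-length correction term. The following is a minimal sketch under those assumptions, not a reference implementation. The function name `bm25_scores` and the `docs` dictionary are hypothetical, and tokenization and Porter stemming are assumed to have been done beforehand.

```python
import math
from collections import Counter

K1 = 1.2   # term-frequency saturation (value from the slide)
B = 0.75   # document-length normalization (value from the slide)


def bm25_scores(query_terms, docs, k1=K1, b=B):
    """Score every document in `docs` against `query_terms`.

    `docs` (hypothetical structure) maps a document id to its list of
    already stemmed tokens. Only documents that contain at least one
    query term appear in the returned {doc_id: score} dictionary.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    avdl = sum(len(tokens) for tokens in docs.values()) / n_docs
    if avdl == 0:
        return {}
    tfs = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    # Document frequency n of each distinct query term.
    df = {t: sum(1 for tf in tfs.values() if t in tf) for t in set(query_terms)}

    scores = {}
    for doc_id, tf in tfs.items():
        # K = k1 * ((1 - b) + b * dl / avdl)
        norm = k1 * ((1 - b) + b * len(docs[doc_id]) / avdl)
        score, matched = 0.0, False
        for term, n in df.items():
            f = tf.get(term, 0)
            if f == 0:
                continue
            matched = True
            # This weight is negative for terms in more than half the documents.
            idf = math.log((n_docs - n + 0.5) / (n + 0.5))
            score += idf * (k1 + 1) * f / (norm + f)
        if matched:
            scores[doc_id] = score
    return scores
```

Sorting the returned dictionary by descending score and keeping the first 100 entries gives the ranked list for one topic.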

Possible Approaches
• Pseudo relevance feedback (PRF)
  • Supported by the Lemur API
  • Simple and effective, but not original
• Query expansion
  • Using external resources, e.g. WordNet, Wikipedia, query logs (AOL), etc.
• Word sense disambiguation in documents and queries
• Combining results from two or more IR systems
• Latent semantic analysis (LSA, also called latent semantic indexing, LSI)
• Others
  • Learning to rank, clustering/classification, …
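
As one illustration of combining results from two or more IR systems, the sketch below uses CombSUM, a common fusion rule that sums normalized scores. The slides do not prescribe this particular rule; it is chosen here only as an example, and the function names and the min-max normalization are assumptions.

```python
def min_max_normalize(scores):
    """Rescale a {docno: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so every document gets the same weight.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(*runs):
    """Fuse several {docno: score} runs by summing their normalized scores.

    Returns (docno, fused_score) pairs sorted by descending score,
    with ties broken by docno so the output is deterministic.
    """
    fused = {}
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))
```

A document retrieved by only one system contributes a single normalized score, so documents found by several systems tend to move up the fused ranking.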

Experimental Dataset
• A partial collection of TREC WT10g
  • ~10k documents
  • Link information is provided
• 30 topics for system development (training)
• Another 20 topics for the demo (testing)

Topic Example
<top>
<num> Number: 476
<title> Jennifer Aniston
<desc> Description:
Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
<narr> Narrative:
Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
</top>
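
A topic file in this layout can be read with a small regular-expression parser. This is a sketch that assumes every topic follows exactly the field order and labels of the example above ("Number:", "Description:", "Narrative:"); the function name `parse_topics` is hypothetical, and topics that deviate from the layout are silently skipped.

```python
import re

TOPIC_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
FIELD_RE = re.compile(
    r"<num>\s*Number:\s*(?P<num>\d+)\s*"
    r"<title>\s*(?P<title>.*?)\s*"
    r"<desc>\s*Description:\s*(?P<desc>.*?)\s*"
    r"<narr>\s*Narrative:\s*(?P<narr>.*?)\s*$",
    re.DOTALL,
)


def parse_topics(text):
    """Return a list of dicts with the keys num, title, desc and narr."""
    topics = []
    for block in TOPIC_RE.findall(text):
        match = FIELD_RE.search(block)
        if match is None:
            continue  # layout differs from the assumed one
        # Collapse line breaks and repeated spaces inside each field.
        topics.append({k: " ".join(v.split()) for k, v in match.groupdict().items()})
    return topics
```

For the baseline, only the `title` field would be passed on as the query.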

Document Example
<DOC>
<DOCNO>WTX010-B01-2</DOCNO>
<DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
<DOCHDR>
http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
HTTP/1.0 200 OK
Date: Sunday, 16-Feb-97 18:19:32 GMT
Server: NCSA/SMI-1.0
MIME-version: 1.0
Content-type: text/html
Last-modified: Friday, 02-Feb-96 19:51:15 GMT
Content-length: 82
</DOCHDR>
<sup>1</sup> Mr. Delleney did not participate in deliberation of this candidate.
</DOC>
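
Before indexing, each document has to be reduced to its identifier and its text. The sketch below assumes that all documents follow the layout of the example above; the function name `parse_documents` is hypothetical. It strips tags crudely with a regular expression and does not decode HTML entities, so a proper HTML parser may be preferable for real pages.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
DOCOLDNO_RE = re.compile(r"<DOCOLDNO>.*?</DOCOLDNO>", re.DOTALL)
DOCHDR_RE = re.compile(r"<DOCHDR>.*?</DOCHDR>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def parse_documents(text):
    """Yield (docno, plain_text) pairs from a TREC-style document file."""
    for block in DOC_RE.findall(text):
        docno = DOCNO_RE.search(block)
        if docno is None:
            continue  # no identifier, nothing to index
        # Drop the metadata sections, then any remaining HTML tags.
        body = DOCNO_RE.sub(" ", block)
        body = DOCOLDNO_RE.sub(" ", body)
        body = DOCHDR_RE.sub(" ", body)
        body = TAG_RE.sub(" ", body)
        yield docno.group(1), " ".join(body.split())
```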

Link Information
• For approaches that use PageRank/HITS
• In-links
  • “A B C”: B and C contain links to A
  • e.g. WTX010-B01-118 WTX010-B01-114 WTX010-B01-121
• Out-links
  • “A B C”: A contains links pointing to B and C
  • e.g. WTX010-B01-127 WTX010-B01-89 WTX010-B01-119
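
The out-link format can feed a link-analysis score directly. The following is a minimal PageRank sketch using power iteration. It assumes one whitespace-separated "A B C" record per line as described above; the damping factor of 0.85, the fixed 50 iterations, and the function names are illustrative assumptions, not values given in the project description.

```python
def read_out_links(lines):
    """Parse 'A B C' lines (A links to B and C) into {source: set(targets)}."""
    graph = {}
    for line in lines:
        ids = line.split()
        if not ids:
            continue
        source, targets = ids[0], ids[1:]
        graph.setdefault(source, set()).update(targets)
        for target in targets:
            graph.setdefault(target, set())  # make sure every node is present
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Return {node: rank}; the ranks sum to 1."""
    nodes = sorted(graph)
    n = len(nodes)
    if n == 0:
        return {}
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

How to combine such a query-independent score with the BM25 score (for example by a weighted sum) is a design decision left to each group.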

Evaluation
• Evaluate the top 100 retrieved documents
• Evaluation metrics
  • Mean average precision (MAP)
  • P@20
• Use the program “trec_eval” to evaluate system performance
• Usage of trec_eval

Example Result for Evaluation
(topic-num) (dummy) (docno) (rank) (score) (run-tag)
465 Q0 WTX017-B13-74 1 5 test
465 Q0 WTX017-B38-11 2 4.5 test
465 Q0 WTX017-B38-41 3 4.3 test
465 Q0 WTX017-B38-42 4 4.2 test
465 Q0 WTX017-B40-46 5 4.1 test
465 Q0 WTX018-B44-359 6 3.5 test
465 Q0 WTX018-B44-300 7 3 test
465 Q0 WTX012-B01-121 8 2.5 test
465 Q0 WTX019-B37-27 9 2 test
465 Q0 WTX019-B37-31 10 1.9 test
474 Q0 WTX012-B01-151 1 9 test
474 Q0 WTX017-B38-46 2 8 test
474 Q0 WTX018-B44-35 3 7 test
474 Q0 WTX013-B03-335 4 6 test
474 Q0 WTX018-B44-30 5 5 test
474 Q0 WTX015-B25-285 6 4 test
474 Q0 WTX019-B37-27 7 3 test
474 Q0 WTX014-B39-281 8 2 test
474 Q0 WTX018-B14-294 9 1.5 test
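
A result file in the six-column layout above can be produced from per-topic scores with a few lines of code. This is a sketch: the function name `format_run`, the `results` structure, and the fixed six-decimal score format are assumptions; the cutoff of 100 follows the evaluation depth stated for this project.

```python
def format_run(results, run_tag="test", depth=100):
    """Return result lines: topic-num, dummy, docno, rank, score, run-tag.

    `results` (hypothetical structure) maps a topic number to a
    {docno: score} dictionary. Documents are ranked by descending score,
    with ties broken by docno so the output is deterministic.
    """
    lines = []
    for topic in sorted(results):
        ranked = sorted(results[topic].items(), key=lambda item: (-item[1], item[0]))
        for rank, (docno, score) in enumerate(ranked[:depth], start=1):
            lines.append(f"{topic} Q0 {docno} {rank} {score:.6f} {run_tag}")
    return lines
```

Writing `"\n".join(format_run(results))` to a text file gives the input expected on the retrieval side of the evaluation.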

Example of Relevance Judgments
(topic-num) (dummy) (docno) (relevance)
465 0 WTX017-B13-74 1
465 0 WTX017-B38-46 1
465 0 WTX018-B44-359 1
465 0 WTX019-B37-27 2
474 0 WTX012-B01-151 1
474 0 WTX013-B03-335 1
474 0 WTX014-B39-281 1
474 0 WTX015-B25-285 1
474 0 WTX018-B20-109 2
474 0 WTX018-B14-294 1
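
To make the two metrics concrete, the sketch below computes average precision and P@20 from a ranked list and the relevance judgments. It is meant as a sanity check next to trec_eval, not a replacement for it. It assumes that any relevance value greater than 0 counts as relevant and that MAP is averaged over topics with at least one relevant document, so its numbers may differ from trec_eval in edge cases. All function names are hypothetical.

```python
def parse_qrels(lines):
    """Return {topic: set of relevant docnos} from 4-column judgment lines."""
    qrels = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip headers and malformed lines
        topic, _, docno, rel = parts
        relevant = qrels.setdefault(topic, set())
        if int(rel) > 0:  # assumption: every positive grade is relevant
            relevant.add(docno)
    return qrels


def average_precision(ranked, relevant, depth=100):
    """Average of the precision values at the ranks of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, docno in enumerate(ranked[:depth], start=1):
        if docno in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def precision_at(ranked, relevant, k=20):
    """Fraction of the top k positions filled with relevant documents."""
    return sum(1 for docno in ranked[:k] if docno in relevant) / k


def mean_average_precision(run, qrels, depth=100):
    """Mean AP over topics that have at least one relevant document.

    `run` maps a topic to its ranked docno list; topics missing from
    `run` contribute an AP of 0.
    """
    topics = [t for t, rel in qrels.items() if rel]
    if not topics:
        return 0.0
    return sum(average_precision(run.get(t, []), qrels[t], depth) for t in topics) / len(topics)
```

For topic 465 in the examples above, the relevant documents appear at ranks 1, 6 and 9 and one relevant document is not retrieved, so the average precision is (1/1 + 2/6 + 3/9) / 4 and P@20 is 3/20.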

Summary of What to Do
• Implement Okapi-BM25 (baseline)
  • With the fixed settings
• Evaluate the baseline approach with the training topics
  • Using the terms in <title> as the query
• Survey or design your enhanced approach
• Evaluate and optimize your approach with the training topics
• Submit the report and give the demo with the testing topics
  • Evaluate Okapi-BM25 and your approach with the testing topics

Dataset Description (1/2)
• “training_topics.txt” (file)
  • 30 topics for system development
• “qrels_training_topics.txt” (file)
  • Relevance judgments for the training topics
• “documents” (directory)
  • Contains 10 .rar files of raw documents
• “in_links.txt” (file)
  • In-link information
• “out_links.txt” (file)
  • Out-link information

Dataset Description (2/2)
• “trec_eval.exe” (file)
  • Binary evaluation program
• “trec_eval.8.1.rar” (file)
  • Source code of trec_eval, for building on UNIX
