
Overview (in English)

Information Retrieval and Extraction 2009 Term Project – Modern Web Search. Advisor: 陳信希. TA: 蔡銘峰 & 許名宏. The goal: use advanced approaches to improve the performance of basic IR models. Group: 1~3 person(s) per group; email the name list to the TA.

jvelazquez

Presentation Transcript


  1. Information Retrieval and Extraction 2009 Term Project – Modern Web Search
     Advisor: 陳信希
     TA: 蔡銘峰 & 許名宏

  2. Overview (in English)
     • The goal
       • Use advanced approaches to improve the performance of basic IR models
     • Group
       • 1~3 person(s) per group; email the name list to the TA
     • Approach
       • No limitations; any resource on the Web may be used
     • Date of system demo and report submission
       • 6/18 Thursday (provisional)
     • Criteria for the grade
       • Originality and reasonableness of your approach
       • Implementation effort per person
       • Retrieval performance (training & testing)
       • Completeness of the report, division of the work, and analysis of the retrieval results

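  3. Overview (in Chinese)
     • Project goal
       • Improve the performance of basic retrieval models with advanced IR techniques
     • Groups
       • 1~3 people per group; the group leader should e-mail the member list (student ID, name) to the TA
     • Approach
       • No restrictions; any toolkit or resource on the Web may be used
     • Demo and report submission
       • Tentatively 6/18 Thursday
     • Grading criteria
       • Creativity and reasonableness of the chosen approach
       • Implementation effort per person
       • Retrieval performance (training & testing)
       • Completeness of the report, division of the work, and analysis of the retrieval results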

  4. Content in the Report
     • Detailed description of your approach
     • Parameter settings (if parametric)
     • System performance on the training topics
       • The baseline performance
       • The performance of your approach
     • Division of the work (who did what)
     • What you have learned (reflections)
     • Others (optional)

  5. Basic IR Models
     • Vector space model
       • Lucene
     • Probabilistic model
       • Okapi-BM25
     • Language model
       • Indri (Lemur toolkit)
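
To make the probabilistic baseline concrete, here is a minimal Python sketch of Okapi-BM25 scoring over pre-tokenized documents. It is not the Lemur or Lucene implementation: the function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against `query_terms`.

    k1 and b are common textbook defaults, assumed here; tune them on
    the training topics.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF; the +1 inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy collection (hypothetical text, not taken from WT10g).
docs = [
    "jennifer aniston stars in the movie".split(),
    "state budget report".split(),
    "aniston television programs and movies".split(),
]
scores = bm25_scores(["aniston", "movie"], docs)
```

The first document matches both query terms and should score highest; the second matches neither and scores zero. Real systems would also apply stemming and stopword removal before scoring.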

  6. Possible Approaches
     • Pseudo relevance feedback (PRF)
       • Supported by Lemur API
       • Simple and effective, but no originality
     • Query expansion
       • Using external resources, e.g. WordNet, Wikipedia, query logs, etc.
     • Word sense disambiguation in docs/query
     • Combining results from 2 or more IR systems
     • Learning to rank
       • What are the useful features?
     • Others
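
One way to combine results from two or more IR systems is score fusion. The sketch below uses CombSUM with min-max normalization, which is a common choice but only one of several options and not something the slides prescribe. The function names and the two toy runs are hypothetical.

```python
from collections import defaultdict

def min_max_normalize(run):
    """Rescale the scores of one run ({docno: score}) to [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        # All scores equal: treat every document as equally good.
        return {doc: 1.0 for doc in run}
    return {doc: (s - lo) / (hi - lo) for doc, s in run.items()}

def comb_sum(runs):
    """Fuse several runs by summing their normalized scores (CombSUM)."""
    fused = defaultdict(float)
    for run in runs:
        if not run:
            continue  # an empty run contributes nothing
        for doc, s in min_max_normalize(run).items():
            fused[doc] += s
    # Highest fused score first; ties broken by document number.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

# Two hypothetical runs for the same topic, e.g. from BM25 and a language model.
run_a = {"WTX017-B13-74": 5.0, "WTX017-B38-11": 4.5, "WTX017-B38-41": 4.3}
run_b = {"WTX017-B38-11": 12.0, "WTX017-B13-74": 9.0, "WTX018-B44-359": 3.0}
fused = comb_sum([run_a, run_b])
```

Normalizing first matters because different systems produce scores on different scales; without it, the system with the larger raw scores would dominate the sum.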

  7. Experimental Dataset
     • A partial collection of TREC WT10g
       • Link information is provided
     • 30 topics for system development
     • Another 30 topics for the demo

  8. Topic Example
     <top>
     <num> Number: 476
     <title> Jennifer Aniston
     <desc> Description:
     Find documents that identify movies and/or television programs that Jennifer Aniston has appeared in.
     <narr> Narrative:
     Relevant documents include movies and/or television programs that Jennifer Aniston has appeared in.
     </top>
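
A topic file in this layout can be read with a small regular-expression parser. The following Python sketch assumes every topic has exactly the four fields shown in the example, in that order; the names `TOPIC_RE` and `parse_topics` are hypothetical.

```python
import re

# Assumes the field order <num>, <title>, <desc>, <narr> seen in the example.
TOPIC_RE = re.compile(
    r"<top>\s*<num>\s*Number:\s*(?P<num>\d+)\s*"
    r"<title>\s*(?P<title>.*?)\s*"
    r"<desc>\s*Description:\s*(?P<desc>.*?)\s*"
    r"<narr>\s*Narrative:\s*(?P<narr>.*?)\s*</top>",
    re.DOTALL,
)

def parse_topics(text):
    """Return one dict per topic, with internal whitespace collapsed."""
    return [
        {field: " ".join(value.split()) for field, value in m.groupdict().items()}
        for m in TOPIC_RE.finditer(text)
    ]

sample = """<top>
<num> Number: 476
<title> Jennifer Aniston
<desc> Description:
Find documents that identify movies and/or television programs
that Jennifer Aniston has appeared in.
<narr> Narrative:
Relevant documents include movies and/or television programs
that Jennifer Aniston has appeared in.
</top>"""
topics = parse_topics(sample)
```

The title field is usually the query itself, while the description and narrative fields are natural sources of extra terms for query expansion.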

  9. Document Example
     <DOC>
     <DOCNO>WTX010-B01-2</DOCNO>
     <DOCOLDNO>IA011-000115-B026-169</DOCOLDNO>
     <DOCHDR>
     http://www.lpitr.state.sc.us:80/reports/jsrf14.htm 167.7.18.68 19970216181104 text/html 264
     HTTP/1.0 200 OK
     Date: Sunday, 16-Feb-97 18:19:32 GMT
     Server: NCSA/SMI-1.0
     MIME-version: 1.0
     Content-type: text/html
     Last-modified: Friday, 02-Feb-96 19:51:15 GMT
     Content-length: 82
     </DOCHDR>
     <sup>1</sup> Mr. Delleney did not participate in deliberation of this candidate.
     </DOC>

  10. Link Information
      • In-links
        • “A B C”: B and C contain links to A
        • ex: WTX010-B01-118 WTX010-B01-114 WTX010-B01-121
      • Out-links
        • “A B C”: A contains links pointing to B and C
        • ex: WTX010-B01-127 WTX010-B01-89 WTX010-B01-119
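
The link files can be loaded into adjacency lists with a few lines of Python. This sketch assumes one whitespace-separated record per line, as the examples suggest; the function name `parse_links` is hypothetical, and the in-degree count is just one possible feature to derive from the data.

```python
def parse_links(lines):
    """Map the first document ID on each line to the IDs that follow it.

    Assumes one whitespace-separated record per line.
    """
    links = {}
    for line in lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        links.setdefault(ids[0], []).extend(ids[1:])
    return links

# In-links: the first ID is the target, the rest are pages linking to it.
in_links = parse_links(["WTX010-B01-118 WTX010-B01-114 WTX010-B01-121"])
# Out-links: the first ID is the source, the rest are the pages it links to.
out_links = parse_links(["WTX010-B01-127 WTX010-B01-89 WTX010-B01-119"])

# In-degree as a simple link-based prior.
in_degree = {doc: len(sources) for doc, sources in in_links.items()}
```

To read the real files, pass an open file handle, for example `parse_links(open("in_links.txt"))`.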

  11. Evaluation
      • Evaluate top 100 retrieved documents
      • Evaluation metrics
        • Mean average precision (MAP)
        • NDCG@15
      • Use the program “ireval” to evaluate system performance
        • Usage of ireval
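
For quick checks during development, the two metrics can be approximated in a few lines of Python. This is a sketch of the textbook definitions, not a reimplementation of ireval, whose exact formulas (in particular its NDCG gain and discount) are not given in the slides. The function names and the toy ranking are hypothetical.

```python
import math

def average_precision(ranked, relevant, cutoff=100):
    """Average precision of one ranked list, evaluated down to `cutoff`."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / len(relevant) if relevant else 0.0

def ndcg_at_k(ranked, gains, k=15):
    """NDCG@k with gain = relevance grade and a log2(rank + 1) discount.

    This is one common variant, assumed here.
    """
    def dcg(values):
        return sum(g / math.log2(i + 1) for i, g in enumerate(values, start=1))
    actual = dcg([gains.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Toy example: three relevant documents, two of them retrieved.
ranked = ["d1", "d2", "d3", "d4"]
ap = average_precision(ranked, {"d1", "d3", "d9"})
ndcg = ndcg_at_k(ranked, {"d1": 1, "d3": 1, "d9": 1})
```

MAP is the mean of `average_precision` over all topics. Reported numbers should still come from ireval so that they are comparable across groups.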

  12. Example Result for Evaluation
      465 Q0 WTX017-B13-74 1 5 test
      465 Q0 WTX017-B38-11 2 4.5 test
      465 Q0 WTX017-B38-41 3 4.3 test
      465 Q0 WTX017-B38-42 4 4.2 test
      465 Q0 WTX017-B40-46 5 4.1 test
      465 Q0 WTX018-B44-359 6 3.5 test
      465 Q0 WTX018-B44-300 7 3 test
      465 Q0 WTX012-B01-121 8 2.5 test
      465 Q0 WTX019-B37-27 9 2 test
      465 Q0 WTX019-B37-31 10 1.9 test
      474 Q0 WTX012-B01-151 1 9 test
      474 Q0 WTX017-B38-46 2 8 test
      474 Q0 WTX018-B44-35 3 7 test
      474 Q0 WTX013-B03-335 4 6 test
      474 Q0 WTX018-B44-30 5 5 test
      474 Q0 WTX015-B25-285 6 4 test
      474 Q0 WTX019-B37-27 7 3 test
      474 Q0 WTX014-B39-281 8 2 test
      474 Q0 WTX018-B14-294 9 1.5 test
      474 Q0 WTX018-B20-109 10 1 test
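
The example appears to follow the usual TREC run layout: topic number, the literal `Q0`, document number, rank, score, and a run tag. Under that assumption, a result file can be produced with a sketch like the following; the function name `format_run` and the score dictionary are hypothetical.

```python
def format_run(topic_id, scored_docs, tag="test", limit=100):
    """Turn {docno: score} into result lines, best score first.

    `limit` defaults to 100 because only the top 100 documents are evaluated.
    """
    ranked = sorted(scored_docs.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    return [
        f"{topic_id} Q0 {docno} {rank} {score:g} {tag}"
        for rank, (docno, score) in enumerate(ranked, start=1)
    ]

lines = format_run(
    465,
    {"WTX017-B38-11": 4.5, "WTX017-B13-74": 5, "WTX017-B38-41": 4.3},
)
```

Writing `"\n".join(lines)` to a file, one topic after another, gives input in the same shape as the example above.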

  13. Dataset Description (1/2)
      • “training_topics.txt” (file)
        • 30 topics for system development
      • “qrels_training_topics.txt” (file)
        • Relevance judgments for training topics
      • “documents” (directory)
        • Contains 10 .rar files of raw documents
      • “in_links.txt” (file)
        • In-link information
      • “out_links.txt” (file)
        • Out-link information

  14. Dataset Description (2/2)
      • “ireval.jar” (file)
        • A Java program for evaluation
      • “irevalGUI.jar” (file)
        • GUI of ireval.jar
