WBIA Project 2 – Retrieval & Evaluation
LI Geng
Nov. 10, 2008
Guidelines
• Information retrieval evaluation – a brief review
• Goals of this assignment
• Tools & work environment
  • Nutch-0.9
  • Lucene-2.1.0
• Assignment instructions
• Submission & grading policies
Previously in Project 1 - Crawling
• Tool: Nutch
• Target network: ccer.pku.edu.cn
• What we already have:
  • A web database that contains the web pages of CCER;
  • An inverted index of your data (you may not have noticed it yet);
  • Global PageRank results
Previously in Project 1 (Cont.)
• What we don’t have yet for a complete IR service:
  • Interpreting the user’s information need
  • Matching queries to web pages (at least page URLs)
  • Online retrieval service
I. Information Retrieval Evaluation – A Brief Review
• Project 2’s focus: matching queries to web pages
• What do we need to evaluate retrieval results?
  • Retrieval model implementation & optimization;
  • A standard test data set;
  • Pre-defined queries and their corresponding answer sets;
  • Evaluation with well-known metrics (MAP, i.e. mean average precision; P@10, i.e. precision at rank 10; etc.)
II. Goals of this Assignment
• Set up an online web search engine (using Nutch)
• Understand the information retrieval evaluation process
• Refine the existing retrieval model (by improving its evaluation metric scores)
How?
• A standard web page test set (Done.)
• Pre-defined queries and their corresponding answer sets (Done.)
• Retrieval model implementation
• Evaluation with well-known metrics (MAP, P@10, etc.)
III. Tools & work environment
• Nutch’s major modules:
  • Crawling
  • Indexing
  • Retrieval
  • Web search
  • …
• Of these, the indexing and retrieval modules are built on top of Lucene.
Lucene
• A framework for document retrieval using the Vector Space Model
  • Inverted index construction
  • Query matching
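To make the two bullets above concrete, here is a minimal Python sketch of inverted index construction and boolean query matching. It is not Lucene's implementation: the whitespace tokenizer, the function names `build_inverted_index` and `match_all_terms`, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings.

    docs: dict of doc_id -> text. Tokenization is a plain lowercase
    whitespace split (an assumption; Lucene uses pluggable analyzers).
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

def match_all_terms(index, query):
    """Return the ids of documents that contain every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return []
    doc_sets = [{doc_id for doc_id, _ in index.get(t, [])} for t in terms]
    return sorted(set.intersection(*doc_sets))

# Toy collection (hypothetical data).
docs = {
    1: "economic research center",
    2: "research seminar schedule",
    3: "center news",
}
index = build_inverted_index(docs)
print(index["research"])
print(match_all_terms(index, "research center"))
```

Each posting keeps the term frequency because a ranking model needs it later; a pure boolean matcher would only need the document ids.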
Lucene (Cont.)
• It does not handle (from http://darksleep.com/lucene):
  • managing the process (instantiating the objects and hooking them together, both for indexing and for searching)
  • selecting the data files
  • parsing the data files (e.g. Chinese word segmentation)
  • getting the search string from the user
  • displaying the search results to the user
• A “library” rather than a stand-alone application
Lucene (Cont.)
• But a library with useful utilities as standard extensions
• E.g.
  • package org.apache.lucene.analysis.standard;
  • Default document analysis (and tokenizing) utilities (i.e. they will be used if you don’t implement your own).
Lucene in Nutch
[Figure: Lucene packages used inside Nutch. Labels: Crawled Web Page; org.apache.lucene.analysis; org.apache.lucene.index; Posting Lists; Inverted Index; org.apache.lucene.search; org.apache.lucene.index; HitSet; Matched Documents]
• As a third-party library
  • try listing the $NUTCH_HOME/lib directory
Lucene in Nutch (Cont.)
[Figure panels: Index Construction; Retrieval]
• Nutch implements Lucene interfaces and imports Lucene classes in order to reuse its indexing and retrieval functionality.
• E.g.
  • In package org.apache.nutch.analysis;
  • public final class NutchDocumentTokenizer extends org.apache.lucene.analysis.Tokenizer implements NutchAnalysisConstants
• Refer to these packages for more details:
  • package org.apache.nutch.indexer;
  • package org.apache.nutch.analysis;
  • package org.apache.nutch.searcher;
Towards a complete IR Application
• Nutch’s major modules:
  • Crawling
  • Indexing
    • Try listing the root directory of your WebDB:
    • crawldb, indexes, linkdb, segments
  • Retrieval
  • Web search
  • …
IV. Assignment Instructions
• The test set and answer set:
  • Taken from one group’s previous crawl
  • Will be put online soon
• Retrieval
  • Enhance retrieval quality using your PageRank results
• Web search
  • Set up an online search engine with Nutch
Step 1 - Web Search Engine Setup
• This is the recommended first step in this assignment.
• It is relatively simple; Nutch’s online tutorial covers it in enough detail.
  • http://wiki.apache.org/nutch/NutchTutorial
• It will give you a first impression of the vector space retrieval model implemented by Lucene.
• Important: to save time with Nutch configuration, refer to my instructions in addition to the Nutch online tutorial:
  • http://162.105.80.59/WBIA_NutchConfigHelp.txt
Step 1 - Web Search Engine Setup (Cont.)
• Your task:
  • Compute retrieval metrics as the baseline for comparison
    • MAP, P@10
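A minimal Python sketch of the two metrics, using their standard textbook definitions. It assumes each ranked list contains no duplicate documents, and the function names and toy data are hypothetical; the course's official answer-set format may differ, so adapt the input parsing accordingly.

```python
def precision_at_k(ranked, relevant, k=10):
    """P@k: fraction of the top-k results that are relevant.

    If fewer than k results are returned, the missing ranks count as misses.
    """
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def average_precision(ranked, relevant):
    """AP: mean of the precision values at the rank of each relevant hit,
    divided by the total number of relevant documents for the query."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs, qrels):
    """MAP: average of AP over all queries in the answer set.

    runs:  query -> ranked list of document ids (the system output)
    qrels: query -> set of relevant document ids (the answer set)
    """
    if not qrels:
        return 0.0
    return sum(average_precision(runs.get(q, []), rel)
               for q, rel in qrels.items()) / len(qrels)

# Toy example (hypothetical data).
runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
qrels = {"q1": {"a", "c"}, "q2": {"y"}}
print(precision_at_k(runs["q1"], qrels["q1"], k=10))
print(mean_average_precision(runs, qrels))
```

Dividing AP by the number of relevant documents, not by the number of relevant documents retrieved, penalizes a system that misses part of the answer set.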
Step 2 – Lucene Retrieval Ranking Analysis
• Entry point:
  • class org.apache.lucene.search.IndexSearcher
• Hint: related classes, for reference:
  • class org.apache.lucene.search.BooleanQuery
  • class org.apache.lucene.search.BooleanQuery.BooleanWeight
Step 2 – Lucene Retrieval Ranking Analysis (Cont.)
• Your task:
  • Work out the formula Lucene uses to compute document scores.
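As a cross-check for your own derivation, the Python sketch below evaluates the TF-IDF style score commonly documented for Lucene's default similarity: coord × queryNorm × Σ over matching terms of tf × idf² × norm. Everything here is an assumption to verify against the Lucene-2.1.0 source: the factor definitions (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)), length norm = 1 / sqrt(numTerms)), all boosts equal to 1, a single field, and no lossy norm encoding. The function names are hypothetical.

```python
import math

def tf(freq):
    # Assumed default: square root of the term frequency in the document.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Assumed default: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Assumed default: shorter fields get a larger norm.
    return 1.0 / math.sqrt(num_terms)

def classic_score(query_terms, doc_term_freqs, doc_length, doc_freqs, num_docs):
    """Score one document for a query of distinct terms (all boosts = 1).

    doc_term_freqs: term -> frequency in this document
    doc_freqs:      term -> number of documents containing the term
    """
    if not query_terms:
        return 0.0
    idfs = {t: idf(doc_freqs.get(t, 0), num_docs) for t in query_terms}
    # queryNorm makes scores comparable across queries; it does not change ranking.
    query_norm = 1.0 / math.sqrt(sum(v * v for v in idfs.values()))
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    # coord rewards documents that match more of the query terms.
    coord = len(matched) / len(query_terms)
    norm = length_norm(doc_length)
    total = sum(tf(doc_term_freqs[t]) * idfs[t] ** 2 * norm for t in matched)
    return coord * query_norm * total

# One query term, 4 occurrences in a 4-term document, 2 documents in the index.
print(classic_score(["a"], {"a": 4}, 4, {"a": 1}, 2))
```

If the scores printed by your instrumented Lucene build disagree with this sketch, trust the source code and adjust the assumed factors.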
Step 3 – Integrate PageRank results with VSM
• Your task:
  • Find a way to combine the PageRank and VSM scores effectively to enhance retrieval quality.
  • Any ideas now?
• Required coding: edit
  • package org.apache.lucene.search
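One possible starting idea, sketched in Python: re-rank the VSM hit list with a linear interpolation of the normalized VSM score and the normalized log PageRank. This is only one of many options, not the expected solution. The weight `alpha`, the fallback value `default_pr`, and the function names are hypothetical, and the actual change would live in the Lucene search package rather than in a post-processing script.

```python
import math

def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def rerank(hits, pagerank, alpha=0.8, default_pr=1e-6):
    """Combine VSM scores with PageRank and return hits sorted by the new score.

    hits:       list of (url, vsm_score) for one query
    pagerank:   url -> PageRank value (assumed > 0)
    alpha:      weight of the VSM score; 1.0 reproduces the original VSM order
    default_pr: PageRank assumed for urls missing from the table (must be > 0)
    """
    if not hits:
        return []
    urls = [url for url, _ in hits]
    vsm = min_max([score for _, score in hits])
    # The log dampens the heavy-tailed PageRank distribution before scaling.
    pr = min_max([math.log(pagerank.get(url, default_pr)) for url in urls])
    combined = [(url, alpha * v + (1.0 - alpha) * p)
                for url, v, p in zip(urls, vsm, pr)]
    return sorted(combined, key=lambda item: item[1], reverse=True)

# Toy example (hypothetical data).
hits = [("a", 2.0), ("b", 1.9), ("c", 1.0)]
pagerank = {"a": 0.001, "b": 0.1, "c": 0.01}
print([url for url, _ in rerank(hits, pagerank, alpha=0.5)])
```

Sweeping `alpha` over a few values and recomputing MAP and P@10 for each is a simple way to see whether PageRank helps at all on this test set.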
Step 4 – Re-evaluate and Improve
• Based on your new model and retrieval results, recompute
  • MAP, P@10
• Compare the newly computed values with the previous ones; go back to Step 3 if there is still room for improvement.
Challenge Task 1
• Edit Lucene to implement the language model (then repeat the evaluation process and compare the results with VSM + PageRank)
• Hint:
  • Find out how Lucene stores and reads the posting lists, and figure out a way to use the data in them for LM similarity computation.
  • Or, you may consider reformatting the posting list store and inserting additional useful information.
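To show which statistics an LM scorer needs from the index, here is a minimal Python sketch of query-likelihood scoring with Dirichlet smoothing. The slide does not say which language model or smoothing method is expected, so this choice, the parameter `mu`, and the function name are assumptions. Note that besides the per-document term frequency found in posting lists, it needs the document length and collection-wide term counts.

```python
import math
from collections import Counter

def dirichlet_log_likelihood(query_terms, doc_terms, collection_tf,
                             collection_len, mu=2000.0):
    """log P(query | document) under a Dirichlet-smoothed unigram model.

    P(t | d) = (tf(t, d) + mu * P(t | collection)) / (|d| + mu)

    collection_tf:  term -> total occurrences in the whole collection
    collection_len: total number of term occurrences in the collection
    Query terms unseen in the whole collection are skipped (an assumption).
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        p_coll = collection_tf.get(term, 0) / collection_len
        if p_coll == 0.0:
            continue
        p_doc = (doc_tf[term] + mu * p_coll) / (doc_len + mu)
        score += math.log(p_doc)
    return score

# Toy collection (hypothetical data).
d1 = "research center research".split()
d2 = "news center".split()
collection_tf = Counter(d1 + d2)
collection_len = len(d1) + len(d2)
for name, doc in (("d1", d1), ("d2", d2)):
    print(name, dirichlet_log_likelihood(["research"], doc, collection_tf,
                                         collection_len, mu=10.0))
```

Scores are log probabilities, so they are negative and only their order matters; a document that lacks a query term still gets a nonzero probability from the collection model.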
Challenge Task 2
• Implement LSI (Latent Semantic Indexing) and evaluate
• In this case, could Lucene’s document scoring module still be reused?
• …
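For reference, the standard LSI formulation in brief. The notation is an assumption, not taken from the slides: A is the term-by-document matrix, k is the number of latent dimensions you choose, q is the query as a term vector, and d_j is column j of A.

```latex
A \approx A_k = U_k \Sigma_k V_k^{\top}, \qquad
\hat{q} = \Sigma_k^{-1} U_k^{\top} q, \qquad
\hat{d}_j = \Sigma_k^{-1} U_k^{\top} d_j, \qquad
\mathrm{sim}(q, d_j) = \frac{\hat{q} \cdot \hat{d}_j}{\lVert \hat{q} \rVert \, \lVert \hat{d}_j \rVert}
```

Here \hat{d}_j equals row j of V_k. Because the similarity is a cosine between dense k-dimensional vectors rather than a sum over matching posting-list entries, it is worth thinking about how much of a term-matching scorer carries over.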
V. Submission & Grading
• Deadline: Dec. 3, 23:59
• The Challenge tasks are optional
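What to Submit
• Project report, containing the following parts:
  • Group members and division of work
  • The scoring formula Lucene uses for document matching;
  • How did you integrate the PageRank results?
    • Explain the approach; do not paste program code.
  • How well did the integration work? What further improvements did you try afterwards?
    • Support your answer with the two evaluation metrics
  • (Optional part) Briefly describe your approach to implementing the language model or LSI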
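What to Submit (Cont.)
• Code package
  • Must include at least the Lucene jar that implements the combined VSM and PageRank document ranking algorithm, with a note listing the modified files;
  • If you did a Challenge task, add an extra text file to the code package describing it;
• Submission format:
  • Pack the two parts above into a zip or rar archive, named:
  • (group name)_(project leader’s student ID).zip(rar)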
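Grading Policy
• Base score: 100
• Challenge 1: +30 bonus
• Challenge 2: +40 bonus
• A group that completes the work independently will receive at least 75% of the score
• Depending on how well the project is completed, the Project Leader receives a bonus of 0 to 20%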
Online References
• http://wiki.apache.org/nutch/NutchTutorial
• http://darksleep.com/lucene
• http://lucene.apache.org/java/2_1_0/