Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China
Outline • People • Projects • Systems • Research
People • Jianfeng Gao, Microsoft Research, China • Guihong Cao, Tianjin University, China • Hongzhao He, Tianjin University, China • Min Zhang, Tsinghua University, China • Jian-Yun Nie, Université de Montréal • Stephen Robertson, Microsoft Research, Cambridge • Stephen Walker, Microsoft Research, Cambridge
Systems • SMART (Master: Hongzhao) • Traditional IR system – VSM, TF-IDF • Holds a collection of more than 500 MB • Runs on Linux • Okapi (Master: Guihong) • Modern IR system – probabilistic model, BM25 • Holds a collection of more than 10 GB • Runs on Windows 2000
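Okapi's ranking is built on BM25. A minimal sketch of the BM25 weight of a single query term in a document, assuming the standard parameter defaults k1 = 1.2 and b = 0.75 (the statistics passed in are illustrative, not from the systems above):

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document.
    tf: term frequency in the document; df: document frequency of the term;
    doc_len / avg_doc_len: document length normalisation; num_docs: collection size."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A term occurring twice in an average-length document, appearing in 10 of 1000 docs:
score = bm25_score(tf=2, df=10, doc_len=100, avg_doc_len=100, num_docs=1000)
```

A document's score for a query is the sum of these weights over the query terms; the saturation in `norm` keeps very frequent terms from dominating.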
Projects • CLIR – TREC-9 ( Japanese NTCIR-3) • System: SMART • Focus: • Chinese Indexing Unit [Gao et al, 00] [Gao&He, 01] • Query translation [Gao et al, 01] • Web Retrieval – TREC-10 • System: Okapi • Focus: • Blind Feedback … [Zhang et al, 01] • Link-based retrieval (anchor text)… [Craswell et al, 01]
Research • Best indexing unit for Chinese IR • Query translation • Using link information for web retrieval • Blind feedback for web retrieval • Improving the effectiveness of IR with clustering and fusion
Best indexing unit for Chinese IR • Motivation • What is the basic unit of indexing in Chinese IR – word, n-gram, or a combination? • Does the accuracy of word segmentation have a significant impact on IR performance? • Experiment 1 – indexing units • Experiment 2 – the impact of word segmentation
Experiment 1 – settings • System – SMART (modified version) • Corpus – TREC-5&6 Chinese collection • Experiments • Impact of the dictionary – longest matching with a small dictionary vs. a large dictionary • Combining the first method with single characters • Using full segmentation • Using bi-grams and uni-grams (single characters) • Combining words with bi-grams and characters • Unknown word detection using NLPWin
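The bi-gram plus uni-gram indexing compared above can be sketched as follows (a toy illustration, not the modified SMART code; the function name is hypothetical):

```python
def chinese_index_units(text):
    """Index a Chinese string as single characters (uni-grams) plus
    overlapping character bi-grams, with no dictionary or segmenter."""
    chars = [c for c in text if not c.isspace()]
    unigrams = chars
    bigrams = [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
    return unigrams + bigrams

# "上海浦东" -> uni-grams 上, 海, 浦, 东 and bi-grams 上海, 海浦, 浦东
units = chinese_index_units("上海浦东")
```

Bi-grams approximate words (most Chinese words are two characters) while uni-grams guarantee coverage, which is why the combined schemes are competitive with dictionary-based segmentation.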
Experiment 1 – results • Best combination: word + character (+ bigram) + unknown words
Experiment 2 – settings • System • SMART system • Songrou's segmentation & evaluation system • Corpus • (1) TREC-5&6 Chinese IR collection • (2) Songrou's corpus • 12rst.txt, 181 KB • 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists) • (3) Samples from Songrou's corpus • test.txt, 20 KB (random sample from 12rst.txt) • standard.src, 28 KB (standard segmentation corresponding to test.txt)
Experiment 2 – results • Notes A: 1 Baseline; 2 Disambiguation; 3 Number; 4 Proper noun; 5 Suffix • Notes B: Feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)
Query Translation • Motivation – problems of simple lexicon-based approaches • The lexicon is incomplete • It is difficult to select correct translations • Solution – an improved lexicon-based approach • Term disambiguation using co-occurrence statistics • Phrase detection and translation using a language model (LM) • Translation coverage enhancement using a translation model (TM)
Term disambiguation • Assumption – correct translation words tend to co-occur in Chinese text • A greedy algorithm: • for English terms Te = (e1…en), • find their Chinese translations Tc = (c1…cn) such that Tc = argmax SIM(c1, …, cn) • Term-similarity matrix – trained on a Chinese corpus
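The selection step can be sketched as below. This toy version scores every combination of candidate translations by summed pairwise similarity; the greedy variant described above would instead fix one term at a time. The example lexicon entries and similarity values are hypothetical:

```python
from itertools import product

def disambiguate(candidates, sim):
    """Pick one Chinese translation per English term so that the summed
    pairwise co-occurrence similarity SIM(c1, ..., cn) is maximised.
    candidates: one list of candidate translations per English term.
    sim: callback (c1, c2) -> co-occurrence similarity score."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        score = sum(sim(a, b) for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return list(best)

# Toy lexicon: "bank" -> {银行 (bank), 岸 (shore)}, "interest" -> {利率 (rate)}.
# 银行 and 利率 co-occur often in Chinese text, so 银行 is chosen over 岸.
candidates = [["银行", "岸"], ["利率"]]
sim = lambda a, b: 5.0 if {a, b} == {"银行", "利率"} else 0.0
picked = disambiguate(candidates, sim)
```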
Phrase detection and translation • Multi-word phrases are detected by a base-NP detector • Translation patterns (PATTe), e.g. • <NOUN1 NOUN2> → <NOUN1 NOUN2> • <NOUN1 of NOUN2> → <NOUN2 NOUN1> • Phrase translation: • Tc = argmax P(OTc|PATTe)P(Tc) • P(OTc|PATTe): probability of the translation pattern • P(Tc): probability of the phrase under a Chinese LM
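A sketch of the pattern step, assuming the two example patterns above: reorder the word-by-word translation according to the detected English NP shape, then score by pattern probability times the Chinese LM probability. The pattern probabilities and LM here are illustrative stand-ins for the trained models:

```python
# English NP shape -> Chinese word order (reordering only; translation
# of the individual nouns comes from the lexicon).
PATTERNS = {
    "NOUN1 NOUN2": lambda n1, n2: (n1, n2),     # "information retrieval"
    "NOUN1 of NOUN2": lambda n1, n2: (n2, n1),  # "retrieval of information"
}

def translate_np(pattern, n1_c, n2_c, p_pattern, lm_prob):
    """Apply the reordering for `pattern` to the translated nouns, then
    score the Chinese phrase by P(OTc|PATTe) * P(Tc)."""
    order = PATTERNS[pattern](n1_c, n2_c)
    return order, p_pattern * lm_prob(order)

# "retrieval of information": NOUN1 = retrieval -> 检索, NOUN2 = information -> 信息
order, score = translate_np("NOUN1 of NOUN2", "检索", "信息", 0.8, lambda o: 0.5)
```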
Using translation model (TM) • Enhance the coverage of the lexicon • Using TM • Tc = argmax P(Te|Tc)SIM(Tc) • Mining parallel texts from the Web for TM training
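The TM-based selection combines the two models above: rank each candidate Chinese translation by its translation-model probability times the coherence of its terms. A minimal sketch, where both model callbacks are hypothetical stand-ins for the trained TM and similarity matrix:

```python
def translate_with_tm(te, candidates, p_te_given_tc, sim):
    """Choose Tc = argmax P(Te|Tc) * SIM(Tc).
    te: English query terms; candidates: candidate Chinese term tuples;
    p_te_given_tc: translation model P(Te|Tc); sim: coherence SIM(Tc)."""
    return max(candidates, key=lambda tc: p_te_given_tc(te, tc) * sim(tc))

# Toy example: two candidate translations of "information retrieval".
te = ("information", "retrieval")
candidates = [("信息", "检索"), ("消息", "恢复")]
p = lambda te, tc: 0.6 if tc == ("信息", "检索") else 0.4
chosen = translate_with_tm(te, candidates, p, lambda tc: 1.0)
```

Because the TM is trained on parallel text mined from the Web, it can propose translations missing from the hand-built lexicon.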
Experiments on TREC-5&6 • Monolingual • Simple translation: lexicon look-up • Best-sense translation: simple translation + manual selection of translations • Improved translation (our method) • Machine translation: using the IBM MT system
Summary of Experiments

  #  Translation Method             Avg. P.   % of Mono. IR
  1  Monolingual                    0.5150        –
  2  Simple translation (m-mode)    0.2722     52.85%
  3  Simple translation (u-mode)    0.3041     59.05%
  4  Best-sense translation         0.3762     73.05%
  5  Improved translation           0.3883     75.40%
  6  Machine translation            0.3891     75.55%
  7  5 + 6 combined                 0.4400     85.44%
Using link information for web retrieval • Motivation • The effectiveness of link-based retrieval • Evaluation on the TREC web collection • Link-based web retrieval – the state of the art • Recommendation – high in-degree is better • Topic locality – connected pages are similar • Anchor description – a page is represented by its anchor text • Link-based retrieval in TREC – no good results so far
Experiments on TREC-9 • Baseline – content-based IR • Anchor description • Used alone – much worse than the baseline • Combined with content description – marginal improvement • Re-ranking – marginal improvement • Spreading – no positive effect
Blind feedback for web retrieval • Motivation • Web queries are short • The web collection is huge and highly mixed • Blind feedback – refine web queries • Using the global web collection • Using a local web collection • Using another well-organized collection, e.g., Encarta
Experiments on TREC-9 • Baseline – 2-stage pseudo-relevance feedback (PRF) using the global web collection • Local context analysis [Xu et al., 96] – 2-stage PRF using a local web collection retrieved in the first stage • 2-stage PRF using the Encarta collection in the first stage
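One round of blind feedback can be sketched as follows: assume the top-ranked documents from the first stage are relevant and add their most frequent terms to the query before re-retrieving. This is a simplified term-frequency version (real systems typically weight expansion terms, e.g. Rocchio-style); the function name is illustrative:

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=10):
    """Blind (pseudo-relevance) feedback: expand the query with the
    n_terms most frequent terms from the top-ranked documents.
    top_docs: list of documents, each a list of tokens."""
    counts = Counter(t for doc in top_docs for t in doc)
    for t in query_terms:          # don't re-add original query terms
        counts.pop(t, None)
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion

# 2-stage PRF: retrieve with the short query, expand, retrieve again.
expanded = expand_query(["web"], [["web", "search", "engine"],
                                  ["search", "ranking"]], n_terms=2)
```

The variants on the slide differ only in which collection supplies `top_docs` in the first stage: the global web collection, a locally retrieved subset, or Encarta.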
Summary of Experiments • ???
Improving the effectiveness of IR with clustering and fusion • Clustering hypothesis – documents that are relevant to the same query are more similar to each other than to non-relevant documents, and can be clustered together. • Fusion hypothesis – different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents.
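The fusion hypothesis motivates merging ranked lists so that documents retrieved by several methods rise to the top. A minimal sketch using CombSUM-style fusion (one standard technique, not necessarily the one used here), with per-list max-normalisation of scores:

```python
def comb_sum(ranked_lists):
    """CombSUM fusion: sum each document's normalised scores across lists.
    ranked_lists: list of {doc_id: score} dicts, one per retrieval method.
    Documents found by several methods - by the fusion hypothesis, likely
    relevant - accumulate higher combined scores."""
    fused = {}
    for scores in ranked_lists:
        if not scores:
            continue
        top = max(scores.values()) or 1.0   # max-normalise each list
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + s / top
    return sorted(fused, key=fused.get, reverse=True)

# Doc "a" appears in both lists, so it outranks docs found by only one method.
merged = comb_sum([{"a": 2.0, "b": 1.0}, {"a": 1.0, "c": 1.0}])
```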