Information Retrieval at NLC Jianfeng Gao NLC Group, Microsoft Research China
Outline • People • Projects • Systems • Research
People • Jianfeng Gao, Microsoft Research, China • Guihong Cao, Tianjin University, China • Hongzhao He, Tianjin University, China • Min Zhang, Tsinghua University, China • Jian-Yun Nie, Université de Montréal • Stephen Robertson, Microsoft Research, Cambridge • Stephen Walker, Microsoft Research, Cambridge
Systems • SMART (Master: Hongzhao) • Traditional IR system – VSM, TF-IDF • Holds a collection of more than 500 MB • Runs on Linux • Okapi (Master: Guihong) • Modern IR system – probabilistic model, BM25 • Holds a collection of more than 10 GB • Runs on Windows 2000
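Okapi's ranking is built on BM25. A minimal sketch of the BM25 weight of a single query term in a document, assuming the standard parameter defaults k1 = 1.2 and b = 0.75 (the statistics passed in are illustrative, not from the systems above):

```python
import math

def bm25_score(tf, df, doc_len, avg_doc_len, num_docs, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document.
    tf: term frequency in the document; df: document frequency of the term;
    doc_len / avg_doc_len: document length normalisation; num_docs: collection size."""
    idf = math.log((num_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# A term occurring twice in an average-length document, appearing in 10 of 1000 docs:
score = bm25_score(tf=2, df=10, doc_len=100, avg_doc_len=100, num_docs=1000)
```

A document's score for a query is the sum of these weights over the query terms; the saturation in `norm` keeps very frequent terms from dominating.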
Projects • CLIR – TREC-9 ( Japanese NTCIR-3) • System: SMART • Focus: • Chinese Indexing Unit [Gao et al, 00] [Gao&He, 01] • Query translation [Gao et al, 01] • Web Retrieval – TREC-10 • System: Okapi • Focus: • Blind Feedback … [Zhang et al, 01] • Link-based retrieval (anchor text)… [Craswell et al, 01]
Research • Best indexing unit for Chinese IR • Query translation • Using link information for web retrieval • Blind feedback for web retrieval • Improving the effectiveness of IR with clustering and fusion
Best indexing unit for Chinese IR • Motivation • What is the basic unit of indexing in Chinese IR – word, n-gram, or a combination? • Does the accuracy of word segmentation have a significant impact on IR performance? • Experiment 1 – indexing units • Experiment 2 – the impact of word segmentation
Experiment 1 – settings • System – SMART (modified version) • Corpus – TREC-5&6 Chinese collection • Experiments • Impact of the dictionary – longest matching with a small dictionary vs. a large dictionary • Combining the first method with single characters • Using full segmentation • Using bi-grams and uni-grams (single characters) • Combining words with bi-grams and characters • Unknown word detection using NLPWin
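The bi-gram plus uni-gram indexing compared above can be sketched as follows (a toy illustration, not the modified SMART code; the function name is hypothetical):

```python
def chinese_index_units(text):
    """Index a Chinese string as single characters (uni-grams) plus
    overlapping character bi-grams, with no dictionary or segmenter."""
    chars = [c for c in text if not c.isspace()]
    unigrams = chars
    bigrams = [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]
    return unigrams + bigrams

# "上海浦东" -> uni-grams 上, 海, 浦, 东 and bi-grams 上海, 海浦, 浦东
units = chinese_index_units("上海浦东")
```

Bi-grams approximate words (most Chinese words are two characters) while uni-grams guarantee coverage, which is why the combined schemes are competitive with dictionary-based segmentation.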
Experiment 1 – results • Best combination: word + character (+ bigram) + unknown words
Experiment 2 – settings • System • SMART system • Songrou's segmentation & evaluation system • Corpus • (1) TREC-5&6 Chinese IR collection • (2) Songrou's corpus • 12rst.txt, 181 KB • 12rst.src, 250 KB (standard segmentation of 12rst.txt made by linguists) • (3) Samples from Songrou's corpus • test.txt, 20 KB (random sample from 12rst.txt) • standard.src, 28 KB (standard segmentation corresponding to test.txt)
Experiment 2 – results • Notes A: 1 Baseline; 2 Disambiguation; 3 Number; 4 Proper noun; 5 Suffix • Notes B: Feedback parameters are (10, 500, 0.5, 0.5) and (100, 500, 0.5, 0.5)
Query Translation • Motivation – problems of simple lexicon-based approaches • The lexicon is incomplete • It is difficult to select correct translations • Solution – an improved lexicon-based approach • Term disambiguation using co-occurrence statistics • Phrase detection and translation using a language model (LM) • Translation coverage enhancement using a translation model (TM)
Term disambiguation • Assumption – correct translation words tend to co-occur in Chinese text • A greedy algorithm: • for English terms Te = (e1…en), • find their Chinese translations Tc = (c1…cn) such that Tc = argmax SIM(c1, …, cn) • Term-similarity matrix – trained on a Chinese corpus
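The selection step can be sketched as below. This toy version scores every combination of candidate translations by summed pairwise similarity; the greedy variant described above would instead fix one term at a time. The example lexicon entries and similarity values are hypothetical:

```python
from itertools import product

def disambiguate(candidates, sim):
    """Pick one Chinese translation per English term so that the summed
    pairwise co-occurrence similarity SIM(c1, ..., cn) is maximised.
    candidates: one list of candidate translations per English term.
    sim: callback (c1, c2) -> co-occurrence similarity score."""
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        score = sum(sim(a, b) for i, a in enumerate(combo) for b in combo[i + 1:])
        if score > best_score:
            best, best_score = combo, score
    return list(best)

# Toy lexicon: "bank" -> {银行 (bank), 岸 (shore)}, "interest" -> {利率 (rate)}.
# 银行 and 利率 co-occur often in Chinese text, so 银行 is chosen over 岸.
candidates = [["银行", "岸"], ["利率"]]
sim = lambda a, b: 5.0 if {a, b} == {"银行", "利率"} else 0.0
picked = disambiguate(candidates, sim)
```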
Phrase detection and translation • Multi-word phrases are detected by a base-NP detector • Translation patterns (PATTe), e.g. • <NOUN1 NOUN2> → <NOUN1 NOUN2> • <NOUN1 of NOUN2> → <NOUN2 NOUN1> • Phrase translation: • Tc = argmax P(OTc|PATTe)P(Tc) • P(OTc|PATTe): probability of the translation pattern • P(Tc): probability of the phrase under a Chinese LM
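A sketch of the pattern step, assuming the two example patterns above: reorder the word-by-word translation according to the detected English NP shape, then score by pattern probability times the Chinese LM probability. The pattern probabilities and LM here are illustrative stand-ins for the trained models:

```python
# English NP shape -> Chinese word order (reordering only; translation
# of the individual nouns comes from the lexicon).
PATTERNS = {
    "NOUN1 NOUN2": lambda n1, n2: (n1, n2),     # "information retrieval"
    "NOUN1 of NOUN2": lambda n1, n2: (n2, n1),  # "retrieval of information"
}

def translate_np(pattern, n1_c, n2_c, p_pattern, lm_prob):
    """Apply the reordering for `pattern` to the translated nouns, then
    score the Chinese phrase by P(OTc|PATTe) * P(Tc)."""
    order = PATTERNS[pattern](n1_c, n2_c)
    return order, p_pattern * lm_prob(order)

# "retrieval of information": NOUN1 = retrieval -> 检索, NOUN2 = information -> 信息
order, score = translate_np("NOUN1 of NOUN2", "检索", "信息", 0.8, lambda o: 0.5)
```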
Using translation model (TM) • Enhance the coverage of the lexicon • Using TM • Tc = argmax P(Te|Tc)SIM(Tc) • Mining parallel texts from the Web for TM training
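The TM-based selection combines the two models above: rank each candidate Chinese translation by its translation-model probability times the coherence of its terms. A minimal sketch, where both model callbacks are hypothetical stand-ins for the trained TM and similarity matrix:

```python
def translate_with_tm(te, candidates, p_te_given_tc, sim):
    """Choose Tc = argmax P(Te|Tc) * SIM(Tc).
    te: English query terms; candidates: candidate Chinese term tuples;
    p_te_given_tc: translation model P(Te|Tc); sim: coherence SIM(Tc)."""
    return max(candidates, key=lambda tc: p_te_given_tc(te, tc) * sim(tc))

# Toy example: two candidate translations of "information retrieval".
te = ("information", "retrieval")
candidates = [("信息", "检索"), ("消息", "恢复")]
p = lambda te, tc: 0.6 if tc == ("信息", "检索") else 0.4
chosen = translate_with_tm(te, candidates, p, lambda tc: 1.0)
```

Because the TM is trained on parallel text mined from the Web, it can propose translations missing from the hand-built lexicon.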
Experiments on TREC-5&6 • Monolingual • Simple translation: lexicon look-up • Best-sense translation: simple translation + manual selection of translations • Improved translation (our method) • Machine translation: using the IBM MT system
Summary of Experiments

  #  Translation Method             Avg. P.   % of Mono. IR
  1  Monolingual                    0.5150        –
  2  Simple translation (m-mode)    0.2722     52.85%
  3  Simple translation (u-mode)    0.3041     59.05%
  4  Best-sense translation         0.3762     73.05%
  5  Improved translation           0.3883     75.40%
  6  Machine translation            0.3891     75.55%
  7  5 + 6 combined                 0.4400     85.44%
Using link information for web retrieval • Motivation • The effectiveness of link-based retrieval • Evaluation on the TREC web collection • Link-based web retrieval – the state of the art • Recommendation – high in-degree is better • Topic locality – connected pages are similar • Anchor description – a page is represented by its anchor text • Link-based retrieval in TREC – no good results so far
Experiments on TREC-9 • Baseline – content-based IR • Anchor description • Used alone – much worse than the baseline • Combined with content description – marginal improvement • Re-ranking – marginal improvement • Spreading – no positive effect
Blind feedback for web retrieval • Motivation • Web queries are short • The web collection is huge and highly mixed • Blind feedback – refine web queries • Using the global web collection • Using a local web collection • Using another well-organized collection, e.g., Encarta
Experiments on TREC-9 • Baseline – 2-stage pseudo-relevance feedback (PRF) using the global web collection • Local context analysis [Xu et al., 96] – 2-stage PRF using a local web collection retrieved in the first stage • 2-stage PRF using the Encarta collection in the first stage
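One round of blind feedback can be sketched as follows: assume the top-ranked documents from the first stage are relevant and add their most frequent terms to the query before re-retrieving. This is a simplified term-frequency version (real systems typically weight expansion terms, e.g. Rocchio-style); the function name is illustrative:

```python
from collections import Counter

def expand_query(query_terms, top_docs, n_terms=10):
    """Blind (pseudo-relevance) feedback: expand the query with the
    n_terms most frequent terms from the top-ranked documents.
    top_docs: list of documents, each a list of tokens."""
    counts = Counter(t for doc in top_docs for t in doc)
    for t in query_terms:          # don't re-add original query terms
        counts.pop(t, None)
    expansion = [t for t, _ in counts.most_common(n_terms)]
    return list(query_terms) + expansion

# 2-stage PRF: retrieve with the short query, expand, retrieve again.
expanded = expand_query(["web"], [["web", "search", "engine"],
                                  ["search", "ranking"]], n_terms=2)
```

The variants on the slide differ only in which collection supplies `top_docs` in the first stage: the global web collection, a locally retrieved subset, or Encarta.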
Summary of Experiments • ???
Improving the effectiveness of IR with clustering and fusion • Clustering hypothesis – documents that are relevant to the same query are more similar to each other than to non-relevant documents, and can be clustered together. • Fusion hypothesis – different ranked lists usually have a high overlap of relevant documents and a low overlap of non-relevant documents.
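The fusion hypothesis motivates merging ranked lists so that documents retrieved by several methods rise to the top. A minimal sketch using CombSUM-style fusion (one standard technique, not necessarily the one used here), with per-list max-normalisation of scores:

```python
def comb_sum(ranked_lists):
    """CombSUM fusion: sum each document's normalised scores across lists.
    ranked_lists: list of {doc_id: score} dicts, one per retrieval method.
    Documents found by several methods - by the fusion hypothesis, likely
    relevant - accumulate higher combined scores."""
    fused = {}
    for scores in ranked_lists:
        if not scores:
            continue
        top = max(scores.values()) or 1.0   # max-normalise each list
        for doc, s in scores.items():
            fused[doc] = fused.get(doc, 0.0) + s / top
    return sorted(fused, key=fused.get, reverse=True)

# Doc "a" appears in both lists, so it outranks docs found by only one method.
merged = comb_sum([{"a": 2.0, "b": 1.0}, {"a": 1.0, "c": 1.0}])
```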