1. TREC-9 CLIR Experiments at MSRCN
Jianfeng Gao, Microsoft Research China (MSRCN)
2. People Jianfeng Gao, Microsoft Research China
Jian-Yun Nie, Université de Montréal
Jian Zhang, Tsinghua University, China
Endong Xun, Microsoft Research China
Yi Su, Tsinghua University, China
Ming Zhou, Microsoft Research China
Changning Huang, Microsoft Research China
This is joint work of MSRCN, the Université de Montréal, and Tsinghua University. Here is the list of the people who contributed to this work.
3. What is TREC? A workshop series that provides the infrastructure for large-scale testing of text retrieval technology
Realistic test collection
Uniform, appropriate scoring procedures
A forum for the exchange of research ideas and for the discussion of research methodology
Sponsored by NIST, DARPA/ITO, ARDA
4. TREC-9 Task Tracks Cross-Language Information Retrieval (CLIR)
Filtering
Interactive
Query
Question Answering
Spoken Document Retrieval
Web Track
5. Given a topic in English, retrieve the top 1000 documents, ranked by similarity to the topic, from a collection of Chinese newspaper/newswire documents.
There were 9 groups who contributed to the design and actually ran the experiments. I also want to thank friends of the track who did not run experiments for their contributions to the design discussions. This talk is laid out as follows >>>
6. 25 English topics (CH55-79) created at NIST
Example:
<num> Number: CH55
<title> World Trade Organization membership
<desc> Description: What speculations on the effects of the entry of China or Taiwan into the World Trade Organization (WTO) are being reported in the Asian press?
<narr> Narrative: Documents reporting support by other nations for China's or Taiwan's entry into the World Trade Organization (WTO) are not relevant.
7. 126,937 documents; 188 MB;
Traditional Chinese, BIG5 encoding
Sources:
Hong Kong Commercial Daily: 11 Aug 98 - 31 Jul 99
Hong Kong Daily News: 1 Feb 99 - 31 Jul 99
Takungpao: 21 Oct 98 - 4 Mar 99
8. BBN Technologies
Fudan University
IBM T.J. Watson Research Center
Johns Hopkins University
Korea Advanced Institute of Science and Technology
Microsoft Research, China
MNIS-TextWise Labs
National Taiwan University
9. Queens College, CUNY
RMIT University
Telcordia Technologies, Inc.
The Chinese University of Hong Kong
Trans-EZ Inc.
University of California at Berkeley
University of Maryland
University of Massachusetts
10. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
In this talk, after a brief introduction to the system and the resources we used, I will describe our approaches in the TREC-9 experiments: (1) finding the best indexing units for Chinese IR, (2) query translation, and (3) query expansion. After that, I will show some experimental results and finally give the conclusion.
11. Introduction (1) Participate for the first time in TREC
System: a modified version of SMART
Pre-processing: word segmentation
The IR system we used is a modified version of the SMART system (the modifications were made to deal with Chinese). Once Chinese sentences have been segmented into separate items, traditional IR systems can be used to index them; these separate items are called "terms" in IR. The resources we used include:
12. Introduction (2) Our work involves two aspects:
Chinese IR
Finding the best indexing unit
Query expansion, etc.
CLIR: query translation
Translation disambiguation using co-occurrence
Phrase detection and translation using a language model
Translation coverage enhancement using a translation model
Resources
Lexicon: Chinese, bilingual (LDC, HIT, etc.);
Corpus: Chinese, bilingual;
Software tools: NLPWin, IBM MT, etc.
On Chinese monolingual retrieval, we investigated the use of different entities as indexes and pseudo-relevance feedback. On English-Chinese CLIR, our focus was on finding effective ways to translate queries. Our method incorporates three improvements over simple lexicon-based translation: (1) word/term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model.
13. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
14. Characteristics of Chinese IR Chinese language issues:
No standard definition of word and lexicon
No space between words
Word is the basic unit of indexing in traditional IR
In this study:
Basic unit of indexing in Chinese IR: word, n-gram, or mixed?
Does the accuracy of word segmentation have a significant impact on IR performance?
It is well known that there are two major differences between Chinese IR and IR in European languages. The first is that there is no standard definition of word and lexicon; the second is that there is no space between words. In written Chinese, however, the word is the basic unit of indexing in traditional IR. In this study we wish to answer two questions: what is the best indexing unit for Chinese IR, words or n-grams, and is it worthwhile to combine both? And does the accuracy of word segmentation have a significant impact on IR performance?
15. Indexing Units for Chinese IR Using n-grams
No linguistic knowledge required
Character unigrams and bigrams are widely used (the average length of a Chinese word is 1.6 characters)
Using words
Linguistic knowledge is required for word segmentation: a dictionary, heuristic rules, etc.
In previous studies, two kinds of indexing units have been used for Chinese IR: n-grams and words. The advantage of n-grams is that they require no linguistic knowledge. Character bigrams are widely used because the average length of a Chinese word is about 1.6 characters; longer n-grams are rarely used due to their high memory cost. Using words requires linguistic knowledge such as a dictionary and heuristic rules.
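Character bigram indexing as described here is simple enough to sketch directly; the following is a minimal illustration (the function name and example string are my own, not from the talk):

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams from a Chinese string.

    A real indexer would first strip punctuation and whitespace;
    this sketch assumes clean input.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "World Trade Organization" in Chinese, indexed as bigrams:
print(char_ngrams("世界贸易组织"))  # → ['世界', '界贸', '贸易', '易组', '组织']
```

No segmentation dictionary is needed, which is exactly the robustness to unknown words that the talk attributes to n-gram indexing.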
16. Possible representations in Chinese IR
To sum up, we can create three possible representations for a document and a query, as shown in this figure: words, characters, and bigrams. Documents and queries can be matched using the same representation, or across representations.
17. Experiments Impact of the dictionary: longest matching with a small dictionary and with a large dictionary
Combining the first method with single characters
Using full segmentation
Using bi-grams and uni-grams (characters)
Combining words with bi-grams and characters
Unknown word detection using NLPWin
We conducted a series of experiments in order to find the best indexing units for Chinese IR. The small dictionary contains 65,502 entries; the large dictionary contains 220K entries and is quite complete. A certain number of the entries in the large dictionary are expressions and suffix structures described earlier, such as dates and digits. To overcome the loss of recall caused by longest-match word segmentation, as mentioned earlier, we first add single characters, and we also extract the short words implied in long words (what we call full segmentation). We then report the results of using bigrams and unigrams. As bigrams and words have their own advantages, we try to combine them to benefit from both: theoretically, such a combination would yield better precision due to words, and increased robustness to unknown words due to n-grams.
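Longest-matching segmentation can be sketched as a greedy forward scan; this is a minimal version under stated assumptions (the single-character fallback mirrors the "words + characters" scheme, and the toy vocabulary is mine):

```python
def longest_match_segment(text, dictionary):
    """Greedy forward longest-match word segmentation.

    Falls back to single characters when no dictionary word matches,
    mirroring the 'words + characters' indexing scheme in the talk.
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown: index the single character
            i += 1
    return words

vocab = {"世界", "贸易", "组织"}
print(longest_match_segment("世界贸易组织", vocab))  # → ['世界', '贸易', '组织']
```

With an incomplete dictionary, out-of-vocabulary spans simply degrade to character unigrams rather than being lost, which is the recall-preserving behavior discussed above.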
18. Summary of Experiments
These experiments are summarized in this figure.
A better dictionary leads to very limited IR improvements.
Adding single characters is a more effective way to increase IR performance than increasing dictionary size.
Full segmentation is better than longest matching, but worse than adding single characters.
Using n-grams we obtained an average precision of 0.4254. This performance is comparable to the best performance we obtained using words, which is largely due to the robustness of n-grams to unknown words. The disadvantage of n-grams is that they produce a much larger number of indexing terms, making indexing very expensive in space and time.
As bigrams and words have their own advantages, we tried to combine them to benefit from both. Theoretically, such a combination would yield better precision due to words and increased robustness to unknown words due to n-grams. The result, however, was disappointing: it leads to only slight improvements over the uncombined cases, while the space and time costs more than double.
Adding unknown words does help.
In conclusion, the best approach for Chinese IR, and for CLIR with Chinese, is to use a combination of words and characters. This corresponds to the bold lines in this figure.
19. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
20. Query Translation Problems of simple lexicon-based approaches
Lexicon is incomplete
Difficult to select correct translations
Our improved lexicon-based approach
Term disambiguation using co-occurrence
Phrase detecting and translation using LM
Translation coverage enhancement using TM
The main problems of simple lexicon-based approaches are: (1) the lexicon used may be incomplete; and (2) it is difficult to select the correct translation from the lexicon. To address these issues, we used an improved lexicon-based query translation, which improves on simple lexicon lookup through three methods: (1) term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model.
21. Term disambiguation
Assumption: correct translation words tend to co-occur in Chinese text
A greedy algorithm:
for English terms Te = (e1, …, en),
find their Chinese translations Tc = (c1, …, cn) such that Tc = argmax SIM(c1, …, cn)
Term-similarity matrix trained on a Chinese corpus
It is assumed that the correct translation words of a query tend to co-occur in target-language documents. We use a greedy algorithm to choose the best translation words: for every query word, we choose the translation that has the highest similarity with the other translation words.
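A hedged sketch of the greedy selection: the exact scoring used in the system is not specified here, so this version simply scores each candidate against the best-matching candidate of every other query term (the `sim` function stands in for the term-similarity matrix; names and data are illustrative):

```python
def disambiguate(candidates, sim):
    """Greedy translation disambiguation via co-occurrence similarity.

    candidates: one list of candidate Chinese translations per English term.
    sim(a, b): similarity of two Chinese terms, e.g. looked up in a
               term-similarity matrix trained on a Chinese corpus.
    For each term, pick the candidate with the highest total similarity
    to the candidates of the other terms.
    """
    chosen = []
    for i, cands in enumerate(candidates):
        def score(c):
            return sum(
                max((sim(c, o) for o in others), default=0.0)
                for j, others in enumerate(candidates) if j != i
            )
        chosen.append(max(cands, key=score))
    return chosen
```

For a query like "China WTO", translations that genuinely co-occur in Chinese text reinforce each other, while a spurious sense of either word scores low against everything else.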
22. Phrase detection and translation
Multi-word phrases are detected by BaseNP identification [Xun, 2000]
Translation patterns (PATTe), e.g.
<NOUN1 NOUN2> → <NOUN1 NOUN2>
<NOUN1 of NOUN2> → <NOUN2 NOUN1>
Phrase translation:
Tc = argmax P(OTc|PATTe) P(Tc)
P(OTc|PATTe): probability of the translation pattern
P(Tc): probability of the phrase under the Chinese LM
In addition, we try to incorporate phrase translation to improve translation quality. We define a set of translation patterns between English and Chinese. For example, a <NOUN1 NOUN2> phrase is usually translated into the <NOUN1 NOUN2> sequence in Chinese, so from an English phrase we can guess a Chinese phrase translation. If several patterns can be applied, we use a statistical model to choose the best one: the first factor in the equation is the probability of the translation phrase being generated from a pattern, and the second is the probability of the phrase under a Chinese language model.
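The pattern-based argmax can be illustrated with a toy version; the pattern table, probabilities, and language-model scores below are invented purely for illustration and are not the system's actual values:

```python
# Hypothetical pattern table: for each English pattern, candidate Chinese
# slot orderings with their pattern probabilities P(order | pattern).
patterns = {
    ("NOUN1", "of", "NOUN2"): [
        (("NOUN2", "NOUN1"), 0.7),  # e.g. "membership of WTO" → 世贸组织 成员
        (("NOUN1", "NOUN2"), 0.3),
    ],
}

def lm_prob(tokens):
    """Stand-in for a Chinese language model; returns a toy P(Tc)."""
    toy = {("世贸组织", "成员"): 0.02, ("成员", "世贸组织"): 0.001}
    return toy.get(tuple(tokens), 1e-6)

def translate_phrase(slots, pattern):
    """Pick Tc = argmax P(order | pattern) * P_LM(Tc) over the orderings."""
    best, best_score = None, 0.0
    for order, p_pattern in patterns[pattern]:
        tc = [slots[slot] for slot in order]
        score = p_pattern * lm_prob(tc)
        if score > best_score:
            best, best_score = tc, score
    return best

# "membership of WTO": NOUN1 = 成员, NOUN2 = 世贸组织
print(translate_phrase({"NOUN1": "成员", "NOUN2": "世贸组织"},
                       ("NOUN1", "of", "NOUN2")))  # → ['世贸组织', '成员']
```

The two factors play the same roles as in the slide's equation: the pattern probability proposes a word order, and the language model confirms that the resulting Chinese phrase is fluent.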
23. Using translation model (TM) Enhance the coverage of the lexicon
Using TM
Tc = argmax P(Te|Tc)SIM(Tc)
Mining parallel texts from the Web for TM training
The translations stored in lexicons are always limited, while parallel texts may contain additional translations. We therefore used a statistical translation model, trained on parallel texts automatically mined from the Web.
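One way the translation-model lookup might be layered on top of the lexicon, sketched under stated assumptions (the probability-table format, cutoff, and values are hypothetical, not the actual trained model):

```python
def tm_translations(term, t_prob, k=3, threshold=0.05):
    """Augment lexicon coverage with a statistical translation model.

    t_prob: dict mapping an English term to {chinese_term: probability}
    lexical translation probabilities, e.g. trained on Web-mined parallel
    text. Returns up to k candidates whose probability clears a cutoff.
    """
    cands = t_prob.get(term, {})
    ranked = sorted(cands.items(), key=lambda kv: kv[1], reverse=True)
    return [c for c, p in ranked[:k] if p >= threshold]
```

Candidates returned this way can then be fed into the same co-occurrence selection (SIM) as the lexicon entries, which is the shape of the Tc = argmax P(Te|Tc) SIM(Tc) combination on the slide.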
24. Experiments on TREC-5&6 Monolingual
Simple translation: lexicon looking up
Best-sense translation: simple translation + manually selecting the best translation
Improved translation (our method)
Machine translation: using the IBM MT system
We carried out a series of tests to compare our improved method with the four cases above.
25. Summary of Experiments
The results on query translation are summarized in this table. As expected, the simple translation methods are not very good: their performance is lower than 60% of the monolingual performance. The best-sense method achieves 73.05% of monolingual effectiveness, yet it is still worse than our improved translation method, which achieves 75.40%. The IBM MT system is very powerful, achieving 75.55% of monolingual effectiveness. Thus the most powerful commercial machine translation system performs almost the same as our improved method, which indicates the effectiveness of our approach. The best performance, over 85% of monolingual effectiveness, is achieved by combining the two sets of translated queries obtained from the machine translation method and the improved translation method.
26. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
27. Query Expansion (QE) Pseudo-relevance feedback
Top-ranked documents (n)
Term selection (m)
Term weighting (w)
Document length normalization
Sub-document (500 characters)
Pre-translation QE and post-translation QE
A popular method of query expansion is pseudo-relevance feedback. Some optimal parameter settings need to be decided: n (top-ranked documents), m (selected terms), and w (term weighting). We also investigated the impact of document-length normalization by dividing each document into sub-documents. In CLIR, queries can be expanded prior to translation, after translation, or both before and after translation.
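Pseudo-relevance feedback with the n/m/w parameters can be sketched as follows; scoring expansion terms by raw frequency is a simplification (the actual runs tuned these parameters per weighting scheme), and the defaults here are illustrative:

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, n=10, m=5, w=0.5):
    """Pseudo-relevance feedback: expand a query with frequent terms
    from the top-n retrieved documents.

    ranked_docs: list of token lists, already ranked by the first-pass
    retrieval. Term scoring by raw frequency is a simplification of
    whatever term-selection formula a real system would use.
    """
    counts = Counter()
    for doc in ranked_docs[:n]:
        counts.update(doc)
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:m]
    # Weight original terms by w and expansion terms by 1 - w.
    return {**{t: w for t in query_terms},
            **{t: 1 - w for t in expansion}}
```

Pre-translation QE runs this loop against an English collection (FBIS, in the next slide) before translating; post-translation QE runs it against the Chinese target collection after translating.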
28. Experiments on TREC-5&6 (1) Post-translation QE
ltu: n=10, m=300, w=0.6/0.4
ltc: n=20, m=500, w=0.3/0.7
For post-translation QE, we tried different combinations of weighting schemes and parameter settings. The best result among our tests is shown in the last line.
29. Experiments on TREC-5&6 (2) Pre-translation QE
English collection FBIS
ltu: n=10, m=10, w=0.5/0.5
For pre-translation QE, we use the English query to retrieve English documents from the FBIS collection; the top-ranked documents are used for expansion. It turns out that using both pre- and post-translation QE gives the best result.
30. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
31. Experiments in TREC 9
We submitted three official runs. The first is monolingual; the two others are cross-language runs, both using the combined translation method (i.e., our improved translation method combined with the MT results). Unlike in the TREC-5&6 case, pre-translation QE does not help. Interestingly, the CLIR runs are better than the monolingual run. There might be two reasons: first, we combined several translation results, so we can find more correct translation words; second, our translation words are sometimes better than the provided manual translations. After the official submission, we evaluated our improved translation method and the IBM MT method separately; it turns out again that their performances are similar.
32. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
33. Conclusion Best indexing unit for Chinese IR
Words + characters + unknown words
Improved lexicon-based query translation
Translation disambiguation using co-occurrence
Phrase detection and translation using a language model
Translation coverage enhancement using a translation model
Query expansion
34. Conclusion TREC 9
Pre-translation QE does not help
Our approach achieves the same effectiveness as the IBM MT system
The best result is obtained by combining the IBM MT system and our approach
Further analysis shows that OOV words are still the bottleneck for improving CLIR performance
35. Thanks! More information: jfgao@microsoft.com, mingzhou@microsoft.com