1. TREC-9 CLIR Experiments at MSRCN
Jianfeng Gao, Microsoft Research China (MSRCN)
2. People Jianfeng Gao, Microsoft Research China
Jian-Yun Nie, Université de Montréal
Jian Zhang, Tsinghua University, China
Endong Xun, Microsoft Research China
Yi Su, Tsinghua University, China
Ming Zhou, Microsoft Research China
Changning Huang, Microsoft Research China
This is joint work of MSRCN, the Université de Montréal, and Tsinghua University. Here is the list of the people who contributed to this work.
3. What is TREC? A workshop series that provides the infrastructure for large-scale testing of text retrieval technology
Realistic test collection
Uniform, appropriate scoring procedures
A forum for the exchange of research ideas and for the discussion of research methodology
Sponsored by NIST, DARPA/ITO, ARDA
4. TREC-9 Task Tracks Cross-Language Information Retrieval (CLIR)
Filtering
Interactive
Query
Question Answering
Spoken Document Retrieval
Web Track
5. Given a topic in English, retrieve the top 1000 documents, ranked by similarity to the topic, from a collection of Chinese newspaper/newswire documents.
There were 9 groups who contributed to the design and actually ran the experiments. I also want to thank friends of the track who did not run experiments for their contributions to the design discussions. This talk is laid out as follows >>>
6. 25 English topics (CH55-79) created at NIST
Example:
<num> Number: CH55
<title> World Trade Organization membership
<desc> Description: What speculations on the effects of the entry of China or Taiwan into the World Trade Organization (WTO) are being reported in the Asian press?
<narr> Narrative: Documents reporting support by other nations for China's or Taiwan's entry into the World Trade Organization (WTO) are not relevant.
7. 126,937 documents; 188 MB;
Traditional Chinese, BIG5 encoding
Sources:
Hong Kong Commercial Daily: 11 Aug 98 - 31 Jul 99
Hong Kong Daily News: 1 Feb 99 - 31 Jul 99
Takungpao: 21 Oct 98 - 4 Mar 99
8. BBN Technologies
Fudan University
IBM T.J. Watson Research Center
Johns Hopkins University
Korea Advanced Institute of Science and Technology
Microsoft Research, China
MNIS-TextWise Labs
National Taiwan University
9. Queens College, CUNY
RMIT University
Telcordia Technologies, Inc.
The Chinese University of Hong Kong
Trans-EZ Inc.
University of California at Berkeley
University of Maryland
University of Massachusetts
10. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
In this talk, after a brief introduction to the system and the resources we used, I will describe our approaches in the TREC-9 experiments: (1) finding the best indexing units for Chinese IR, (2) query translation, and (3) query expansion. After that, I will show some experimental results and finally give the conclusion.
11. Introduction (1) Participate for the first time in TREC
System: a modified version of SMART
Pre-processing: word segmentation
The IR system we used is a modified version of the SMART system (the modifications were made to deal with Chinese). Once Chinese sentences have been segmented into separate items, traditional IR systems can be used to index them; these separate items are called "terms" in IR. The resources we used include:
12. Introduction (2) Our work involves two aspects:
Chinese IR
Finding the best indexing unit
Query expansion, etc.
CLIR: query translation
Translation disambiguation using co-occurrence
Phrase detection and translation using a language model
Translation coverage enhancement using a translation model
Resources
Lexicon: Chinese, bilingual (LDC, HIT, etc.);
Corpus: Chinese, bilingual;
Software tools: NLPWin, IBM MT, etc.
On Chinese monolingual retrieval, we investigated the use of different entities as indexes and pseudo-relevance feedback. On English-Chinese CLIR, our focus was on finding effective ways to translate queries. Our method incorporates three improvements over simple lexicon-based translation: (1) word/term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model.
13. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
14. Characteristics of Chinese IR Chinese language issues:
No standard definition of word and lexicon
No space between words
Word is the basic unit of indexing in traditional IR
In this study:
Basic unit of indexing in Chinese IR: word, n-gram, or mixed?
Does the accuracy of word segmentation have a significant impact on IR performance?
It is well known that there are two major differences between Chinese IR and IR in European languages. The first is that there is no standard definition of word and lexicon; the second is that there is no space between words. In written Chinese, however, the word is the basic unit of indexing in traditional IR. In this study we wish to answer two questions: what is the best indexing unit for Chinese IR, words or n-grams, and is it worthwhile to combine both? And does the accuracy of word segmentation have a significant impact on IR performance?
15. Indexing Units for Chinese IR Using n-grams
No linguistic knowledge required
Character unigrams and bigrams are widely used (the average length of a Chinese word is 1.6 characters)
Using words
Linguistic knowledge is required for word segmentation: a dictionary, heuristic rules, etc.
In previous studies, two kinds of indexing units have been used for Chinese IR: n-grams and words. The advantage of n-grams is that they require no linguistic knowledge. Character bigrams are widely used because the average length of a Chinese word is about 1.6 characters; longer n-grams are rarely used due to their high memory cost. Using words requires linguistic knowledge such as a dictionary and heuristic rules.
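Character bigram indexing as described here is simple enough to sketch directly; the following is a minimal illustration (the function name and example string are my own, not from the talk):

```python
def char_ngrams(text, n=2):
    """Return overlapping character n-grams from a Chinese string.

    A real indexer would first strip punctuation and whitespace;
    this sketch assumes clean input.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# "World Trade Organization" in Chinese, indexed as bigrams:
print(char_ngrams("世界贸易组织"))  # → ['世界', '界贸', '贸易', '易组', '组织']
```

No segmentation dictionary is needed, which is exactly the robustness to unknown words that the talk attributes to n-gram indexing.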
16. Possible representations in Chinese IR
To sum up, we can create three possible representations for a document and a query, as shown in this figure: words, characters, and bigrams. Documents and queries can be matched using the same representation, or across representations.
17. Experiments Impact of the dictionary: longest matching with a small dictionary and with a large dictionary
Combining the first method with single characters
Using full segmentation
Using bi-grams and uni-grams (characters)
Combining words with bi-grams and characters
Unknown word detection using NLPWin
We conducted a series of experiments in order to find the best indexing units for Chinese IR. The small dictionary contains 65,502 entries; the large dictionary contains 220K entries and is quite complete. A certain number of the entries in the large dictionary are expressions and suffix structures described earlier, such as dates and digits. To overcome the loss of recall caused by longest-match word segmentation, as mentioned earlier, we first add single characters, and we also extract the short words implied in long words (what we call full segmentation). We then report the results of using bigrams and unigrams. As bigrams and words have their own advantages, we try to combine them to benefit from both: theoretically, such a combination would yield better precision due to words, and increased robustness to unknown words due to n-grams.
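Longest-matching segmentation can be sketched as a greedy forward scan; this is a minimal version under stated assumptions (the single-character fallback mirrors the "words + characters" scheme, and the toy vocabulary is mine):

```python
def longest_match_segment(text, dictionary):
    """Greedy forward longest-match word segmentation.

    Falls back to single characters when no dictionary word matches,
    mirroring the 'words + characters' indexing scheme in the talk.
    """
    words, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in dictionary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])  # unknown: index the single character
            i += 1
    return words

vocab = {"世界", "贸易", "组织"}
print(longest_match_segment("世界贸易组织", vocab))  # → ['世界', '贸易', '组织']
```

With an incomplete dictionary, out-of-vocabulary spans simply degrade to character unigrams rather than being lost, which is the recall-preserving behavior discussed above.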
18. Summary of Experiments
These experiments are summarized in this figure.
A better dictionary leads to very limited IR improvements.
Adding single characters is a more effective way to increase IR performance than increasing dictionary size.
Full segmentation is better than longest matching, but worse than adding single characters.
Using n-grams we obtained an average precision of 0.4254. This performance is comparable to the best performance we obtained using words, which is largely due to the robustness of n-grams to unknown words. The disadvantage of n-grams is that they produce a much larger number of indexing terms, making indexing very expensive in space and time.
As bigrams and words have their own advantages, we tried to combine them to benefit from both. Theoretically, such a combination would yield better precision due to words and increased robustness to unknown words due to n-grams. The result, however, was disappointing: it leads to only slight improvements over the uncombined cases, while the space and time costs more than double.
Adding unknown words does help.
In conclusion, the best approach for Chinese IR, and for CLIR with Chinese, is to use a combination of words and characters. This corresponds to the bold lines in this figure.
19. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
20. Query Translation Problems of simple lexicon-based approaches
Lexicon is incomplete
Difficult to select correct translations
Our improved lexicon-based approach
Term disambiguation using co-occurrence
Phrase detecting and translation using LM
Translation coverage enhancement using TM
The main problems of simple lexicon-based approaches are: (1) the lexicon used may be incomplete; and (2) it is difficult to select the correct translation from the lexicon. To address these issues, we used an improved lexicon-based query translation, which improves on simple lexicon lookup through three methods: (1) term disambiguation using co-occurrence, (2) phrase detection and translation using a statistical language model, and (3) translation coverage enhancement using a statistical translation model.
21. Term disambiguation
Assumption: correct translation words tend to co-occur in Chinese text
A greedy algorithm:
for English terms Te = (e1, …, en),
find their Chinese translations Tc = (c1, …, cn) such that Tc = argmax SIM(c1, …, cn)
Term-similarity matrix trained on a Chinese corpus
It is assumed that the correct translation words of a query tend to co-occur in target-language documents. We use a greedy algorithm to choose the best translation words: for every query word, we choose the translation that has the highest similarity with the other translation words.
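A hedged sketch of the greedy selection: the exact scoring used in the system is not specified here, so this version simply scores each candidate against the best-matching candidate of every other query term (the `sim` function stands in for the term-similarity matrix; names and data are illustrative):

```python
def disambiguate(candidates, sim):
    """Greedy translation disambiguation via co-occurrence similarity.

    candidates: one list of candidate Chinese translations per English term.
    sim(a, b): similarity of two Chinese terms, e.g. looked up in a
               term-similarity matrix trained on a Chinese corpus.
    For each term, pick the candidate with the highest total similarity
    to the candidates of the other terms.
    """
    chosen = []
    for i, cands in enumerate(candidates):
        def score(c):
            return sum(
                max((sim(c, o) for o in others), default=0.0)
                for j, others in enumerate(candidates) if j != i
            )
        chosen.append(max(cands, key=score))
    return chosen
```

For a query like "China WTO", translations that genuinely co-occur in Chinese text reinforce each other, while a spurious sense of either word scores low against everything else.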
22. Phrase detection and translation
Multi-word phrases are detected by BaseNP identification [Xun, 2000]
Translation patterns (PATTe), e.g.
<NOUN1 NOUN2> → <NOUN1 NOUN2>
<NOUN1 of NOUN2> → <NOUN2 NOUN1>
Phrase translation:
Tc = argmax P(OTc|PATTe) P(Tc)
P(OTc|PATTe): probability of the translation pattern
P(Tc): probability of the phrase under the Chinese LM
In addition, we try to incorporate phrase translation to improve translation quality. We define a set of translation patterns between English and Chinese. For example, a <NOUN1 NOUN2> phrase is usually translated into the <NOUN1 NOUN2> sequence in Chinese, so from an English phrase we can guess a Chinese phrase translation. If several patterns can be applied, we use a statistical model to choose the best one: the first factor in the equation is the probability of the translation phrase being generated from a pattern, and the second is the probability of the phrase under a Chinese language model.
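The pattern-based argmax can be illustrated with a toy version; the pattern table, probabilities, and language-model scores below are invented purely for illustration and are not the system's actual values:

```python
# Hypothetical pattern table: for each English pattern, candidate Chinese
# slot orderings with their pattern probabilities P(order | pattern).
patterns = {
    ("NOUN1", "of", "NOUN2"): [
        (("NOUN2", "NOUN1"), 0.7),  # e.g. "membership of WTO" → 世贸组织 成员
        (("NOUN1", "NOUN2"), 0.3),
    ],
}

def lm_prob(tokens):
    """Stand-in for a Chinese language model; returns a toy P(Tc)."""
    toy = {("世贸组织", "成员"): 0.02, ("成员", "世贸组织"): 0.001}
    return toy.get(tuple(tokens), 1e-6)

def translate_phrase(slots, pattern):
    """Pick Tc = argmax P(order | pattern) * P_LM(Tc) over the orderings."""
    best, best_score = None, 0.0
    for order, p_pattern in patterns[pattern]:
        tc = [slots[slot] for slot in order]
        score = p_pattern * lm_prob(tc)
        if score > best_score:
            best, best_score = tc, score
    return best

# "membership of WTO": NOUN1 = 成员, NOUN2 = 世贸组织
print(translate_phrase({"NOUN1": "成员", "NOUN2": "世贸组织"},
                       ("NOUN1", "of", "NOUN2")))  # → ['世贸组织', '成员']
```

The two factors play the same roles as in the slide's equation: the pattern probability proposes a word order, and the language model confirms that the resulting Chinese phrase is fluent.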
23. Using translation model (TM) Enhance the coverage of the lexicon
Using TM
Tc = argmax P(Te|Tc)SIM(Tc)
Mining parallel texts from the Web for TM training
The translations stored in lexicons are always limited, while parallel texts may contain additional translations. We therefore used a statistical translation model, trained on parallel texts automatically mined from the Web.
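One way the translation-model lookup might be layered on top of the lexicon, sketched under stated assumptions (the probability-table format, cutoff, and values are hypothetical, not the actual trained model):

```python
def tm_translations(term, t_prob, k=3, threshold=0.05):
    """Augment lexicon coverage with a statistical translation model.

    t_prob: dict mapping an English term to {chinese_term: probability}
    lexical translation probabilities, e.g. trained on Web-mined parallel
    text. Returns up to k candidates whose probability clears a cutoff.
    """
    cands = t_prob.get(term, {})
    ranked = sorted(cands.items(), key=lambda kv: kv[1], reverse=True)
    return [c for c, p in ranked[:k] if p >= threshold]
```

Candidates returned this way can then be fed into the same co-occurrence selection (SIM) as the lexicon entries, which is the shape of the Tc = argmax P(Te|Tc) SIM(Tc) combination on the slide.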
24. Experiments on TREC-5&6 Monolingual
Simple translation: lexicon looking up
Best-sense translation: simple translation + manually selecting the best translation
Improved translation (our method)
Machine translation: using the IBM MT system
We carried out a series of tests to compare our improved method with the four cases above.
25. Summary of Experiments
The results on query translation are summarized in this table. As expected, the simple translation methods are not very good: their performance is lower than 60% of the monolingual performance. The best-sense method achieves 73.05% of monolingual effectiveness, yet it is still worse than our improved translation method, which achieves 75.40%. The IBM MT system is very powerful, achieving 75.55% of monolingual effectiveness. Thus the most powerful commercial machine translation system performs almost the same as our improved method, which indicates the effectiveness of our approach. The best performance, over 85% of monolingual effectiveness, is achieved by combining the two sets of translated queries obtained from the machine translation method and the improved translation method.
26. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
27. Query Expansion (QE) Pseudo-relevance feedback
Top-ranked documents (n)
Term selection (m)
Term weighting (w)
Document length normalization
Sub-document (500 characters)
Pre-translation QE and post-translation QE
A popular method of query expansion is pseudo-relevance feedback. Some optimal parameter settings need to be decided: n (top-ranked documents), m (selected terms), and w (term weighting). We also investigated the impact of document-length normalization by dividing each document into sub-documents. In CLIR, queries can be expanded prior to translation, after translation, or both before and after translation.
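Pseudo-relevance feedback with the n/m/w parameters can be sketched as follows; scoring expansion terms by raw frequency is a simplification (the actual runs tuned these parameters per weighting scheme), and the defaults here are illustrative:

```python
from collections import Counter

def prf_expand(query_terms, ranked_docs, n=10, m=5, w=0.5):
    """Pseudo-relevance feedback: expand a query with frequent terms
    from the top-n retrieved documents.

    ranked_docs: list of token lists, already ranked by the first-pass
    retrieval. Term scoring by raw frequency is a simplification of
    whatever term-selection formula a real system would use.
    """
    counts = Counter()
    for doc in ranked_docs[:n]:
        counts.update(doc)
    expansion = [t for t, _ in counts.most_common()
                 if t not in query_terms][:m]
    # Weight original terms by w and expansion terms by 1 - w.
    return {**{t: w for t in query_terms},
            **{t: 1 - w for t in expansion}}
```

Pre-translation QE runs this loop against an English collection (FBIS, in the next slide) before translating; post-translation QE runs it against the Chinese target collection after translating.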
28. Experiments on TREC-5&6 (1) Post-translation QE
ltu: n=10, m=300, w=0.6/0.4
ltc: n=20, m=500, w=0.3/0.7
For post-translation QE, we tried different combinations of weighting schemes and parameter settings. The best result among our tests is shown in the last line.
29. Experiments on TREC-5&6 (2) Pre-translation QE
English collection FBIS
ltu: n=10, m=10, w=0.5/0.5
For pre-translation QE, we use the English query to retrieve English documents from the FBIS collection; the top-ranked documents are used for expansion. It turns out that using both pre- and post-translation QE gives the best result.
30. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
31. Experiments in TREC 9
We submitted three official runs. The first is monolingual; the two others are cross-language runs, both using the combined translation method (i.e., our improved translation method combined with the MT results). Unlike in the TREC-5&6 case, pre-translation QE does not help. Interestingly, the CLIR runs are better than the monolingual run. There might be two reasons: first, we combined several translation results, so we can find more correct translation words; second, our translation words are sometimes better than the provided manual translations. After the official submission, we evaluated our improved translation method and the IBM MT method separately; it turns out again that their performances are similar.
32. Outline Introduction
Finding the best indexing units for Chinese IR
Query translation
Query expansion
Experimental results in TREC 9
Conclusion
33. Conclusion Best indexing unit for Chinese IR
Words + characters + unknown words
Improved lexicon-based query translation
Translation disambiguation using co-occurrence
Phrase detection and translation using a language model
Translation coverage enhancement using a translation model
Query expansion
34. Conclusion TREC 9
Pre-translation QE does not help
Our approach achieves the same effectiveness as the IBM MT system
The best result is obtained by combining the IBM MT system and our approach
Further analysis shows that OOV words are still the bottleneck for improving CLIR performance
35. Thanks! More information: jfgao@microsoft.com, mingzhou@microsoft.com