Exploiting Proximity Feature in Statistical Translation Models for Information Retrieval
Presenter: 101598035 邱威霖
Reference • CIKM '13: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management • Xinhui Tu, Jing Luo, Maofu Liu, College of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China • Bo Li, Tingting He, Department of Computer Science, Central China Normal University, Wuhan, China
Outline • Introduction • Statistical Translation Model For Retrieval • Proximity-Based Translation Language Model • Experiments • Conclusions
Introduction • In this paper, they study how to explicitly incorporate proximity information into the existing translation language model, and propose a proximity-based translation language model, called TM-P, with three variants. • In the TM-P models, a new concept is introduced to model the proximity of word co-occurrences, which is then used to estimate translation probabilities.
Statistical Translation Model For Retrieval • In the basic language modelling approach, documents are ranked by the probability that the query text could be generated by the document language model. • Given a query q1, q2, q3, … qm and a document d, the query likelihood scoring function is as follows:
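The equation image from the slide is not preserved; a minimal reconstruction, assuming the standard unigram query-likelihood form:

$$ p(q \mid d) = \prod_{i=1}^{m} p(q_i \mid d) $$

Here p(q_i | d) is the probability of query term q_i under the language model of document d; in practice the log of this product is used for ranking.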
Statistical Translation Model For Retrieval • In the translation language modelling approach, the document model p(w|d) can be calculated by using the following "translation document model":
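The "translation document model" equation is likewise missing from the extracted slide; the standard form of this model (a reconstruction, not copied from the paper) is:

$$ p(w \mid d) = \sum_{u \in d} p_t(w \mid u)\, p_{ml}(u \mid d) $$

where p_t(w | u) is the probability of translating document word u into w, and p_ml(u | d) is the maximum-likelihood estimate of u in d.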
Statistical Translation Model For Retrieval • In this way, a word can be translated into its semantically related words with non-zero probability, which allows us to score a document by counting the matches between a query word and its semantically related words in the document.
Proximity-Based Translation Language Model • Previous studies have shown that the translation language model works better with Dirichlet prior smoothing [4][5]. • Therefore, in the rest of the paper, we focus only on the translation language model with Dirichlet prior smoothing, as follows:
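The smoothed model is not shown in the extracted text; the usual Dirichlet-smoothed translation model (a reconstruction under that assumption) is:

$$ p(w \mid d) = \frac{|d|}{|d| + \mu} \sum_{u \in d} p_t(w \mid u)\, p_{ml}(u \mid d) + \frac{\mu}{|d| + \mu}\, p(w \mid C) $$

where μ is the Dirichlet prior parameter and p(w | C) is the collection language model.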
Proximity-Based Translation Language Model • Estimating Translation Probability • They introduce a new concept, namely proximity-based word co-occurrence frequency (pcf), to model the proximity feature of co-occurrences.
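The slide defining pcf is not preserved. One plausible sketch, assuming a Gaussian kernel over the word-pair distance (suggested by the kernel parameter σ discussed in the experiments, but not necessarily the authors' exact definition):

$$ pcf(w, u, d) = \exp\!\left( -\frac{dist(w, u, d)^2}{2\sigma^2} \right) $$

Under this form, pairs that occur close together in d contribute a co-occurrence weight near 1, and the weight decays toward 0 as the distance grows.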
Proximity-Based Translation Language Model • In this paper, three commonly used distance measures are adopted to calculate dist(w,u,d); a small code sketch of all three measures follows the examples below. • Minimum pair distance • Average pair distance • Average minimum pair distance
Minimum pair distance • It is defined as the minimum distance between any occurrences of w and u in document d. • In the running example, w occurs at positions {1, 5, 9} and u at positions {2, 6} in d (these positions are implied by the calculations below), so dist(w,u,d) = |1 − 2| = 1.
Average pair distance • It is defined as the average distance between w and u over all position combinations in d. • In the example, the distances from the first occurrence of w (at position 1) to the occurrences of u (at positions 2 and 6) are 1 and 5. The same is computed for the next occurrence of w (at position 5), and so on. dist(w,u,d) for the example is (((2-1) + (6-1)) + ((5-2) + (6-5)) + ((9-2) + (9-6))) / (2 · 3) = 20/6 ≈ 3.33.
Average minimum pair distance • It is defined as the average of the shortest distance between each occurrence of the least frequently occurring word and any occurrence of the other word. • In the example, u is the least frequently occurring word, so dist(w,u,d) = ((2 − 1) + (6 − 5)) / 2 = 1.
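A minimal Python sketch (not the authors' code) of the three distance measures, assuming 1-based token positions and using the running example positions inferred above:

```python
# Three proximity distance measures between two words w and u,
# given their occurrence positions in a document.

def min_pair_distance(pos_w, pos_u):
    """Minimum distance between any occurrence of w and any occurrence of u."""
    return min(abs(i - j) for i in pos_w for j in pos_u)

def avg_pair_distance(pos_w, pos_u):
    """Average distance over all position combinations of w and u."""
    dists = [abs(i - j) for i in pos_w for j in pos_u]
    return sum(dists) / len(dists)

def avg_min_pair_distance(pos_w, pos_u):
    """Average, over each occurrence of the less frequent word,
    of the shortest distance to any occurrence of the other word."""
    rare, other = (pos_w, pos_u) if len(pos_w) <= len(pos_u) else (pos_u, pos_w)
    return sum(min(abs(i - j) for j in other) for i in rare) / len(rare)

# Running example: w at positions {1, 5, 9}, u at positions {2, 6}.
pos_w, pos_u = [1, 5, 9], [2, 6]
print(min_pair_distance(pos_w, pos_u))            # 1
print(round(avg_pair_distance(pos_w, pos_u), 2))  # 3.33
print(avg_min_pair_distance(pos_w, pos_u))        # 1.0
```

The three values reproduce the example figures given above (1, 3.33 and 1).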
Proximity-Based Translation Language Model • Then, the probability of translating word u into word w can be estimated as follows:
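The estimation formula is missing from the extracted slide. A plausible form, assuming pcf is aggregated over the collection and normalized over the vocabulary so that the translation probabilities for u sum to one (an assumption, not necessarily the authors' exact estimator):

$$ p_t(w \mid u) = \frac{\sum_{d \in C} pcf(w, u, d)}{\sum_{w'} \sum_{d \in C} pcf(w', u, d)} $$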
Proximity-Based Translation Language Model • Optimizing Self-translation Probability • In order to satisfy the constraints defined in [5], we adjust the translation language model as follows:
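The adjusted model is not shown in the extracted text. A common way to guarantee a sufficiently large self-translation probability, consistent with the parameter s described in the experiments (a sketch under that assumption, not necessarily the authors' exact formulation):

$$ p'_t(w \mid u) = \begin{cases} s + (1 - s)\, p_t(w \mid u) & \text{if } w = u \\ (1 - s)\, p_t(w \mid u) & \text{otherwise} \end{cases} $$

With this adjustment, every word translates to itself with probability at least s, which controls how much weight exact matching retains relative to semantic matching.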
Experiments • The experiments in this section use three main document collections: • ad hoc data in TREC7 with TREC topics 351-400 and 528,155 articles • WSJ news articles with TREC topics 51-100 • technical reports in DOE abstracts with TREC topics 51-100.
Experiments - Parameter Sensitivity Study • An important issue that may affect the robustness of the TM-P models is the sensitivity of their parameters s and σ. • The parameter s controls the amount of self-translation probability. • The kernel parameter σ determines the distance within which words are considered to be related. • In this section, we study how sensitive the MAP measure is to these two parameters.
Conclusion • In this paper, a new type of translation language model, called TM-P, is proposed by explicitly incorporating proximity information into the existing translation language model. • The corresponding models based on these measures, TM-P1, TM-P2 and TM-P3, are evaluated on three standard TREC collections. • The experimental results indicate that the TM-P models are more effective than the state-of-the-art translation models. • Comparing the three variants of TM-P, TM-P3 is more effective than TM-P1 and TM-P2. In the future, we will study how to apply TM-P to other text processing tasks.