Compact Query Term Selection Using Topically Related Text K. Tamsin Maxwell, W. Bruce Croft SIGIR 2013
Outline • Introduction • Related Work • Principle for Term Selection • PhRank Algorithm • Evaluation Framework • Experiments • Conclusion
Introduction • Recent query reformulation techniques usually use pseudo-relevance feedback (PRF). But because they consider words that are not in the original query, the expansion may include peripheral words and cause query drift • PhRank also uses PRF, but uses it for in-query term selection: each candidate term contains 1-3 words and is ranked with a score derived from a word co-occurrence graph • Advantages of PhRank: • It is the first method to use PRF for in-query term selection • Only a small number of terms are selected, retaining the flexibility to add more or longer terms if required • The affinity graph captures aspects of both syntactic and non-syntactic word associations
Related Work • Markov chain framework • The Markov chain framework uses the stationary distribution of a random walk over an affinity graph $G$ to estimate the importance of vertices in the graph • A random walk describes a succession of random or semi-random steps between vertices $v_i$ and $v_j$ in $G$ • If we define the transition probability between $v_i$ and $v_j$ as $p_{ij}$, and $a_j^{(t)}$ as the affinity score of $v_j$ at time $t$, then $a_j^{(t+1)} = \sum_i p_{ij}\, a_i^{(t)}$ is the sum of scores for each $v_i$ connected to $v_j$
Related Work • Sometimes a step may reach a vertex that is unconnected, so we define a minimum probability $1/|V|$, where $|V|$ is the number of vertices in $G$. A factor $\lambda$ then controls the balance between the transition probability and the minimum probability: $p'_{ij} = \lambda\, p_{ij} + (1-\lambda)\,\frac{1}{|V|}$
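A minimal sketch of this random walk as power iteration, assuming a row-stochastic transition matrix `P` and an illustrative damping factor `lam`; the 0.0001 convergence threshold matches the stopping criterion mentioned later in these slides:

```python
import numpy as np

def stationary_distribution(P, lam=0.85, tol=1e-4, max_iter=1000):
    """Power iteration for a random walk with a uniform minimum probability.

    P   : (n, n) row-stochastic transition matrix, P[i, j] = p_ij
    lam : balance between following edges and jumping uniformly
    tol : stop when no vertex score changes by more than tol
    """
    n = P.shape[0]
    a = np.full(n, 1.0 / n)        # initial affinity scores
    uniform = np.full(n, 1.0 / n)  # minimum probability 1/|V|
    for _ in range(max_iter):
        a_next = lam * (a @ P) + (1.0 - lam) * uniform
        if np.max(np.abs(a_next - a)) < tol:
            return a_next
        a = a_next
    return a
```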
Principle for Term Selection • An informative word: • Is informative relative to a query: a word should represent the meaning of the query, but a query usually does not carry enough information on its own, so PRF is used to enrich the query representation • Is related to other informative words: the Association Hypothesis states that "if one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this". With an affinity graph, we capture this by estimating how many words connect to a target word and the values on those edges
Principle for Term Selection • An informative term: • Contains informative words: since all good terms must contain informative words, we consider individual words when ranking terms • Is discriminative in the retrieval collection: a term that occurs many times within a small number of documents gives a pronounced relevance signal, so we weight terms with a normalized tf.idf-inspired weight
The PhRank Algorithm • Graph construction • For a query $q$, we first retrieve the top-ranked documents, and define the set $R$ as the query itself plus its pseudo-relevant documents • Words in $R$ are stemmed, and each unique stem becomes a vertex in graph $G$ • Vertices $v_i$ and $v_j$ are connected by an edge if words $w_i$ and $w_j$ are adjacent in $R$ • Edge weights • The transition probability is based on a linear combination of the counts of $w_i$ and $w_j$ co-occurring within windows of size 2 and 10
The PhRank Algorithm • Edge weights are defined by $e_{ij} = \sum_{d \in R} p(d \mid q)\,\big(c_{ij}^{2}(d) + c_{ij}^{10}(d)\big)\,\delta_{ij}(d)$, where $p(d \mid q)$ is the probability of a document $d$ in which words $w_i$ and $w_j$ co-occur given $q$, and $c_{ij}^{2}(d)$ and $c_{ij}^{10}(d)$ are the counts of $w_i$ and $w_j$ co-occurring within windows of size 2 and 10 • $\delta_{ij}(d)$ is a style weight that reflects the importance of the association between $w_i$ and $w_j$ in $d$
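A rough sketch of accumulating these edge weights, assuming an even mix of the two window sizes and omitting the style weight $\delta$; the function names, `alpha`, and the per-document `doc_probs` input are illustrative:

```python
from collections import Counter

def cooccurrence_counts(tokens, window):
    """Count unordered stem pairs co-occurring within `window` tokens."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + window]:
            if v != w:
                counts[tuple(sorted((w, v)))] += 1
    return counts

def edge_weights(docs_tokens, doc_probs, alpha=0.5):
    """Combine window-2 and window-10 counts, scaled by p(d | q).

    docs_tokens : list of stemmed token lists, one per document in R
    doc_probs   : p(d | q) for each document
    alpha       : mixing weight between the two windows (assumed)
    """
    weights = Counter()
    for tokens, p_d in zip(docs_tokens, doc_probs):
        c2 = cooccurrence_counts(tokens, 2)
        c10 = cooccurrence_counts(tokens, 10)
        for pair in set(c2) | set(c10):
            weights[pair] += p_d * (alpha * c2[pair] + (1 - alpha) * c10[pair])
    return weights
```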
The PhRank Algorithm • Random walk • A random walk over $G$ proceeds as described in Related Work • The edge weights are normalized to sum to one • The iteration stops when the change at every vertex does not exceed 0.0001 • Vertex weights • Words are also weighted so that they exhaustively represent the query. Some words, like "make", score highly in the affinity graph but are not particularly informative
The PhRank Algorithm • We define a saliency factor $s_i$ to balance exhaustiveness with global saliency, identifying stems that are poor discriminators between relevant and non-relevant documents • For a word $w_i$, $f_R(w_i)$ is the frequency of $w_i$ in $R$, and $f_C(w_i)$ is the frequency of $w_i$ in the collection $C$
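The exact weighting formula is not reproduced in these slides; as a stand-in, a tf.idf-flavoured saliency with the same inputs might look like the sketch below, where the formula itself and all names are assumptions:

```python
import math

def saliency(word, freq_R, freq_C, total_C):
    """tf.idf-flavoured saliency: reward stems that are frequent in the
    pseudo-relevant set R but rare in the whole collection C.

    freq_R  : dict, frequency of each stem in R
    freq_C  : dict, frequency of each stem in the collection C
    total_C : total token count of C
    """
    # Assumed stand-in, not the published PhRank formula.
    return freq_R.get(word, 0) * math.log(total_C / (1 + freq_C.get(word, 0)))
```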
The PhRank Algorithm • Term ranking • For a term $x$, a factor $g_x$ represents the degree to which the term is discriminative in a collection. $g_x$ is defined using $f_C(x)$, the frequency with which the words in $x$ co-occur in the collection within a window of 4 times the number of words in $x$; $f_R(x)$ is defined analogously over $R$ • Finally, the rank of a term $x$ for $q$ is defined by combining the affinity and saliency scores of its words with $g_x$
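A sketch of that final combination, under the assumption that per-word affinity and saliency scores multiply with the term-level factor $g_x$; the exact combination in the paper may differ:

```python
def score_term(term_words, affinity, sal, g_x):
    """Combine per-word affinity (random-walk score) and saliency with
    the term-level discriminativeness factor g_x.

    The multiplicative combination here is an assumption for illustration.
    """
    score = g_x
    for w in term_words:
        score *= affinity[w] * sal[w]
    return score
```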
The PhRank Algorithm • After ranking, some terms still include uninformative words. Because terms are ranked by their overall score, several terms may contain similar words, which reduces diversity • We apply a simple filter with top-down constraints, as sketched below • For a term $x$: if a higher-ranked term contains all the words in $x$, or $x$ contains all the words in a higher-ranked term, we discard $x$
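A minimal sketch of this subsumption filter, assuming terms arrive sorted best-first as whitespace-delimited strings:

```python
def filter_terms(ranked_terms):
    """Top-down subsumption filter over terms sorted best-first.

    Drop a term if an already-kept (higher-ranked) term contains all of
    its words, or if it contains all the words of a kept term.
    """
    kept = []
    for term in ranked_terms:
        words = set(term.split())
        if any(words <= set(k.split()) or words >= set(k.split())
               for k in kept):
            continue
        kept.append(term)
    return kept
```

For example, `filter_terms(["white house", "white house staff"])` keeps only `"white house"`, since the lower-ranked term contains all the words of the kept one.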
Evaluation Framework • Robustness • We compare against the sequential dependence variant of the Markov random field model. This model linearly combines query likelihood with bigram features over windows of size 2 and 8; an illustrative query is sketched after this list • Precision • The subset distribution model achieves high mean average precision and serves as the precision baseline • Succinctness • We use Key Concepts as the succinctness baseline. This approach linearly combines a bag-of-words query representation with a weighted bag-of-words representation
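For concreteness, a sketch of a sequential dependence query in the Indri query language, built here as a Python string. The `#1` (exact ordered) and `#uw8` (unordered window 8) operators and the 0.8/0.1/0.1 weights are the common SDM defaults, assumed rather than taken from these slides:

```python
def sdm_query(terms, w_t=0.8, w_o=0.1, w_u=0.1):
    """Build an Indri sequential-dependence query string from query terms."""
    bigrams = [f"{a} {b}" for a, b in zip(terms, terms[1:])]
    uni = " ".join(terms)
    ordered = " ".join(f"#1({b})" for b in bigrams)
    unordered = " ".join(f"#uw8({b})" for b in bigrams)
    return (f"#weight( {w_t} #combine({uni}) "
            f"{w_o} #combine({ordered}) "
            f"{w_u} #combine({unordered}) )")

# sdm_query(["white", "house", "staff"]) mixes unigram, ordered-bigram,
# and unordered-window evidence for the same terms.
```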
Evaluation Framework • Word dependence • We consider four models of phrase belief, as shown in the figure
Experiments • We use Indri on Robust04, WT10G and GOV2 for evaluation • Feature analysis • Here we list the results of using each of the features in PhRank
Experiments • Comparison with other models
Conclusion • PhRank is a novel method for selecting succinct terms within a query, built on the Markov chain framework • Although the selected terms are succinct, this is a risky strategy and can decrease MAP compared with sequential dependence