Compact Query Term Selection Using Topically Related Text Date: 2013/10/09 Source: SIGIR'13 Authors: K. Tamsin Maxwell, W. Bruce Croft Advisor: Dr. Jia-ling Koh Speaker: Shun-Chen Cheng
Outline • Introduction • The PhRank Algorithm • Graph Construction • Edge Weight • Random Walk • Vertex weights • Term ranking • Diversity filter • Experiment • Conclusions
Introduction Query: Locations of volcanic activity which occurred within the present day boundaries of the U.S. and its territories.
Introduction • Long queries contain words that are peripheral or shared across many topics, so expansion is prone to query drift. • Past approaches: jointly optimize weights and term selection using both global statistics and local syntactic features. • Shortcomings: they fail to detect or differentiate informative terms, do not identify all the informative relations, and do not reflect local query context.
Introduction • Goal: a novel term ranking algorithm, PhRank, that extends work on Markov chain frameworks for query expansion to select compact and focused terms from within a query itself.
Outline • Introduction • The PhRank Algorithm • Graph Construction • Edge Weight • Random Walk • Vertex weights • Term ranking • Diversity filter • Experiment • Conclusions
Principles for Term Selection • An informative word: • Is informative relative to a query: it accurately represents the meaning of the query. • Is related to other informative words: if one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this. • Contains informative words: all terms must contain informative words. • Is discriminative in the retrieval collection: a term that occurs many times within a small number of documents gives a pronounced relevance signal.
Graph Construction • C: retrieval collection & English Wikipedia Example: • Q: a b • Top k documents: d1, d2 (if k = 2) • N (neighborhood set): {d0, d1, d2}, where d0 is the query encoded as a document Graph G: d1: c b e d2: a f b a f e b c
Edge Weight • The counts of stem co-occurrence within window sizes 2 and 10 in N • The probability of a document in which the stems i and j co-occur, given Q • With idf weighting, the factor r confirms the importance of a connection between i and j in N
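The window-based co-occurrence counting behind the edge weight can be sketched as follows; this is a minimal illustration of counting stem pairs within windows of size 2 and 10 over the neighborhood set N (function and variable names are mine, and the toy documents reuse the d1/d2 stems from the graph-construction example), not the paper's implementation.

```python
from collections import Counter

def cooccurrence_counts(docs, window):
    """Count how often each unordered stem pair co-occurs within
    `window` consecutive positions, summed over all documents."""
    counts = Counter()
    for doc in docs:
        for i, wi in enumerate(doc):
            # Pair the stem at position i with every stem in the
            # following window - 1 positions.
            for wj in doc[i + 1 : i + window]:
                if wi != wj:
                    counts[tuple(sorted((wi, wj)))] += 1
    return counts

# Toy neighborhood set N (already stemmed)
N = [["c", "b", "e"], ["a", "f", "b", "a", "f", "e", "b", "c"]]
c2 = cooccurrence_counts(N, window=2)    # adjacent stems only
c10 = cooccurrence_counts(N, window=10)  # wider context window
```

With window = 2 only adjacent stems are paired, so `c2[("a", "f")]` is 2 (the pattern "a f" occurs twice in d2), while the window-10 counts pick up looser associations across each document.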
Random Walk Example: a 3-node graph with row-stochastic transition matrix

H = | 0.6    0.005  0.395 |
    | 0.1    0.8    0.1   |
    | 0.009  0.9    0.091 |

If the walk starts from node 1 at time 0, the distribution after one step is [1 0 0] H = [0.6 0.005 0.395], so the probability of walking to node 3 at time 1 is 0.395.
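The one-step computation on the slide can be reconstructed with a vector-matrix product; the row order below takes node 1's row to be (0.6, 0.005, 0.395), which is the ordering consistent with the product [1 0 0] H = [0.6 0.005 0.395] given on the slide.

```python
import numpy as np

# Row-stochastic transition matrix H: row i gives the probabilities
# of moving from node i+1 to nodes 1, 2, 3.
H = np.array([
    [0.6,   0.005, 0.395],
    [0.1,   0.8,   0.1  ],
    [0.009, 0.9,   0.091],
])

p0 = np.array([1.0, 0.0, 0.0])  # start at node 1 at time 0
p1 = p0 @ H                     # distribution over nodes at time 1
# p1[2] is the probability of being at node 3 at time 1: 0.395
```

Iterating `p = p @ H` until `p` stops changing gives the stationary distribution that a PageRank-style affinity score is built on.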
Vertex weights • The factor s balances exhaustivity with global saliency to identify stems that are poor discriminators between relevant and non-relevant documents • Uses the frequency of a word wn in N, averaged over k + 1 documents and normalized by the maximum average frequency of any term in N • dfwn: the number of documents in C containing wn • TREC query #840: 'Give the definition, locations, or characteristics of geysers.' => the term "definition geysers" is not more informative than "geysers" alone
Example |N| = 3, |C| = 35 • wn = geysers: average frequency of "geysers" in N = 12/3 = 4; maximum average frequency of any term in N = 4; dfwn = 3 • wn = definition: average frequency of "definition" in N = 2/3; maximum average frequency of any term in N = 4; dfwn = 1
Term ranking • Input: all combinations of 1-3 words in a query that are not stopwords • Output: a ranked list sorted by f(x, Q) score • To avoid a bias towards longer terms, a term x is scored by averaging the affinity scores of its component words • The factor zx represents the degree to which the term is discriminative in a collection, based on the frequency of xe in C
Query: Locations of volcanic activity which occurred within the present day boundaries of the U.S. and its territories. • Example: term x = "volcanic boundaries"; term x = "volcanic U.S."
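The averaging-plus-discrimination step can be sketched as below; the affinity values and the idf-style form of the zx factor are illustrative assumptions of mine, not numbers or equations from the paper.

```python
import math

def term_score(affinity, words, freq_in_C, C_size):
    """Score a candidate term: average the random-walk affinity of its
    component words (so longer terms gain no length bias), then scale
    by a discrimination factor z_x. The idf-style z_x is an assumption."""
    avg_affinity = sum(affinity[w] for w in words) / len(words)
    z_x = math.log(C_size / (1 + freq_in_C))
    return avg_affinity * z_x

# Illustrative per-word affinity scores from the random walk
affinity = {"volcanic": 0.30, "boundaries": 0.10, "u.s": 0.05}

s1 = term_score(affinity, ["volcanic", "boundaries"], freq_in_C=4, C_size=1000)
s2 = term_score(affinity, ["volcanic", "u.s"], freq_in_C=40, C_size=1000)
```

In this toy setting "volcanic boundaries" outranks "volcanic U.S.": its component words carry more affinity and the term is rarer in the collection.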
Outline • Introduction • The PhRank Algorithm • Diversity filter • Experiment • Conclusions
Diversity filter • PhRank often assigns a high rank to multi-word terms that contain only one highly informative word • For example, for the query "the destruction of Pan Am Flight 103 over Lockerbie, Scotland", the term "pan flight 103" is informative, but "pan" is uninformative by itself Example: • Way 1: "declining birth" and "birth rate" are ranked above "declining birth rate", so the longer term is discarded, on the assumption that the shorter terms better represent the information need and the longer term is redundant. • Way 2: "birth rate china" is ranked above "birth rate", so the shorter term is discarded, on the assumption that the longer term better represents the information need.
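Both filtering directions follow from one rule: walk the ranked list top-down and drop any term whose word set overlaps (as subset or superset) with an already-kept term. This is a sketch of the redundancy idea; the paper's exact filtering rules may differ.

```python
def diversity_filter(ranked_terms):
    """Keep terms in rank order, discarding any term whose words are a
    subset or superset of an already-kept term's words (a sketch of the
    redundancy filter, not necessarily PhRank's exact rules)."""
    kept = []
    for term in ranked_terms:
        words = set(term.split())
        if any(words <= set(k.split()) or words >= set(k.split())
               for k in kept):
            continue  # redundant with a higher-ranked term
        kept.append(term)
    return kept

# Way 1: shorter terms ranked higher, so the longer term is discarded
way1 = diversity_filter(["declining birth", "birth rate",
                         "declining birth rate"])
# → ['declining birth', 'birth rate']

# Way 2: the longer term ranked higher, so the shorter term is discarded
way2 = diversity_filter(["birth rate china", "birth rate"])
# → ['birth rate china']
```

Because the rule always defers to whichever variant ranks higher, the same code produces both behaviors on the slide.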
Outline • Introduction • The PhRank Algorithm • Diversity filter • Experiment • Conclusions
Experiment • Dataset (F: excluded from features, T: included in features)
Experiment • TREC description topics • TREC title queries
Outline • Introduction • The PhRank Algorithm • Diversity filter • Experiment • Conclusions
Conclusions • We presented PhRank, a novel term ranking algorithm that extends work on Markov chain frameworks for query expansion to select focused and succinct terms from within a query. • For all collections, around 26% of queries show more than a 5% decrease in MAP compared to SD. • Efficiency concerns surrounding the time to construct an affinity graph may be ameliorated by off-line indexing to precompute a language model for each document in a collection.