
Compact Query Term Selection Using Topically Related Text



Presentation Transcript


  1. Compact Query Term Selection Using Topically Related Text Date: 2013/10/09 Source: SIGIR’13 Authors: K. Tamsin Maxwell, W. Bruce Croft Advisor: Dr. Jia-ling Koh Speaker: Shun-Chen Cheng

  2. Outline • Introduction • The PhRank Algorithm • Graph Construction • Edge Weight • Random Walk • Vertex weights • Term ranking • Diversity filter • Experiment • Conclusions

  3. Introduction Query: Locations of volcanic activity which occurred within the present day boundaries of the U.S. and its territories.

  4. Introduction • Long queries contain words that are peripheral or shared across many topics, so expansion is prone to query drift. • Past work: jointly optimize weights and term selection using both global statistics and local syntactic features. • Shortcomings: fails to detect or differentiate informative terms; does not identify all the informative relations; does not reflect local query context.

  5. Introduction • Goal: a novel term ranking algorithm, PhRank, that extends work on Markov chain frameworks for query expansion to select compact and focused terms from within the query itself.

  6. Outline • Introduction • The PhRank Algorithm • Graph Construction • Edge Weight • Random Walk • Vertex weights • Term ranking • Diversity filter • Experiment • Conclusions

  7. Principles for Term Selection • An informative term: • Is informative relative to the query: it accurately represents the meaning of the query. • Is related to other informative words: if one index term is good at discriminating relevant from non-relevant documents, then any closely associated index term is also likely to be good at this. • Contains informative words: all terms must contain informative words. • Is discriminative in the retrieval collection: a term that occurs many times within a small number of documents gives a pronounced relevance signal.

  8. Graph Construction • C: retrieval collection and English Wikipedia. Example: • Q: a b • Top k documents (k = 2): d1, d2 • N (neighborhood set): {d0, d1, d2}, where d0 is the query encoded as a pseudo-document • d1: c b e • d2: a f b a f e b c • Graph G is built over the vertices a, f, e, b, c.
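The construction above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes each stem in N becomes a vertex and each co-occurrence within a small window adds weight to an edge (function and variable names are illustrative).

```python
from collections import defaultdict

def build_graph(docs, window=2):
    """Build an undirected co-occurrence graph over stems.

    docs: list of tokenized documents (the neighborhood set N,
    including d0, the query itself encoded as a pseudo-document).
    Returns the vertex set and {(i, j): count} with i < j.
    """
    edges = defaultdict(int)
    vertices = set()
    for doc in docs:
        vertices.update(doc)
        for pos, w in enumerate(doc):
            # pair each stem with the stems that follow it inside the window
            for other in doc[pos + 1 : pos + window]:
                if w != other:
                    edges[tuple(sorted((w, other)))] += 1
    return vertices, dict(edges)

# Toy example from the slide: Q = "a b", d1 = "c b e", d2 = "a f b a f e b c"
N = [["a", "b"], ["c", "b", "e"], ["a", "f", "b", "a", "f", "e", "b", "c"]]
verts, edges = build_graph(N, window=2)
```

With window = 2 only adjacent stems are linked; PhRank also counts a wider window (size 10) when weighting edges, which the same function supports via the `window` parameter.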

  9. Edge Weight • w2_ij, w10_ij: the counts of co-occurrence of stems i and j within windows of size 2 and 10 in N. • r_ij: the probability of the document in which the stems i and j co-occur, given Q. • The co-occurrence counts are combined with idf weights; the factor r confirms the importance of a connection between i and j in N.
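One plausible reading of this slide is sketched below. The exact combination used in the paper may differ; this sketch only assumes that the two windowed co-occurrence counts are combined, idf-weighted, and scaled by the factor r.

```python
def edge_weight(c2, c10, idf_i, idf_j, r):
    """Illustrative edge weight for stems i and j (not the paper's exact formula).

    c2, c10     : co-occurrence counts of i and j in windows of
                  size 2 and 10 over the neighborhood set N
    idf_i, idf_j: inverse document frequencies of the two stems
    r           : probability of the document in which i and j
                  co-occur, given Q; boosts connections that appear
                  in documents closely related to the query
    """
    return r * (c2 + c10) * idf_i * idf_j

# Stems that co-occur often in query-related documents get heavier edges
w_strong = edge_weight(c2=5, c10=12, idf_i=2.3, idf_j=1.7, r=0.8)
w_weak = edge_weight(c2=1, c10=2, idf_i=0.4, idf_j=0.6, r=0.1)
```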

  10. Random Walk Example: a 3-node graph with row-stochastic transition matrix H, where row i gives the probabilities of stepping from node i: H = [[0.009, 0.900, 0.091], [0.100, 0.800, 0.100], [0.600, 0.005, 0.395]]. If the walk starts from node 1 at time t = 0, the distribution at time t = 1 is [1, 0, 0]·H = [0.009, 0.900, 0.091], so the probability of walking to node 3 at t = 1 is 0.091.
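The one-step computation on this slide is just a vector-matrix product. A small sketch (the third entry of the first row is assumed so that the row sums to 1, since that value is unreadable on the slide):

```python
import numpy as np

# Row-stochastic transition matrix for the slide's 3-node toy example
# (rows = current node, columns = next node)
H = np.array([
    [0.009, 0.900, 0.091],  # from node 1
    [0.100, 0.800, 0.100],  # from node 2
    [0.600, 0.005, 0.395],  # from node 3
])

start = np.array([1.0, 0.0, 0.0])  # the walk begins at node 1 at t = 0
after_one_step = start @ H         # distribution over nodes at t = 1
p_node3 = after_one_step[2]        # probability of being at node 3
```

Iterating `dist = dist @ H` until the distribution stops changing gives the stationary affinity scores that PhRank reads off the graph.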

  11. Vertex weights • Factor s balances exhaustivity with global saliency to identify stems that are poor discriminators between relevant and non-relevant documents. • Exhaustivity: the frequency of a word w_n in N, averaged over the k + 1 documents and normalized by the maximum average frequency of any term in N. • Saliency: based on the number of documents in C containing w_n. • TREC query #840: ‘Give the definition, locations, or characteristics of geysers’. => “definition geysers” is not more informative than “geysers” alone.

  12. Example • |N| = 3, |C| = 35 • w_n = geysers: average frequency of “geysers” in N = 12/3 = 4; max average frequency of any term in N = 4; df_wn = 3 • w_n = definition: average frequency of “definition” in N = 2/3; max average frequency of any term in N = 4; df_wn = 1
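The slide's numbers can be plugged into a sketch of the two ingredients of the vertex weight. This assumes the saliency part is an idf-style log(|C|/df) and the exhaustivity part is the normalized average frequency; the paper's exact combination may differ.

```python
import math

def vertex_weight(avg_tf, max_avg_tf, df, num_docs):
    """Illustrative vertex weight s for a stem (not the paper's exact formula).

    avg_tf     : frequency of the stem in N, averaged over the k+1 docs
    max_avg_tf : maximum average frequency of any stem in N
    df         : number of documents in the collection C containing the stem
    num_docs   : |C|
    """
    exhaustivity = avg_tf / max_avg_tf  # how prominent the stem is in N
    saliency = math.log(num_docs / df)  # how discriminative it is in C
    return exhaustivity * saliency

# Slide example: |N| = 3, |C| = 35
s_geysers = vertex_weight(avg_tf=12 / 3, max_avg_tf=4, df=3, num_docs=35)
s_definition = vertex_weight(avg_tf=2 / 3, max_avg_tf=4, df=1, num_docs=35)
```

Under these assumptions “geysers” scores higher than “definition”, matching the slide's point that “definition geysers” adds nothing over “geysers”.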

  13. Term ranking • Input: all combinations of 1–3 words in a query that are not stopwords. • Output: a ranked list sorted by f(x, Q) score. • To avoid a bias towards longer terms, a term x is scored by averaging the affinity scores of its component words. • A factor z_x represents the degree to which the term is discriminative in the collection, based on the frequency of the term in C.
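The candidate generation and scoring steps can be sketched as follows. This assumes f(x, Q) multiplies the average random-walk affinity of the component words by an idf-style discriminativeness factor z_x; the names and the exact form of z_x are illustrative, not taken from the paper.

```python
import itertools
import math

def candidate_terms(query_words, stopwords, max_len=3):
    """All combinations of 1-3 non-stopword query words."""
    content = [w for w in query_words if w not in stopwords]
    for n in range(1, max_len + 1):
        yield from itertools.combinations(content, n)

def score_term(term, affinity, term_freq_in_C, num_docs_C):
    """f(x, Q): average affinity of component words, scaled by z_x."""
    avg_affinity = sum(affinity[w] for w in term) / len(term)
    z_x = math.log(num_docs_C / (1 + term_freq_in_C))  # discriminativeness
    return avg_affinity * z_x

# Toy usage: affinity scores would come from the converged random walk
affinity = {"volcanic": 0.30, "activity": 0.20, "locations": 0.05}
terms = list(candidate_terms(["locations", "of", "volcanic", "activity"],
                             stopwords={"of"}))
ranked = sorted(terms,
                key=lambda t: score_term(t, affinity, term_freq_in_C=10,
                                         num_docs_C=1000),
                reverse=True)
```

Averaging (rather than summing) the component affinities is what keeps three-word terms from automatically outranking single informative words.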

  14. Query: Locations of volcanic activity which occurred within the present day boundaries of the U.S. and its territories. • Example: term x = volcanic boundaries; term x = volcanic U.S.

  15. Outline • Introduction • The PhRank Algorithm • Diversity filter • Experiment • Conclusions

  16. Diversity filter • PhRank often assigns a high rank to multi-word terms that contain only one highly informative word. • For example, for the query “the destruction of Pan Am Flight 103 over Lockerbie, Scotland”, the term ‘pan flight 103’ is informative, but “pan” is uninformative by itself. • Example with ranked terms ‘declining birth’, ‘birth rate’, ‘declining birth rate’, ‘birth rate china’: • Way 1: discard the longer terms (‘declining birth rate’, ‘birth rate china’) on the assumption that the shorter terms better represent the information need and the longer term is redundant. • Way 2: keep the longer term instead, on the assumption that the longer term better represents the information need.
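Way 1 on this slide can be sketched as a single top-down pass over the ranked list, discarding any term that properly contains an already-kept, higher-ranked term (names are illustrative):

```python
def diversity_filter(ranked_terms):
    """Drop a term when a higher-ranked term is contained within it.

    ranked_terms: list of terms (tuples of words), best first.
    Assumes the shorter, higher-ranked term already covers the
    information need, so the longer term is redundant (Way 1).
    """
    kept = []
    for term in ranked_terms:
        words = set(term)
        # a kept term that is a proper subset makes this term redundant
        if any(set(k) < words for k in kept):
            continue
        kept.append(term)
    return kept

# Slide example: the short terms outrank the longer terms containing them
ranked = [("declining", "birth"),
          ("birth", "rate"),
          ("declining", "birth", "rate"),
          ("birth", "rate", "china")]
filtered = diversity_filter(ranked)
```

Way 2 would invert the test, discarding the shorter term when a longer, lower-ranked term contains it.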

  17. Outline • Introduction • The PhRank Algorithm • Diversity filter • Experiment • Conclusions

  18. Experiment • Dataset • Feature notation: F = excluded from the feature set, T = included in the feature set.

  19. Experiment

  20. Experiment • TREC description topics • TREC title queries

  21. Outline • Introduction • The PhRank Algorithm • Diversity filter • Experiment • Conclusions

  22. Conclusions • We have presented PhRank, a novel term ranking algorithm that extends work on Markov chain frameworks for query expansion to select focused and succinct terms from within a query. • For all collections, around 26% of queries show more than a 5% decrease in MAP compared to SD. • Efficiency concerns surrounding the time to construct an affinity graph may be ameliorated by off-line indexing to precompute a language model for each document in a collection.
