Phrase Identification from Queries and Its Use for Web Search. Fuchun Peng Microsoft Bing 7/23/2010. Motivation. Query is often treated as a bag of words But when people are formulating queries, they use “concepts” as building blocks. sports psychology (course). simmons college ’s.
Phrase Identification from Queries and Its Use for Web Search Fuchun Peng Microsoft Bing 7/23/2010
Motivation
• A query is often treated as a bag of words
• But when people formulate queries, they use “concepts” as building blocks
• Q: simmons college sports psychology
• A1: “simmons college”, “sports psychology”, i.e. simmons college’s sports psychology (course)
• A2: “college sports”
• Can we automatically segment the query to recover the concepts?
Outline • Summary of Segmentation approaches • Use for Improving Search Relevance • Query rewriting • Ranking features • Conclusions
Supervised Segmentation
• Supervised learning (Bergsma et al., EMNLP-CoNLL 2007)
• Binary decision at each possible segmentation point, e.g. w1 w2 w3 w4 w5 with decisions Y N N Y
• Features: POS, web counts, the, and, …
• Problem:
  • Limited-range context
  • Features specifically designed for noun phrases
Training Data Annotation • Manual Data Preparation • Linguistic driven • [San jose international airport] • Relevance driven • [San jose] [international airport]
Mutual-Information Based (Risvik et al., WWW 2003)
• MI(w1, w2) = P(w1 w2) / (P(w1) P(w2))
• [Figure: MI of the adjacent word pairs (1,2), (2,3), (3,4), (4,5) of w1 w2 w3 w4 w5, plotted against a threshold]
• Insert a segment boundary where MI falls below the threshold, e.g. w1 w2 | w3 w4 w5
• Iterative update
• Problem:
  • Only captures short-range correlation (between adjacent words)
  • What about “my heart will go on”?
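The thresholding step of the mutual-information approach can be sketched as follows. This is a minimal single-pass illustration, not the implementation from Risvik et al.: the function name `mi_segment`, the toy probabilities, and the threshold value are all assumptions made up for the example, and the "iterative update" step is not modeled.

```python
def mi_segment(words, unigram_p, bigram_p, threshold):
    """Split `words` wherever the MI of an adjacent pair falls below `threshold`."""
    if not words:
        return []
    segments = [[words[0]]]
    for left, right in zip(words, words[1:]):
        # MI as written on the slide: P(w1 w2) / (P(w1) P(w2)).
        mi = bigram_p.get((left, right), 0.0) / (unigram_p[left] * unigram_p[right])
        if mi < threshold:
            segments.append([right])      # weak association: start a new segment
        else:
            segments[-1].append(right)    # strong association: extend the segment
    return segments


# Hypothetical probabilities, invented purely for illustration.
toy_unigram_p = {"simmons": 1e-5, "college": 1e-4, "sports": 1e-4, "psychology": 2e-5}
toy_bigram_p = {
    ("simmons", "college"): 5e-7,
    ("college", "sports"): 2e-8,
    ("sports", "psychology"): 4e-7,
}
toy_query = ["simmons", "college", "sports", "psychology"]
toy_segments = mi_segment(toy_query, toy_unigram_p, toy_bigram_p, threshold=50.0)
```

Because each decision looks only at one adjacent pair, a phrase such as "my heart will go on" can be cut at a weakly associated pair even though the whole phrase is a single concept, which is the limitation noted on the slide.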
LM-Based Approach (Tan & Peng, WWW 2008)
• Assume the query is generated by independent sampling from a probability distribution of concepts (a unigram model)
• [simmons college] [sports psychology]: P = 0.000016 × 0.000002, with P(simmons college) = 0.000016 and P(sports psychology) = 0.000002
• [simmons] [college sports] [psychology]: P = 0.000007 × 0.000006 × 0.000024, with P(simmons) = 0.000007, P(college sports) = 0.000006, and P(psychology) = 0.000024
• The first segmentation is more probable: 0.000016 × 0.000002 > 0.000007 × 0.000006 × 0.000024
• Enumerate all possible segmentations; rank by probability of being generated by the unigram model
• How to estimate parameters P(w) for the unigram model?
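The enumerate-and-rank step can be sketched in a few lines. The concept probabilities below are the ones shown on the slide; the function names and the choice to give unlisted concepts probability 0 are assumptions made for this illustration, not details from Tan & Peng.

```python
from itertools import combinations
from math import prod


def segmentations(words):
    """Yield every way to cut `words` into contiguous segments (2^(n-1) in total)."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [" ".join(words[a:b]) for a, b in zip(bounds, bounds[1:])]


def rank_segmentations(words, concept_p):
    """Score each segmentation as the product of its concept probabilities."""
    scored = []
    for seg in segmentations(words):
        # Assumption: a concept missing from the table gets probability 0.
        scored.append((prod(concept_p.get(c, 0.0) for c in seg), seg))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored


# Concept probabilities taken from the slide.
slide_concept_p = {
    "simmons college": 0.000016,
    "sports psychology": 0.000002,
    "simmons": 0.000007,
    "college sports": 0.000006,
    "psychology": 0.000024,
}
ranked = rank_segmentations("simmons college sports psychology".split(), slide_concept_p)
best_prob, best_segmentation = ranked[0]
```

Here `best_segmentation` is the "simmons college" / "sports psychology" split, matching the comparison on the slide. Full enumeration is exponential in query length, which is tolerable only because queries are short.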
Parameter (Concept Prob.) Estimation I
• We have ngram (n = 1..5) counts in a web corpus
• 464M documents; L = 33B tokens
• Approximate counts for longer ngrams are often computable, e.g. #(harry potter and the goblet of fire) is in [5783, 6399]
• #(ABC) = #(AB) + #(BC) − #(AB OR BC) ≥ #(AB) + #(BC) − #(B)
• Solved by DP
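One way the bound above could be turned into a dynamic program is sketched below. The lower bound applies the slide's inequality with AB and BC chosen as the two sub-spans that are one word shorter; the upper bound (the smaller of those two sub-span counts) is an assumption added here, since the slide gives an interval but not how its upper end is obtained. The function name and the toy counts are hypothetical.

```python
def count_bounds(words, counts, max_n):
    """Return (lower, upper) bounds on #(words) from counts of ngrams up to max_n words."""
    if max_n < 2:
        raise ValueError("max_n must be at least 2 for the overlap bound to apply")
    n = len(words)
    lower, upper = {}, {}
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            if length <= max_n:
                # Short enough to be looked up directly: the bounds coincide.
                lower[i, j] = upper[i, j] = counts.get(tuple(words[i:j]), 0)
            else:
                # AB = words[i:j-1], BC = words[i+1:j], B = words[i+1:j-1].
                # #(ABC) >= #(AB) + #(BC) - #(B), using lower bounds for AB, BC
                # and an upper bound for B so the inequality stays valid.
                lower[i, j] = max(0, lower[i, j - 1] + lower[i + 1, j] - upper[i + 1, j - 1])
                # Every occurrence of ABC contains an occurrence of AB and of BC.
                upper[i, j] = min(upper[i, j - 1], upper[i + 1, j])
    return lower[0, n], upper[0, n]


# Invented counts, with only unigrams and bigrams available (max_n = 2).
toy_counts = {("a",): 20, ("b",): 12, ("c",): 15, ("a", "b"): 10, ("b", "c"): 8}
toy_bounds = count_bounds(["a", "b", "c"], toy_counts, max_n=2)
```

With these toy counts the trigram "a b c" is bounded to the interval [6, 8]: at least 10 + 8 − 12 occurrences, and at most min(10, 8).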
Parameter Estimation
• Maximum likelihood estimate: P_MLE(t) = #(t) / N
• Problem:
  • #(potter and the goblet of) = 6765
  • P(potter and the goblet of) > P(harry potter and the goblet of fire)? Wrong!
  • We want not the prob. of seeing t in text, but the prob. of seeing t as a self-contained concept in text
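Parameter Estimation: Query-relevant web corpus
• Choose parameters to maximize the posterior probability given the query-relevant corpus (equivalently, minimize the total description length)
• t: a query substring
• C(t): longest matching count of t
• D = {(t, C(t))}: query-relevant corpus
• s(t): a segmentation of t
• θ: unigram model parameters (ngram probabilities)
• $\theta = \arg\max_{\theta} P(D \mid \theta)\,P(\theta) = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right]$
  • $P(D \mid \theta)\,P(\theta)$: posterior prob.
  • $\log P(D \mid \theta)$: DL of corpus
  • $\log P(\theta)$: DL of parameters
• $\log P(D \mid \theta) = \sum_{t} C(t)\,\log P(t \mid \theta)$
• $P(t \mid \theta) = \sum_{s(t)} P(s(t) \mid \theta)$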
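The two likelihood formulas on this slide can be evaluated without listing every segmentation, using a forward-style dynamic program. The sketch below covers only the corpus term log P(D|θ); the prior P(θ) and the optimization over θ are not specified on the slide and are left out. Function names and the toy values of θ and C(t) are assumptions for illustration.

```python
from math import log


def substring_prob(words, theta):
    """P(t | theta): sum over all segmentations s(t) of the product of concept probs."""
    n = len(words)
    # alpha[j] holds the total probability of all segmentations of words[:j].
    alpha = [1.0] + [0.0] * n
    for j in range(1, n + 1):
        for i in range(j):
            # The last segment is words[i:j]; unseen concepts are assumed to have prob 0.
            alpha[j] += alpha[i] * theta.get(" ".join(words[i:j]), 0.0)
    return alpha[n]


def log_likelihood(corpus, theta):
    """log P(D | theta) = sum over t of C(t) * log P(t | theta).

    Raises ValueError if some t has probability 0 under theta.
    """
    return sum(c * log(substring_prob(t.split(), theta)) for t, c in corpus.items())


# Invented parameters and longest-match counts.
toy_theta = {"harry": 0.1, "potter": 0.2, "harry potter": 0.3}
toy_corpus = {"harry potter": 2, "potter": 3}
toy_p = substring_prob(["harry", "potter"], toy_theta)
toy_ll = log_likelihood(toy_corpus, toy_theta)
```

In the toy case, P(harry potter | θ) sums the one-segment reading (0.3) and the two-segment reading (0.1 × 0.2), giving 0.32.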
Evaluation – Data sets • Three human-segmented data sets, for training, validation, and testing, with 500 queries in each set • Segmented by three editors A, B, C
Evaluation – Metrics
• Example: w1 w2 w3 w4 w5 with boundary decisions Y N N Y
• Evaluation metrics:
  • Boundary classification accuracy
  • Whole query accuracy: the percentage of queries with perfect boundary classification accuracy
  • Segment accuracy: the percentage of segments being recovered
• Truth: [abc] [de] [fg]
• Prediction: [abc] [de fg]: precision
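The three metrics can be made concrete with a short sketch. It assumes that each letter in the slide's example is one word (so [abc] is a three-word segment) and that a predicted segment counts as correct only when it matches a true segment exactly; both are interpretations, and the values computed here are not figures reported in the talk.

```python
def boundaries(segments):
    """Word offsets at which a segmentation places a boundary."""
    cuts, pos = set(), 0
    for seg in segments[:-1]:
        pos += len(seg)
        cuts.add(pos)
    return cuts


def boundary_accuracy(truth, pred):
    """Fraction of between-word positions where truth and prediction agree."""
    n_words = sum(len(seg) for seg in truth)
    points = range(1, n_words)
    if len(points) == 0:
        return 1.0
    t, p = boundaries(truth), boundaries(pred)
    return sum((k in t) == (k in p) for k in points) / len(points)


def segment_precision_recall(truth, pred):
    """Precision and recall over exactly matching (start, end) segment spans."""
    def spans(segments):
        out, pos = set(), 0
        for seg in segments:
            out.add((pos, pos + len(seg)))
            pos += len(seg)
        return out
    t, p = spans(truth), spans(pred)
    hits = len(t & p)
    return hits / len(p), hits / len(t)


truth = [["a", "b", "c"], ["d", "e"], ["f", "g"]]
pred = [["a", "b", "c"], ["d", "e", "f", "g"]]
acc = boundary_accuracy(truth, pred)
precision, recall = segment_precision_recall(truth, pred)
whole_query_correct = truth == pred
```

Under these assumptions the prediction misses one of six boundary decisions, recovers one of its two predicted segments, and does not count as a whole-query match.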
Use for Improving Relevance • Phrase Proximity Boosting • Phrase Level Query Expansion
Phrase Proximity Boosting • Classifying a segment into one of three categories • Strong concept: no word reordering, no word insertion/deletion • Treat the whole segment as a single unit in matching and ranking • Weak concept: allow word reordering or deletion/insertion • Boost documents matching the weak concepts • Not a concept • Do nothing
Phrase Proximity Boosting • Concept based BM25 • Weighted by the confidence of concepts • Concept based min coverage • Weighted by the confidence of concepts
Phrase-Based Expansion • Phrase-level replacement • [San Francisco] -> [sf] • [red eye flight] -> [late night flight]
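Phrase-level replacement operates on whole segments rather than individual words, which a small sketch makes explicit. The rewrite table holds the two examples from the slide; the function name, the table format, and the policy of replacing one phrase at a time are assumptions for illustration and say nothing about how the production system chooses rewrites.

```python
def expand_query(segments, rewrites):
    """Return the original segmented query plus one variant per replaceable phrase."""
    variants = [list(segments)]
    for i, phrase in enumerate(segments):
        for alt in rewrites.get(phrase, []):
            # Swap the whole phrase; the other segments are left untouched.
            variants.append(segments[:i] + [alt] + segments[i + 1:])
    return variants


# Rewrite pairs taken from the slide, lower-cased.
phrase_rewrites = {
    "san francisco": ["sf"],
    "red eye flight": ["late night flight"],
}
expanded = expand_query(["san francisco", "red eye flight"], phrase_rewrites)
```

Because lookup is keyed on the full segment, "red" or "eye" alone is never rewritten, which is the point of segmenting the query first.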
Relevance Results • Significant relevance boosting • Affects 40% of query traffic • Significant DCG gain (1.5% for affected queries) • Significant online CTR gain (0.5% overall)
Conclusions • Data is important for query segmentation • Phrases are important for improving relevance
References
• Bergsma et al., EMNLP-CoNLL 2007
• Risvik et al., WWW 2003
• Hagen et al., SIGIR 2010
• Tan & Peng, WWW 2008
Parameter Estimation II
• Solution 1: Segment the web corpus offline, then collect counts for ngrams that occur as segments
• ... … | Harry Potter and the Goblet of Fire | is | the | fourth | novel | in | the | Harry Potter series | written by | J.K. Rowling | ... ...
  • harry potter and the goblet of fire += 1
  • potter and the goblet of += 0
• C. G. de Marcken, Unsupervised Language Acquisition, 96
• Fuchun Peng, Self-supervised Chinese Word Segmentation, IDA01
• Technical difficulties
Parameter Estimation III
• Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)
• Q = harry potter and the goblet of fire
• ... … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... ...
  • harry potter and the goblet of fire += 1
  • the += 2
  • harry potter += 1
Parameter Estimation III
• Solution 2: Online computation: only consider parts of the web corpus overlapping with the query (longest matches)
• Q = potter and the goblet
• ... … Harry Potter and the Goblet of Fire is the fourth novel in the Harry Potter series written by J.K. Rowling ... ...
  • potter and the goblet += 1
  • the += 2
  • potter += 1
• Directly compute longest matching counts using raw ngram frequency: O(|Q|^2)
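The notion of a longest match in Solution 2 can be illustrated by scanning a piece of text directly. This is a naive sketch that reads the text word by word; it is not the O(|Q|^2) computation from raw ngram frequencies mentioned on the slide, whose details are not given. The function name and the whitespace, lower-case tokenization are assumptions.

```python
from collections import Counter


def longest_match_counts(query, text):
    """Count maximal runs of `text` that are contiguous substrings of `query`."""
    q = query.lower().split()
    doc = text.lower().split()
    # Every contiguous substring of the query, as a tuple of words.
    substrings = {tuple(q[i:j]) for i in range(len(q)) for j in range(i + 1, len(q) + 1)}
    counts = Counter()
    prev_end = 0
    for i in range(len(doc)):
        # Longest query substring that starts at position i of the text.
        length = 0
        for l in range(min(len(q), len(doc) - i), 0, -1):
            if tuple(doc[i:i + l]) in substrings:
                length = l
                break
        end = i + length
        # Count the match only if it is not contained in a match that started earlier.
        if length > 0 and end > prev_end:
            counts[" ".join(doc[i:end])] += 1
        prev_end = max(prev_end, end)
    return counts


snippet = ("Harry Potter and the Goblet of Fire is the fourth novel in the "
           "Harry Potter series written by J.K. Rowling")
counts_full = longest_match_counts("harry potter and the goblet of fire", snippet)
counts_part = longest_match_counts("potter and the goblet", snippet)
```

On the snippet from the slides, the two calls give the same increments as the two examples: the shorter ngrams inside the full title are not counted separately, while the stand-alone occurrences of "the" and "harry potter" (or "potter") are.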