Improving Similarity Measures for Short Segments of Text
Scott Wen-tau Yih & Chris Meek, Microsoft Research
Query Suggestion
Query: mariners
How similar are they?
• mariners vs. seattle mariners
• mariners vs. 1st mariner bank
Keyword Expansion for Online Ads
Keyword: chocolate cigarettes
Candidate expansions: candy cigarettes, chocolate cigars, chocolate candy, old fashioned candy, nostalgic candy, novelty candy
How similar are they?
• chocolate cigarettes vs. cigarettes
• chocolate cigarettes vs. chocolate cigars
• chocolate cigarettes vs. old fashioned candy
Measuring Similarity
• Goal: create a similarity function fsim: (String1, String2) → R
• Rank suggestions
  • Fix String1 as q; vary String2 as s1, s2, …, sk
  • Whether the function is symmetric is not important
• For query suggestion – fsim(q, s)
  • fsim(“mariners”, “seattle mariners”) = 0.9
  • fsim(“mariners”, “1st mariner bank”) = 0.6
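As a toy illustration of how such a function is used to rank suggestions for a fixed query (the word-overlap measure below is only a placeholder, not one of the measures studied here):

```python
# Toy illustration: rank candidate suggestions for a fixed query q by fsim.
# The word-overlap measure is a placeholder, not one of the paper's measures.
def rank_suggestions(q, suggestions, fsim):
    """Sort candidate suggestions by fsim(q, s), highest first."""
    return sorted(suggestions, key=lambda s: fsim(q, s), reverse=True)

def toy_fsim(q, s):
    qw, sw = set(q.lower().split()), set(s.lower().split())
    return len(qw & sw) / max(len(sw), 1)

print(rank_suggestions("mariners", ["1st mariner bank", "seattle mariners"], toy_fsim))
# -> ['seattle mariners', '1st mariner bank']
```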
Enabling Useful Applications
• Web search
  • Ranking query suggestions
  • Segmenting web sessions using query logs
• Online advertising
  • Suggesting alternative keywords to advertisers
  • Matching similar keywords to show ads
• Document writing
  • Providing alternative phrasing
  • Correcting spelling errors
Challenges
• Short text segments may not overlap
  • “Microsoft Research” vs. “MSR” → 0 cosine score
• Ambiguous terms
  • “Bill Gates” vs. “Utility Bill” → 0.5 cosine score
  • “taxi runway” vs. “taxi” → 0.7 cosine score
• Text segments may rarely co-occur in the corpus
  • “Hyatt Vancouver” vs. “Haytt Vancover” → 1 page
  • Longer query → fewer pages
Our Contributions
• Web-relevance similarity measure
  • Represent the input text segments as real-valued term vectors using Web documents
  • Improve the term-weighting scheme based on relevant keyword extraction
• Learning similarity measure
  • Fit user preferences for the application better
  • Compare learning a similarity function vs. learning a ranking function
Outline
• Introduction
  • Problem, applications, challenges
• Our methods
  • Web-relevance similarity function
  • Combining similarity measures using learning
    • Learning a similarity function
    • Learning a ranking function
• Experiments on query suggestion
Web-relevance Similarity Measure
• Query expansion of x using a search engine
  • Let Dn(x) be the set of top-n documents
  • Build a term vector vi for each document di ∈ Dn(x)
    • Elements are scores representing the relevancy of the words in document di
  • C(x) = (1/n) Σi vi / ||vi||   (centroid of the L2-normalized vectors)
  • QE(x) = C(x) / ||C(x)||   (L2-normalized)
• Similarity score is simply the inner product
  • fsim(q, s) = QE(q) · QE(s)
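A minimal sketch of this construction in Python, assuming the per-document relevancy vectors have already been extracted (the search-engine retrieval step is omitted):

```python
# Sketch of the web-relevance representation: each of the top-n documents for
# a text segment x is a dict mapping words to relevancy scores.
import math
from collections import defaultdict

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {w: v / norm for w, v in vec.items()}

def qe_vector(doc_vectors):
    """QE(x): L2-normalized centroid of the L2-normalized document vectors."""
    centroid = defaultdict(float)
    n = len(doc_vectors)
    for vec in doc_vectors:
        for w, v in l2_normalize(vec).items():
            centroid[w] += v / n
    return l2_normalize(centroid)

def web_relevance_sim(qe_q, qe_s):
    """fsim(q, s) = inner product of the two QE vectors."""
    return sum(v * qe_s.get(w, 0.0) for w, v in qe_q.items())
```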
Web-kernel Similarity
• Relevancy = TFIDF [Sahami & Heilman ’06]
• Why TFIDF?
  • High TF: important or relevant to the document
  • High DF: stopwords or words in template blocks
  • A crude estimate of the importance of the word
• Can we do better than TFIDF?
Web-relevance Similarity
• Relevancy = Prob(relevance | wj, di)
• Keyword extraction can judge the importance of the words more accurately! [Yih et al. WWW-06]
  • Assign relevancy scores (probabilities) to words/phrases
  • Machine-learning model learned by logistic regression
  • Uses more than 10 categories of features
    • Query-log frequency (high-DF words may be popular queries)
    • The position of the word in the document
    • The format, hyperlink, etc.
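A hedged sketch of such a relevancy model; the feature names, document fields, and weights below are hypothetical illustrations, not the actual features of the keyword-extraction system:

```python
# Hedged sketch: relevancy scoring via logistic regression over word features.
# Feature names and weights are hypothetical illustrations only.
import math

def word_features(word, doc):
    return {
        "query_log_freq": doc["query_log_freq"].get(word, 0.0),   # hypothetical field
        "first_position": doc["first_pos"].get(word, 1.0),        # 0.0 = start of doc
        "in_hyperlink":   1.0 if word in doc["link_words"] else 0.0,
    }

def relevancy(word, doc, weights, bias=0.0):
    """Prob(relevance | word, doc) under a logistic regression model."""
    z = bias + sum(weights[f] * v for f, v in word_features(word, doc).items())
    return 1.0 / (1.0 + math.exp(-z))

weights = {"query_log_freq": 1.5, "first_position": -0.8, "in_hyperlink": 0.7}  # made up
doc = {"query_log_freq": {"mariners": 2.3}, "first_pos": {"mariners": 0.1},
       "link_words": {"mariners"}}
print(relevancy("mariners", doc, weights))
```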
Learning Similarity
• Similarity measures should depend on the application
  • q = “Seattle Mariners”, s1 = “Seattle”, s2 = “Seattle Mariners Ticket”
• Let human subjects decide what’s similar
• Parametric similarity function fsim(q, s | w)
  • Learn the parameters (weights) w from data
• Use machine learning to combine multiple base similarity measures
Base Similarity Measures
• Surface matching methods
  • Suppose Q and S are the sets of words in a given pair of query q and suggestion s
  • Measures such as Matching, Dice, and Cosine are computed from the overlap of Q and S
• Corpus-based methods
  • Web-relevance, Web-kernel, KL-divergence
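A short sketch of typical surface-matching measures over the word sets Q and S; Cosine, Dice, and Matching are the ones combined later, while Jaccard is shown only as another common variant (an assumption, not listed on the slide):

```python
# Surface-matching measures over the word sets Q and S.
import math

def matching(Q, S):
    return len(Q & S)

def dice(Q, S):
    return 2 * len(Q & S) / (len(Q) + len(S)) if Q or S else 0.0

def cosine(Q, S):
    return len(Q & S) / math.sqrt(len(Q) * len(S)) if Q and S else 0.0

def jaccard(Q, S):
    return len(Q & S) / len(Q | S) if Q | S else 0.0

Q, S = set("seattle mariners".split()), set("mariners".split())
print(matching(Q, S), dice(Q, S), cosine(Q, S))  # 1 0.666... 0.707...
```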
Learning Similarity Function
• Data: pairs of query and suggestion (qi, sj)
  • Label: relevance judgment (rel = 1 or rel = 0)
  • Features: scores on (qi, sj) provided by multiple base similarity measures
• We combine them using logistic regression
  • z = w1·Cosine(q,s) + w2·Dice(q,s) + w3·Matching(q,s) + w4·Web-relevance(q,s) + w5·KL-divergence(q,s) + …
  • fsim(q, s | w) = Prob(rel | q, s; w) = exp(z) / (1 + exp(z))
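A hedged sketch of this combination step using scikit-learn; the feature rows and relevance labels below are made-up toy data:

```python
# Learned similarity function: logistic regression over base similarity scores.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [cosine, dice, matching, web_relevance, kl_divergence] for one (q, s)
# pair; each label: 1 if the suggestion was judged relevant, else 0. Toy values.
X = np.array([[0.9, 0.8, 2, 0.85, 0.6],
              [0.1, 0.1, 0, 0.20, 0.1]])
y = np.array([1, 0])

model = LogisticRegression().fit(X, y)

def fsim(feature_row):
    """fsim(q, s | w) = Prob(rel | q, s; w)."""
    return model.predict_proba(np.array([feature_row]))[0, 1]
```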
Learning Ranking Function
• We compare suggestions sj, sk for the same query q
• Data: tuples of a query q and suggestions sj, sk
  • Label: [sim(q,sj) > sim(q,sk)] or [sim(q,sj) < sim(q,sk)]
  • Features: scores on pairs (q,sj) and (q,sk) provided by multiple base similarity measures
• Learn a probabilistic model using logistic regression
  • Prob([sim(q,sj) > sim(q,sk)] | q, sj, sk; w)
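A hedged sketch of the pairwise formulation; representing each (sj, sk) pair by the difference of the two feature vectors is a common construction assumed here, not necessarily the exact one used in the paper:

```python
# Pairwise ranking sketch: classify whether suggestion s_j should outrank s_k
# for the same query, using logistic regression on feature-vector differences.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pairwise_examples(feature_rows, labels):
    """Turn per-suggestion features/labels for one query into ranked pairs."""
    X, y = [], []
    for j in range(len(labels)):
        for k in range(len(labels)):
            if labels[j] != labels[k]:
                X.append(np.asarray(feature_rows[j]) - np.asarray(feature_rows[k]))
                y.append(1 if labels[j] > labels[k] else 0)
    return np.array(X), np.array(y)

# Toy data: two suggestions for one query, the first judged more relevant.
X, y = pairwise_examples([[0.9, 0.8, 0.85], [0.2, 0.1, 0.15]], [1, 0])
ranker = LogisticRegression().fit(X, y)
```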
Experiments
• Data: query suggestion dataset [Metzler et al. ’07]
  • |Q| = 122, |(Q,S)| = 4852; {Ex, Good} vs. {Fair, Bad}
• Results
  • 10-fold cross-validation
  • Evaluation metrics: AUC and Precision@k
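A small sketch of the two evaluation metrics, assuming binary relevance labels and predicted similarity scores for each (query, suggestion) pair:

```python
# AUC and Precision@k over toy scores; labels are 1 (relevant) / 0 (not).
from sklearn.metrics import roc_auc_score

def precision_at_k(labels_sorted_by_score, k):
    """Fraction of relevant suggestions among the top-k ranked ones."""
    return sum(labels_sorted_by_score[:k]) / k

labels = [1, 0, 1, 0]          # already sorted by predicted score, descending
scores = [0.9, 0.7, 0.6, 0.2]
print(roc_auc_score(labels, scores), precision_at_k(labels, 2))  # 0.75 0.5
```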
Conclusions
• Web-relevance
  • New term-weighting scheme from keyword extraction
  • Outperforms existing methods on query suggestion
• Learning similarity
  • Fits the application – better suggestion ranking
  • Learning a similarity function vs. learning a ranking function
• Future work
  • Experiment with alternative combination methods
  • Explore other probabilistic models for similarity
  • Apply our similarity measures to different tasks