Structured Annotations of Web Queries Author: N. Sarkas et al., SIGMOD 2010
Agenda • Motivation • Approach • Query Processing • Probabilistic Model
Motivation • Keyword search over a structured database • Users leverage their search-engine experience to search for records in a database • Problems • Keyword matching may return no results • Misinterpretation of the user's intent • Q="white tiger" • T=Shoes, Color=white, Shoe Line=tiger • T=Books, Title=white tiger • The animal white tiger (not in the database) • Must provide fast responses (in ms) for web users • Querying every database for every query is not efficient
Approach • Annotation • Annotation token for a table (AT) = (t, T.A) • Q=50 inch LG • (LG, TVs.Brand), (LG, Monitors.Brand), (50 inch, TVs.Diagonal) • Tokens of all types (nominal, ordinal, and numerical) are annotated against the tables • Numerical tokens are accepted within a certain range
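As a concrete illustration, here is a minimal Python sketch of mapping a query token to its annotation tokens. It is not from the paper; the table names, attribute names, and the dictionary-based index are assumptions for the example.

```python
# Hypothetical inverted index from a token to the (table, attribute)
# pairs it can annotate; a real system would also handle numeric ranges.
ANNOTATION_INDEX = {
    "lg": [("TVs", "Brand"), ("Monitors", "Brand"), ("Refrigerators", "Brand")],
    "50 inch": [("TVs", "Diagonal"), ("Monitors", "Diagonal"),
                ("Refrigerators", "Width")],
    "lcd": [("TVs", "Screen"), ("Monitors", "Screen")],
}

def annotation_tokens(token):
    """Return every annotation token (t, T.A) that token t can form."""
    return [(token, f"{table}.{attr}")
            for table, attr in ANNOTATION_INDEX.get(token.lower(), [])]

print(annotation_tokens("LG"))
# [('LG', 'TVs.Brand'), ('LG', 'Monitors.Brand'), ('LG', 'Refrigerators.Brand')]
```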
Overall Query Processing • Generate structured annotations for the keywords • Find the maximal annotations • Score the annotations
Query Processing • Generate structured annotations for a query • Keywords: k1, k2 • (T1, {(k1, T1.Att1), (…)}, {free token}), (T2, {(k1, T2.Att1), (…)}, {}) • Free token: a query keyword not associated with any attribute • Example "50 inch LG lcd" • (TVs, {(50 inch, TVs.Diagonal), (LG, TVs.Brand), (lcd, TVs.Screen)}, {}) • (Monitors, {(50 inch, Monitors.Diagonal), (LG, Monitors.Brand), (lcd, Monitors.Screen)}, {}) • (Refrig, {(50 inch, Refrig.Width), (LG, Refrig.Brand)}, {lcd})
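A sketch of this step in Python, under the same hypothetical index as the earlier sketch. Query segmentation into tokens such as "50 inch" is assumed to have happened already, and a full tagger would also enumerate every consistent token-to-attribute assignment per table; this only shows the grouping idea.

```python
from collections import defaultdict

# Same hypothetical token -> (table, attribute) index as the earlier sketch.
ANNOTATION_INDEX = {
    "lg": [("TVs", "Brand"), ("Monitors", "Brand"), ("Refrigerators", "Brand")],
    "50 inch": [("TVs", "Diagonal"), ("Monitors", "Diagonal"),
                ("Refrigerators", "Width")],
    "lcd": [("TVs", "Screen"), ("Monitors", "Screen")],
}

def generate_annotations(tokens):
    """For each candidate table, annotate whatever matches an attribute
    and leave the remaining query tokens as free tokens."""
    by_table = defaultdict(list)
    for tok in tokens:
        for table, attr in ANNOTATION_INDEX.get(tok.lower(), []):
            by_table[table].append((tok, f"{table}.{attr}"))
    annotations = []
    for table, annotated in by_table.items():
        covered = {t for t, _ in annotated}
        free = [t for t in tokens if t not in covered]
        annotations.append((table, annotated, free))
    return annotations

for s in generate_annotations(["50 inch", "LG", "lcd"]):
    print(s)
# TVs and Monitors annotate all three tokens with no free tokens;
# Refrigerators annotates '50 inch' and 'LG' and leaves 'lcd' free.
```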
Query Processing • Find the maximal annotations • For a given table, we want more annotated tokens and fewer free tokens • Annotation S = (T, AT, FT) is maximal if there is no S' = (T, AT', FT') s.t. AT ⊂ AT' and FT' ⊂ FT • AT: annotated tokens • FT: free tokens • Example • S1 = (TVs, {(LG, TVs.Brand), (lcd, TVs.Screen)}, {}) • S2 = (TVs, {(LG, TVs.Brand)}, {lcd}) • S3 = (TVs, {(50 inch, TVs.Diagonal), (LG, TVs.Brand)}, {lcd}) • S2 is not maximal (S1 annotates a superset of its tokens)
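The dominance test can be written directly from the definition. A small sketch; the tuple-and-list data layout is an assumption carried over from the earlier sketches.

```python
def is_maximal(s, candidates):
    """S = (T, AT, FT) is maximal if no other candidate on the same table
    annotates a superset of AT while leaving a subset of FT free."""
    table, at, ft = s
    for t2, at2, ft2 in candidates:
        if t2 == table and (at2, ft2) != (at, ft) \
                and set(at) <= set(at2) and set(ft2) <= set(ft):
            return False
    return True

S1 = ("TVs", [("LG", "TVs.Brand"), ("lcd", "TVs.Screen")], [])
S2 = ("TVs", [("LG", "TVs.Brand")], ["lcd"])
S3 = ("TVs", [("50 inch", "TVs.Diagonal"), ("LG", "TVs.Brand")], ["lcd"])
print([is_maximal(s, [S1, S2, S3]) for s in (S1, S2, S3)])
# [True, False, True] -- S1 dominates S2, so S2 is discarded
```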
Query Processing • Scoring annotations • Intuition • Query: LG 30 inch screen • Want: TVs, Monitors • Dislike • DVD Players • There is no DVD player with a screen in the database • People don't query the size of a DVD player • Cell phones • The screen sizes in the database are significantly smaller • A probabilistic model is chosen for the scoring
Probabilistic Model • Generative probabilistic model • If the user targets a table, which words is the user likely to use, and with what probability? • P(T.A): the probability that the user targets table T and the subset A of its attributes • Given the attributes, the user selects tokens with some probability • T.Af: the attributes of table T plus a pseudo-attribute for free tokens • Example "LG 30 inch screen" • Need to simplify the equation
Probabilistic Model • Assumption 1 • Annotated and free tokens are independent • Assumption 2 • The user depends on the table to choose free tokens • The user depends on the attributes of the table to choose annotated tokens, not on the free tokens • Under these assumptions: P(S) = P(T.Af) · ∏_{(t, T.Aj) ∈ AT} P(t | T.Aj) · ∏_{t ∈ FT} P(t | T) (2) • Si: for query q, the candidate annotations are Sq = {S1, …, Sk}
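Equation (2) becomes a straightforward product once the three distributions are available. A sketch; the dictionary-based parameter lookups are assumptions, and estimating the parameters themselves is the subject of the EM slides below.

```python
def annotation_score(s, p_template, p_attr_token, p_free_token):
    """P(S) per equation (2): the template prior times the probability of
    each annotated token given its attribute, times the probability of
    each free token given the table."""
    table, at, ft = s
    attrs = frozenset(a for _, a in at)
    p = p_template[(table, attrs)]        # P(T.Af)
    for tok, attr in at:
        p *= p_attr_token[(tok, attr)]    # P(t | T.Aj)
    for tok in ft:
        p *= p_free_token[(tok, table)]   # P(t | T)
    return p
```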
Probabilistic Model • Equation (2) assumes that every query targets some table in the data collection • Not true. Ex: Q = "green apple" • Annotation: green = color, apple = brand • Couldn't "green apple" mean a fruit? • Approach • Open Language Model (OLM) table: captures open-world queries (ex: built from the query log of Bing) • Sq = {S1, …, Sk, Sk+1}, where Sk+1 = SOLM • SOLM = (OLM, {FTq}): all query tokens are treated as free tokens • P(SOLM) = P(OLM) · ∏_{t ∈ q} P(t | OLM) (3) • An annotation is kept as plausible only if it scores higher than SOLM
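The OLM annotation scores the whole query as free text, which by equation (3) is just a unigram product, and annotations that cannot beat it are discarded as implausible. A sketch; the smoothing floor for unseen tokens is an assumption.

```python
def olm_score(query_tokens, p_olm, olm_unigram):
    """P(S_OLM) per equation (3): every query token is a free token of the
    open-language-model table, drawn from a query-log unigram model."""
    p = p_olm
    for t in query_tokens:
        p *= olm_unigram.get(t, 1e-9)  # assumed smoothing for unseen tokens
    return p

def plausible(annotations, scores, s_olm_score):
    """Keep only annotations that explain the query better than the OLM."""
    return [s for s, p in zip(annotations, scores) if p > s_olm_score]
```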
Probabilistic Model • We now have two scoring formulas • P(S) for structured annotations (2) • P(SOLM) for free-text queries (3) • What's next? • Maximize the probability • Simplify the equations where necessary • Build a system based on the model
The Probabilistic Model • Consider a query from the web query log • It is either formulated through an annotation or is a free-text query • P(q) = Σi P(Si) + P(SOLM) • P(t | T) = UMT(t) • UMT: a unigram model over all possible names and values that can be associated with table T • UMT(t) serves as a confidence level for attaching free token t to table T • Ex: FT = computer, T1 = Monitors, T2 = TVs: "computer" is a more plausible free token for Monitors than for TVs
Probabilistic Model • Given • Observed data: the web query log • Model: P(q) = Σi P(Si) + P(SOLM) • To find • The parameters P(T.Af) and P(OLM) that maximize the likelihood of the observed queries
Expectation-Maximization (EM) • Initial step • Select initial values for the parameters P(T.Af) and P(OLM) • Repeat • Expectation step • Based on the current parameters, estimate the probability that each query was generated by each annotation • Maximization step • Based on these estimates, choose new parameter values that maximize the expected likelihood
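A compact sketch of this EM loop. Here each query is pre-tagged with its candidate generators (annotations or the OLM) and the fixed token probabilities P(tokens | generator); only the prior weights are re-estimated. The names and data layout are assumptions.

```python
def em_mixture_weights(query_log, n_iters=20):
    """query_log: list of queries; each query is a list of
    (generator_id, p_tokens) pairs, where p_tokens = P(tokens | generator)
    is precomputed from table statistics or the OLM unigram model."""
    gens = {g for q in query_log for g, _ in q}
    weight = {g: 1.0 / len(gens) for g in gens}   # uniform initial step
    for _ in range(n_iters):
        # E-step: posterior responsibility of each generator for each query.
        resp = {g: 0.0 for g in gens}
        for q in query_log:
            z = sum(weight[g] * p for g, p in q)
            for g, p in q:
                resp[g] += weight[g] * p / z
        # M-step: new weights proportional to total responsibility.
        total = sum(resp.values())
        weight = {g: r / total for g, r in resp.items()}
    return weight
```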
Probabilistic Model • P(AT | T.A) = |T(AT.V)| / |T|: the fraction of the entries in table T that take the values AT.V • Q = "50 inch LG lcd" • S = (TVs, {(LG, TVs.Brand), (50 inch, TVs.Diagonal)}, {lcd}) • T.A = {Brand, Diagonal} • T(AT.V) = all records in TVs with brand LG and diagonal size 50 inch • [offline] Precompute a mapping from value combinations to the number of matching records in the table
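This fraction reduces to a counter lookup if value counts are precomputed offline. A toy sketch; the three-row TVs table is fabricated for the example, and enumerating every attribute subset is exponential, so a real system would index more selectively.

```python
from collections import Counter
from itertools import combinations

def build_value_counts(rows, attrs):
    """[offline] Map each attribute subset and value combination to the
    number of matching records, so P(AT | T.A) = |T(AT.V)| / |T| becomes
    a constant-time lookup at query time."""
    counts = {}
    for r in range(1, len(attrs) + 1):
        for subset in combinations(attrs, r):
            counts[subset] = Counter(tuple(row[a] for a in subset)
                                     for row in rows)
    return counts

tvs = [  # toy TVs table, invented for the example
    {"Brand": "LG", "Diagonal": "50 inch"},
    {"Brand": "LG", "Diagonal": "42 inch"},
    {"Brand": "Sony", "Diagonal": "50 inch"},
]
counts = build_value_counts(tvs, ["Brand", "Diagonal"])
p = counts[("Brand", "Diagonal")][("LG", "50 inch")] / len(tvs)
print(p)  # 1/3 of the TV records are LG with a 50 inch diagonal
```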
Probabilistic Model • Maximum likelihood estimation • Given • A set of observed data X = {x1, x2, …} • A proposed model with parameter θ • To find • The parameter θ that maximizes the likelihood P(X | θ) • Rephrased for our setting • Given • Observed data: the web query log • Model: P(q) = Σi P(Si) + P(SOLM) • To find • The parameters P(T.Af) and P(OLM) that maximize the likelihood of the log • The EM algorithm can be used to solve this • Likelihood vs. probability • Probability: "If I were to flip a fair coin 100 times, what is the probability of it landing heads-up every time?" • Likelihood: "Given that I have flipped a coin 100 times and it has landed heads-up 100 times, what is the likelihood that the coin is fair?" (source: Wikipedia)
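The coin example can be made concrete in a few lines; the numbers are chosen only to illustrate the contrast.

```python
# Probability fixes the model and asks about the data; likelihood fixes
# the observed data (100 heads) and compares candidate models.
prob_100_heads_if_fair = 0.5 ** 100   # probability question
likelihood_fair = 0.5 ** 100          # likelihood of a fair coin
likelihood_biased = 0.99 ** 100       # likelihood of a 99%-heads coin
print(likelihood_biased / likelihood_fair)
# ~4.6e29: the heavily biased coin explains the data far better
```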
Expectation-Maximization (EM) • An iterative algorithm with 2 steps • Expectation step • Using the current parameter estimates, calculate the expected value Q of the log-likelihood function • Maximization step • Find the parameters that maximize Q