Query Suggestion
Naama Kraus
Slides are based on the papers:
• Baeza-Yates, Hurtado, Mendoza, Improving search engines by query clustering
• Boldi, Bonchi, Castillo, Donato, Vigna, The Query Flow Graph: Model and Applications
The Problem
• User queries are an imperfect description of the user's information need
• Examples:
  • Ambiguous queries: jaguar
  • General queries: haifa
  • Terminology differences (synonyms) between user and corpus: stars vs. planets
Query Suggestions
• Help the user phrase her information need
• Example: for the query jaguar, suggest
  • Jaguar car
  • Jaguar xf
  • Jaguar animal
  • Jaguar cat
Query suggestion algorithms
• Query suggestions are extracted from the query log
  • Some methods use other data sources, such as a corpus; these are not covered today
• Topic (cluster) based: identify groups of similar queries
• Sequence based: mine and analyze the query log for likely query sequences
Improving Search Engines by Query Clustering - Baeza-Yates et al.
• Algorithm outline
• Offline:
  • Represent queries as term-weighted vectors
  • Cluster queries
  • Rank queries in each cluster
• Online:
  • Given the user's query q
  • Find the cluster C containing q
  • Suggest the top k queries in cluster C, based on their rank and their similarity to q
Query Model
• Given a query q
• Let U be the set of URLs clicked for q (over all users and sessions)
  • This information is extracted from the query log
• q's term-weighted vector has a nonzero entry for every term that appears in some URL in U
• Terms are weighted according to term frequency and URL popularity
  • Formula on the next slide
Query Model (2)
• [Term-weight formula: not preserved in this transcript]
• Pop(u,q): the number of clicks on u for the query q
• Note: the paper proposes a refinement to Pop(u,q) that is not biased by the search engine's ranking
• Query similarity is computed by some measure, e.g. cosine similarity
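The query model can be sketched in a few lines of Python. This is only a sketch under stated assumptions: the formula image is missing from this transcript, so the code assumes the form given in the Baeza-Yates et al. paper, where the weight of term t_i is the sum over clicked URLs u of Pop(u,q) · Tf(t_i,u) / max_t Tf(t,u). The function names and the `clicks` and `url_terms` inputs are hypothetical, and the refinement to Pop(u,q) mentioned in the note is not modeled.

```python
import math
from collections import Counter, defaultdict


def query_vector(clicks, url_terms):
    """Build a term-weighted vector for one query.

    clicks:    {url: number of clicks on url for this query}  (Pop(u,q))
    url_terms: {url: list of terms appearing in that URL's text}
    Assumed weighting: sum over u of Pop(u,q) * Tf(t,u) / max_t Tf(t,u).
    """
    vec = defaultdict(float)
    for url, pop in clicks.items():
        tf = Counter(url_terms.get(url, []))
        if not tf:
            continue  # no known terms for this URL
        max_tf = max(tf.values())
        for term, count in tf.items():
            vec[term] += pop * count / max_tf
    return dict(vec)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two queries that lead users to click on pages with similar vocabulary end up with similar vectors, even when the query strings themselves share no words.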
Query Support
• The fraction of the documents returned by the query that captured users' attention (clicked documents)
• Indicates how 'good' a query is
• A 'global score'
• Queries within a cluster are ranked according to their similarity to q as well as their support
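A minimal sketch of support and in-cluster ranking follows. The slide does not say how similarity and support are combined, so the product used here is purely an assumption for illustration, and the function names and input shapes are hypothetical.

```python
def support(returned_docs, clicked_docs):
    """Fraction of the returned documents that were clicked."""
    returned = set(returned_docs)
    if not returned:
        return 0.0
    return len(returned & set(clicked_docs)) / len(returned)


def rank_cluster(candidates, k=3):
    """Return the top-k queries of a cluster.

    candidates: list of (query, similarity_to_q, support) tuples.
    Assumption: the two criteria are combined by multiplying them.
    """
    ordered = sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)
    return [query for query, _, _ in ordered[:k]]
```

With this combination, a query that is very similar to q but rarely leads to clicks can be outranked by a somewhat less similar query with high support.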
Query Flow Graph – Boldi et al.
• Main idea:
  • Aggregate the (massive) raw data in the query log: many queries from many users
  • Model user query behavior
  • Use graph-based techniques to infer query relatedness
Query Flow Graph Model
• G = (V, E, w) is a directed graph where:
  • V is the set of nodes, one per distinct query in the set Q
    • Queries are extracted from the query log
  • E is a set of directed edges
    • Two queries q, q' are connected by an edge (q, q') if q' follows q in at least one session
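The node and edge sets can be sketched as follows. The sketch assumes that sessions are already available as ordered lists of query strings and that "follows" means "immediately follows"; both are assumptions, and `build_graph` is a hypothetical name.

```python
def build_graph(sessions):
    """Build (V, E) from query-log sessions.

    sessions: list of sessions, each an ordered list of query strings.
    An edge (q, q_next) is added when q_next immediately follows q
    in at least one session.
    """
    nodes = set()
    edges = set()
    for session in sessions:
        nodes.update(session)
        for q, q_next in zip(session, session[1:]):
            edges.add((q, q_next))
    return nodes, edges
```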
QFG Illustration
• Nodes are queries
• Edges connect queries
[Figure: example query flow graph with nodes q0 to q5; the queries shown include "apple ipod" and "apple store"]
Weighting Function
• w : E → (0, 1] is a weighting function that assigns a weight to every edge (q, q')
• The weight of an edge (q, q') is the probability that q' follows q in the same session
• Probabilities are estimated from the sessions observed in the query log
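One simple way to estimate such weights is by relative frequency: count how often q' immediately follows q and divide by the number of times q is followed by anything. The slide does not specify the estimator, so this choice is an assumption, and `edge_weights` is a hypothetical name.

```python
from collections import Counter


def edge_weights(sessions):
    """Estimate w(q, q_next) as a relative frequency over sessions.

    sessions: list of sessions, each an ordered list of query strings.
    Every observed edge gets a weight in (0, 1], and the weights of the
    edges leaving a node sum to 1.
    """
    pair_counts = Counter()
    out_counts = Counter()
    for session in sessions:
        for q, q_next in zip(session, session[1:]):
            pair_counts[(q, q_next)] += 1
            out_counts[q] += 1
    return {
        (q, q_next): count / out_counts[q]
        for (q, q_next), count in pair_counts.items()
    }
```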
Illustration
[Figure: query flow graph over nodes q0 to q5 with weighted edges; weights shown include 0.1, 0.2, 0.25, 0.35, 0.5, 0.55, 0.8 and 1.0]
Random walk on the QFG
• A random surfer executes a random walk on the graph as follows:
  • Start at some node
  • With probability d, move along an outgoing edge
    • Choose the edge according to its probability (weight)
  • Otherwise, with probability 1-d, teleport to a random node
    • Choose the node uniformly at random
• The stationary distribution
  • The probability of being at node q in the limit, as the number of steps goes to infinity
• Random walk score vector: the queries' absolute scores
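The stationary distribution can be approximated by repeatedly applying one step of the walk to a score vector (power iteration). The sketch below is not taken from the paper: the default d = 0.85, the fixed iteration count, and the treatment of nodes without outgoing edges (their mass is spread uniformly) are all assumptions.

```python
def random_walk_scores(nodes, weights, d=0.85, iters=100):
    """Approximate the stationary distribution of the walk with teleport.

    nodes:   iterable of query strings
    weights: {(q, q_next): weight} for every edge
    d:       probability of following an edge (assumed default)
    """
    nodes = sorted(nodes)
    n = len(nodes)
    out = {q: {} for q in nodes}
    for (q, q_next), w in weights.items():
        out[q][q_next] = w

    score = {q: 1.0 / n for q in nodes}
    for _ in range(iters):
        # Teleport: every node receives an equal share of (1 - d).
        nxt = {q: (1.0 - d) / n for q in nodes}
        for q in nodes:
            total = sum(out[q].values())
            if total == 0.0:
                # Assumption: a node with no outgoing edges jumps uniformly.
                for t in nodes:
                    nxt[t] += d * score[q] / n
            else:
                for t, w in out[q].items():
                    nxt[t] += d * score[q] * w / total
        score = nxt
    return score
```

The resulting scores sum to 1, and queries that many sessions lead into receive the highest absolute scores.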
Random Walk Relative to a Node
• Random walk with restart to a single node:
  • Start at node q
  • Instead of teleporting to any node, always teleport to q
• The score of node q' for this random walk measures the relatedness of q' to q
  • The probability of being at q' in the limit, for a walk that starts and restarts at q
• A node's relative score can be normalized by its absolute score
  • Somewhat similar to tf-idf: it demotes highly popular queries that are not related to q
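Both walks differ only in where the surfer teleports, so a single routine parameterized by a teleport distribution can compute the absolute and the relative scores. This is a sketch under assumptions: d = 0.85, a fixed number of iterations, and nodes without outgoing edges jumping according to the teleport distribution are choices made here, and `walk` and `relatedness` are hypothetical names.

```python
def walk(nodes, weights, teleport, d=0.85, iters=200):
    """Power iteration for a random walk with a given teleport distribution."""
    out = {q: {} for q in nodes}
    for (q, q_next), w in weights.items():
        out[q][q_next] = w
    score = dict(teleport)
    for _ in range(iters):
        nxt = {q: (1.0 - d) * teleport[q] for q in nodes}
        for q in nodes:
            total = sum(out[q].values())
            if total == 0.0:
                # Assumption: a dead-end node jumps like a teleport.
                for t in nodes:
                    nxt[t] += d * score[q] * teleport[t]
            else:
                for t, w in out[q].items():
                    nxt[t] += d * score[q] * w / total
        score = nxt
    return score


def relatedness(nodes, weights, q, d=0.85):
    """Scores relative to q, raw and normalized by the absolute scores."""
    nodes = sorted(nodes)
    uniform = {n: 1.0 / len(nodes) for n in nodes}
    restart = {n: (1.0 if n == q else 0.0) for n in nodes}
    absolute = walk(nodes, weights, uniform, d)
    relative = walk(nodes, weights, restart, d)
    normalized = {n: relative[n] / absolute[n] for n in nodes}
    return relative, normalized
```

Queries that cannot be reached from q get a relative score of exactly zero, while the normalization lowers the rank of queries that score highly from every starting point.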
The Full Picture
• Off-line stage
  • For each node q in the graph
    • Compute the stationary distribution vector of q
      • A random walk score relative to q
    • Store suggestions for q; alternatives:
      • the top k scored nodes
      • nodes with a score above some threshold
• On-line stage
  • User submits query q
  • Suggest the queries stored for q
    • These are the queries most related to q
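The two stages can be sketched as a precomputed lookup table. The scoring routine is passed in as a hypothetical `relative_score` callable that returns, for a query q, a dict of scores relative to q; excluding q from its own suggestions and breaking ties alphabetically are assumptions of this sketch.

```python
def build_suggestion_table(nodes, relative_score, k=3, threshold=None):
    """Off-line stage: store suggestions for every query.

    relative_score: callable q -> {candidate: score relative to q}
    If threshold is given, keep every candidate scoring above it;
    otherwise keep the top k candidates.
    """
    table = {}
    for q in nodes:
        scores = relative_score(q)
        candidates = [(s, c) for c, s in scores.items() if c != q]
        # Highest score first; ties broken alphabetically.
        candidates.sort(key=lambda sc: (-sc[0], sc[1]))
        if threshold is not None:
            kept = [c for s, c in candidates if s > threshold]
        else:
            kept = [c for _, c in candidates[:k]]
        table[q] = kept
    return table


def suggest(table, q):
    """On-line stage: return the suggestions stored for q, if any."""
    return table.get(q, [])
```

Because all random walks run off-line, answering a query at serving time is a single dictionary lookup.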