Personalized Query Expansion for the Web P. Chirita, C. S. Firan, & W. Nejdl Published in SIGIR 07
Introduction • Web query reformulation by exploiting the user's Personal Information Repository (PIR) • The Desktop (as a PIR) is a rich repository of information about the user's interests. • Keyword-, expression-, and summary-based expansion techniques are proposed.
Previous Work • Personalized Search • User profiles: • e.g., user profiling based on browsing history • Requires server-side storage of all personal information, raising privacy concerns. • The actual search algorithm • Build the personalization aspect directly into PageRank (biased toward a target set of pages)
Previous Work • Automatic Query Expansion • Exploiting various social or collection-specific characteristics to generate additional terms • Relevance Feedback Techniques • TF, DF, summarization • Co-occurrence-Based Techniques • Highly co-occurring terms and terms in lexical affinity relationships are added. • Thesaurus-Based Techniques: WordNet • Terms closely related in meaning are added.
Expanding with Local Desktop Analysis • TF • DF • Given the set of top-K relevant Desktop documents: • Generate their snippets, focused on the original search request • Identify the set of candidate terms • Order them according to their associated DF scores • nrWords: the total number of terms in the document • pos: the position of the first appearance of the term
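A minimal sketch of this keyword-selection step, assuming a TF-based weighting of the form (1/2 + 1/2 · TF/maxTF) · log(nrWords/pos), which matches the nrWords and pos definitions above; the function name and tokenization are illustrative, not the paper's implementation:

```python
import math
import re
from collections import Counter

def score_expansion_terms(desktop_docs, query_terms, top_k=5):
    """Score candidate expansion terms from the top-k relevant Desktop
    documents; terms appearing often and early in a document score higher."""
    scores = Counter()
    for doc in desktop_docs[:top_k]:
        words = re.findall(r"[a-z]+", doc.lower())
        if not words:
            continue
        nr_words = len(words)              # total number of terms in the document
        tf = Counter(words)
        max_tf = max(tf.values())
        for term, freq in tf.items():
            if term in query_terms:
                continue                   # do not re-suggest the original query terms
            pos = words.index(term) + 1    # position of the term's first appearance
            scores[term] += (0.5 + 0.5 * freq / max_tf) * math.log(nr_words / pos)
    return [t for t, _ in scores.most_common(10)]
```

The DF variant would instead count, for each candidate term, how many Desktop documents (or snippets) contain it and order the candidates by that count.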
Lexical Compounds • Use simple noun analysis • Sentence Selection • Identify the set of relevant Desktop documents • Generate a summary containing their most important sentences • Threshold
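A rough sketch of the sentence-selection summary under assumed details: sentences are scored by their overlap with the query and kept only above a threshold; the scoring function and the threshold value are assumptions, not the paper's exact procedure:

```python
import re

def summarize(document, query_terms, threshold=0.1):
    """Keep the sentences of a relevant Desktop document whose query
    overlap exceeds a threshold; the kept sentences form the summary."""
    sentences = re.split(r"(?<=[.!?])\s+", document)
    summary = []
    for sentence in sentences:
        words = re.findall(r"[a-z]+", sentence.lower())
        if not words:
            continue
        overlap = sum(w in query_terms for w in words) / len(words)
        if overlap >= threshold:
            summary.append(sentence)
    return " ".join(summary)
```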
Term Co-occurrence Statistics • Cosine Similarity • Mutual Information • Likelihood Ratio
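Sketches of the three co-occurrence coefficients in their standard document-frequency form (the paper's exact normalizations may differ); here df_a, df_b, and df_ab are the document frequencies of the two terms and of their co-occurrence, and n_docs is the collection size:

```python
import math

def _entropy(*counts):
    """Helper term for Dunning's log-likelihood ratio."""
    total = sum(counts)
    return sum(c * math.log(c / total) for c in counts if c > 0)

def cosine_similarity(df_a, df_b, df_ab):
    # CS = DF(a,b) / sqrt(DF(a) * DF(b))
    return df_ab / math.sqrt(df_a * df_b)

def mutual_information(df_a, df_b, df_ab, n_docs):
    # MI = log( N * DF(a,b) / (DF(a) * DF(b)) )
    return math.log(n_docs * df_ab / (df_a * df_b))

def likelihood_ratio(df_a, df_b, df_ab, n_docs):
    """Dunning's log-likelihood ratio over the 2x2 co-occurrence table."""
    k11 = df_ab                          # documents containing both terms
    k12 = df_a - df_ab                   # only the first term
    k21 = df_b - df_ab                   # only the second term
    k22 = n_docs - df_a - df_b + df_ab   # neither term
    return 2 * (_entropy(k11, k12, k21, k22)
                - _entropy(k11 + k12, k21 + k22)
                - _entropy(k11 + k21, k12 + k22))
```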
Experiments • 4 queries were chosen per subject: • One very frequent AltaVista query • One randomly selected log query • One self-selected specific query • One self-selected ambiguous query • Collect the top-5 URLs generated by 20 versions of the algorithm and shuffle them; each subject assessed about 325 documents over the 4 queries • Each result is rated from 0 to 2. • Rankings were assessed with NDCG (Normalized Discounted Cumulative Gain) • A t-test was performed for significance.
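A small sketch of the NDCG computation used for assessment, assuming the common exponential-gain formulation over the 0-2 ratings (the paper may use a different gain or discount):

```python
import math

def ndcg(ratings):
    """NDCG for one ranked result list of graded judgments (0, 1, or 2)."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

# Example: top-5 results for one query, as rated by a subject
print(ndcg([2, 0, 1, 2, 0]))   # ~0.89; closer to 1 means a better ranking
```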
Algorithms Tested • Baseline: Google • TF, DF • LC, LC[O]: Lexical Compounds, regular and optimized (considering only the top compound) • SS: Sentence Selection • TC[CS], TC[MI], TC[LR]: Term Co-occurrence Statistics using respectively Cosine Similarity, Mutual Information, and Likelihood Ratio as similarity coefficients • WN[SYN], WN[SUB], WN[SUP]: WordNet with synonyms, sub-concepts, and super-concepts
Adaptivity • Query Scope • Query Clarity: ClarityScore = Σ_w P(w|Q) · log( P(w|Q) / P(w) ) • P(w|Q): the probability of the word w within the submitted query • P(w): the probability of w within the entire collection of documents
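A minimal sketch of the two adaptivity signals, assuming maximum-likelihood estimates for P(w|Q) and P(w); the function names and the base of the logarithm are assumptions:

```python
import math
from collections import Counter

def query_clarity(query_terms, collection_tokens):
    """Clarity = sum over query words of P(w|Q) * log(P(w|Q) / P(w));
    higher values indicate a less ambiguous query."""
    q_tf, q_len = Counter(query_terms), len(query_terms)
    c_tf, c_len = Counter(collection_tokens), len(collection_tokens)
    clarity = 0.0
    for w, c in q_tf.items():
        p_w_q = c / q_len                 # probability of w within the submitted query
        p_w = c_tf.get(w, 0) / c_len      # probability of w within the entire collection
        if p_w > 0:
            clarity += p_w_q * math.log2(p_w_q / p_w)
    return clarity

def query_scope(docs_matching_query, total_docs):
    """Scope penalizes queries matched by a large fraction of the collection."""
    return -math.log(docs_matching_query / total_docs)
```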
Query Formulation Process • The newly added terms are more likely to convey information about the user's search goals • Give more weight to the new keywords when building the expanded query
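A hypothetical illustration of weighting the new keywords more heavily than the original ones when the expanded query is issued; the Lucene-style term^boost syntax and the boost value are assumptions, not the paper's formulation:

```python
def formulate_query(original_terms, expansion_terms, boost=1.5):
    """Build an expanded query string in which the newly added keywords
    carry a higher weight than the original query terms."""
    weighted = [f"{t}^{boost}" for t in expansion_terms]
    return " ".join(list(original_terms) + weighted)

# formulate_query(["jaguar"], ["car", "automobile"]) -> 'jaguar car^1.5 automobile^1.5'
```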
Application to the Project • News articles collected by the user can be treated as the user's Desktop, so their algorithms can be applied to our system.