Information Retrieval in Context
Presenter: Xuehua Shen (xshen@uiuc.edu)
Xuehua Shen @CS, UIUC

Presentation Layout
• Problem Description
• Terminology
• Challenges
• IntelliZap System [WWW2001]
• Concerns

Problem
• Search engines have become a key source of information
    1998 [GVU WWW Study]: 85% of people use search engines to locate information
    Now [Craig’s Talk]: 500 million searches on the Internet per day, 150 million searches at Google per day
• Efforts focus on coverage and relevance

Web Search Facts
• 3-5 billion web pages on the Web: the Web provides huge and diverse information
• 1.7 words per query on average [Eric Brewer, CACM 09/2002]: users provide little information
• Can a search engine retrieve web pages well from so little information?

Context
• Context may provide extra information that helps improve the relevance of search results
• An example: searching for "flowers" [DirectHit 1999]
    Men: typically want sites that let them send flowers
    Women: often want sites that let them order flower seeds or plants for gardening purposes
• What context information is useful?

Terminology
• Ephemeral context: gathered within a single search session
    Examples: category [Inquirus2], document being viewed [Watson], feedback
• Persistent context: accumulated over time and used in subsequent sessions
    Examples: user profile [My Yahoo!], query history and clickthrough data [Google]

Terminology cont.
• Personalization: the search engine uses context information to provide different search results for different users
• Customization: users manually configure their preferences

Challenges
• How to capture and store useful information?
• SearchPad [WWW2001]:
    Server-proxy-client architecture
    Users explicitly mark relevant pages
• Any shortcomings? Better ways?

Challenges cont.
• There are many retrieval models and many user models, but how should they be merged?
• Croft uses a language model to represent context

Challenges cont.
• How to build such a system? Architecture: server side or client side? User interface?
• Server side: scalability, privacy
• Client side: communication of context information with the server

Challenges cont.
• How to evaluate such work? Metrics?
• HARD (High Accuracy Retrieval from Documents) track at TREC, added this year: leverages additional information about the searcher and/or the search context

IntelliZap – General Description
• Assumption: a large fraction of searches originate while users are reading documents on their computers
• Standpoint: context is the body of words surrounding a user-selected phrase
• IntelliZap system: a meta search engine with context-based query augmentation, search engine selection, and reranking

Walkthrough of IntelliZap

How to use Context
• Augment the query before sending it to the search engines
• Rerank the results returned by the search engines

How to collect the right amount of context
• Do not include the whole document, as the Watson system does
• Heuristic 1: establish the optimal context length as a function of the length of the text phrase and individual frequencies
• Heuristic 2: weight the text and the context relative to each other in the augmented query
    Emphasize the marked text phrase
    Weight of a context word: a monotonic function of its proximity to the text
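
Heuristic 2 can be illustrated with a small sketch. The slides only say that the weight is a monotonic function of proximity; the exponential decay, the `decay` parameter, and the function name `weight_terms` below are assumptions made for illustration, not the IntelliZap formula.

```python
def weight_terms(tokens, sel_start, sel_end, decay=0.9, text_weight=1.0):
    """Return (token, weight) pairs for the marked text and its context.

    tokens[sel_start:sel_end] is the user-marked text phrase.
    Exponential decay is an assumed choice; any monotonic function would fit.
    """
    weighted = []
    for i, tok in enumerate(tokens):
        if sel_start <= i < sel_end:
            # Marked text always gets the highest weight.
            weighted.append((tok, text_weight))
        else:
            # Distance in words to the nearest edge of the marked text (>= 1).
            dist = sel_start - i if i < sel_start else i - sel_end + 1
            weighted.append((tok, text_weight * decay ** dist))
    return weighted


tokens = "cheap flights to jaguar dealers in town".split()
for tok, w in weight_terms(tokens, 3, 4):
    print(f"{tok}: {w:.3f}")
```

With `decay` below 1, every context word weighs less than the marked phrase, and the weight falls off the further the word is from it.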

Algorithm Overview

Step 0: Semantic Network
• Build the semantic network offline: a statistics-based semantic network
• Linear combination of a vector-based correlation metric and a WordNet-based metric

Semantic Network cont.
• Vector-based correlation metric:
    27 knowledge domains (computer, business, etc.)
    10,000 sample documents from the Internet
    Each word: a 27-dimensional vector
    Correlation is used to measure distance
• WordNet: captures semantic relations between words (hypernymy, hyponymy, meronymy, and holonymy)
    WordNet: http://www.cogsci.princeton.edu/~wn/
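
A minimal sketch of how the two metrics might be combined. The slides do not give the mixing weights or the exact correlation formula, so Pearson correlation, the parameter `alpha`, and the idea of passing the WordNet-based score in as a precomputed number are all assumptions. Short vectors stand in for the 27-dimensional domain vectors.

```python
import math


def correlation(u, v):
    """Pearson correlation between two domain vectors of equal length."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    if std_u == 0 or std_v == 0:
        return 0.0  # a constant vector carries no domain signal
    return cov / (std_u * std_v)


def semantic_similarity(u, v, wordnet_score, alpha=0.5):
    """Linear combination of the vector metric and a WordNet-based score.

    alpha is a hypothetical mixing weight; wordnet_score is assumed to be
    computed elsewhere and scaled to a comparable range.
    """
    return alpha * correlation(u, v) + (1 - alpha) * wordnet_score


# Toy 3-dimensional vectors in place of the 27 knowledge domains.
print(semantic_similarity([1, 2, 3], [2, 4, 6], wordnet_score=0.4))
```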

Step 1: Query Augmentation
• Extract keywords from the context surrounding the user-selected text, using the semantic network
    The context is typically about 50 words
• Use a clustering algorithm to construct several queries on different topics
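
The keyword-extraction part of this step might look roughly like the sketch below. It assumes a `similarity(selected, word)` function backed by the semantic network and simply keeps the `k` context words most related to the selected text. The name `augment_query`, the parameter `k`, and the toy scores are hypothetical, and the clustering into several topical queries is left out.

```python
def augment_query(selected, context_words, similarity, k=3):
    """Return the selected text plus the k most related context words."""
    ranked = sorted(
        context_words,
        key=lambda word: similarity(selected, word),
        reverse=True,
    )
    return [selected] + ranked[:k]


# Toy relatedness scores standing in for the semantic network.
toy_scores = {"car": 0.9, "engine": 0.8, "zoo": 0.2, "the": 0.1}
query = augment_query(
    "jaguar",
    ["the", "car", "zoo", "engine"],
    similarity=lambda selected, word: toy_scores[word],
    k=2,
)
print(query)
```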

Step 2: Search Engine Selection
• IntelliZap is a meta search engine
• It uses several general search engines (such as Google and AltaVista)
• For several domains, specific search engines (such as WebMD and FindLaw) are assigned a priori
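
An a priori assignment of this kind can be sketched as a lookup table. The engine names come from the slide; the domain labels `"medicine"` and `"law"`, the function name, and the choice to always query the general engines as well are assumptions.

```python
GENERAL_ENGINES = ["Google", "AltaVista"]

# Hypothetical domain labels mapped to the domain-specific engines.
DOMAIN_ENGINES = {
    "medicine": ["WebMD"],
    "law": ["FindLaw"],
}


def select_engines(query_domain):
    """General engines plus any engine assigned a priori to the domain."""
    return GENERAL_ENGINES + DOMAIN_ENGINES.get(query_domain, [])


print(select_engines("law"))
print(select_engines("sports"))
```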

Step 3: Results Reranking
• Several search engines return several lists of results
• Use the semantic network to calculate the distance between the result titles/summaries and the text/context
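
The reranking step can be sketched as: merge the result lists, score each result's title or summary against the context, and sort. Word-overlap (Jaccard) similarity is used here only as a stand-in for the semantic-network distance, which the slides do not specify; `rerank` and the `(url, snippet)` result shape are likewise assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity; a placeholder for the semantic network."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rerank(result_lists, context, score=jaccard):
    """Merge several result lists and order URLs by similarity to context.

    Each result is a (url, snippet) pair; duplicate URLs are kept once.
    """
    snippets = {}
    for results in result_lists:
        for url, snippet in results:
            snippets.setdefault(url, snippet)
    return sorted(snippets, key=lambda url: score(snippets[url], context), reverse=True)


lists = [
    [("b", "jaguar cat zoo"), ("c", "used bike prices")],
    [("a", "jaguar car dealer"), ("b", "jaguar cat zoo")],
]
print(rerank(lists, "buy a jaguar car"))
```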

Evaluation Method
• State of the art: no benchmark is available
• Subjects are recruited by an external agency
• Subjects do not know the objective of the experiments; they are only asked to search and evaluate the results

Experiment Results

Concerns
• Privacy and security
    Million-user information database at My Yahoo!
    Users can be monitored through the queries they send
• Relevance consistency
    Communication problem

End
• Thank you!

Backup Slides

Web Statistics
• "Accessibility of Information on the Web", Steve Lawrence, Nature 1999

Semantic Relations
• Hypernymy: the semantic relation of being superordinate or belonging to a higher rank or class
    Synonym: superordination
• Hyponymy: the semantic relation of being subordinate or belonging to a lower rank or class
    Synonym: subordination
• Meronymy: the semantic relation that holds between a part and the whole
    Synonym: part-to-whole relation
• Holonymy: the semantic relation that holds between a whole and its parts
    Synonym: whole-to-part relation
• More at http://dictionary.metor.com/wnet/

Clustering Algorithm
• Traditional clustering algorithms do not work because of the large amount of noise and the small amount of information available
    50 context words represented in a 27-dimensional space
• Special high-dimensional clustering algorithm
    Performs recurrent clustering analysis (averaging over iterations)
    Refines the results statistically

Limitations of the Web
• Freshness
• Coverage (only the publicly indexable web)
• Bias (sites are not indexed equally)

Several Systems (1)
• Inquirus2: meta search engine
• Watson Project (Jay Budzik, NWU): uses the contents of full documents being edited in MS Word or viewed in Explorer
• Remembrance Agent (Bradley Rhodes, MIT): software agent for just-in-time information retrieval

Several Systems (2)
• Outride: spun off from Xerox PARC in 2000 as GroupFire, renamed Outride in 2001

Reference
• [1] Graphics, Visualization and Usability Center. GVU’s 10th WWW User Survey, 1998.