Lecture 16: Web Search Engines
Oct. 27, 2006
ChengXiang Zhai
Characteristics of Web Information
• “Infinite” size (surface vs. deep Web)
  • Surface = static HTML pages
  • Deep = dynamically generated HTML pages (DB)
• Semi-structured
  • Structured = HTML tags, hyperlinks, etc.
  • Unstructured = text
• Diverse formats (PDF, Word, PS, …)
• Multi-media (textual, audio, images, …)
• High variance in quality (much junk)
• “Universal” coverage (can be about any content)
General Challenges in Web Search
• Handling the size of the Web
  • How to ensure completeness of coverage?
  • Efficiency issues
• Dealing with or tolerating errors and low-quality information
• Addressing the dynamics of the Web
  • Some pages may disappear permanently
  • New pages are constantly created
Tasks in Web Information Management
• Information Access
  • Search (search engines, e.g., Google)
  • Navigation (browsers, e.g., IE)
  • Filtering (recommender systems, e.g., Amazon)
• Information Organization
  • Categorization (Web directories, e.g., Yahoo!)
  • Clustering (organizing search results, e.g., Vivisimo)
• Information Discovery (mining, e.g., ??)
Typology of Web Search Engines
[Diagram: a Web browser sits on top of three kinds of services]
• Directory service (e.g., Yahoo!): based on directories
• Meta engine (e.g., Vivisimo): based on engine profiles
• Search engine (e.g., Google): a spider/robot builds a Web index
Examples of Web Search Engines
[Diagram: a Web browser on top of directory services, meta engines, and autonomous engines]
Basic Search Engine Technologies
[Diagram: the user’s browser sends a query to the retriever, which returns results from the (inverted) index and cached pages; the crawler fetches pages from the Web (using host info.) and feeds the indexer]
• Key concerns: efficiency!!! (retriever and indexer), precision (results), coverage and freshness (crawler), error/spam handling
Component I: Crawler/Spider/Robot
• Building a “toy crawler” is easy (a minimal sketch follows this slide)
  • Start with a set of “seed pages” in a priority queue
  • Fetch pages from the Web
  • Parse fetched pages for hyperlinks; add them to the queue
  • Follow the hyperlinks in the queue
• A real crawler is much more complicated…
  • Robustness (server failures, traps, etc.)
  • Crawling courtesy (server load balancing, robot exclusion, etc.)
  • Handling file types (images, PDF files, etc.)
  • URL extensions (CGI scripts, internal references, etc.)
  • Recognizing redundant pages (identical pages and duplicates)
  • Discovering “hidden” URLs (e.g., truncated ones)
• Crawling strategy is a main research topic (i.e., which page to visit next?)
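As a concrete illustration, here is a minimal Python sketch of the “toy crawler” above. The seed URLs, page limit, and plain breadth-first queue are assumptions for illustration; a real crawler would add the robustness and courtesy features listed on the slide.

```python
# A minimal "toy crawler" sketch: seed queue -> fetch -> parse links -> enqueue.
# Illustrative only; a real crawler needs robots.txt handling, politeness
# delays, content-based deduplication, file-type handling, and much more.
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    queue = deque(seeds)              # a priority queue in a real crawler
    seen = set(seeds)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # robustness: skip failed fetches
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)   # resolve relative URLs
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

The FIFO queue gives breadth-first crawling, the most common strategy (next slide); swapping in a priority queue turns this into a focused or incremental crawler.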
Major Crawling Strategies
• Breadth-first (most common; balances server load)
• Parallel crawling
• Focused crawling
  • Targets a subset of pages (e.g., all pages about “automobiles”)
  • Typically given a query
• Incremental/repeated crawling
  • Can learn from past experience
  • Probabilistic models are possible
The major challenge remains maintaining “freshness” and good coverage with minimal resource overhead.
Component II: Indexer
• Standard IR techniques are the basis
  • Lexicons
  • Inverted index with positions (a minimal sketch follows)
• General challenges (at a much larger scale!)
  • Basic indexing decisions (stop words, stemming, numbers, special symbols)
  • Indexing efficiency (space and time)
  • Updating
• Additional challenges
  • Recognizing spam/junk
  • How to index anchor text?
  • How to support fast summary generation?
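A minimal Python sketch of a positional inverted index. The tokenizer and stop-word list here are deliberately naive assumptions; real indexers make the indexing decisions listed above and build the index out-of-core at Web scale.

```python
# A minimal positional inverted index: term -> {doc_id: [positions]}.
from collections import defaultdict

STOP_WORDS = {"the", "a", "an", "of", "to", "and"}   # illustrative only

def tokenize(text):
    # Naive tokenization; real systems decide on stemming, numbers, symbols.
    return [t for t in text.lower().split() if t.isalnum()]

def build_index(docs):
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            if term not in STOP_WORDS:
                index[term][doc_id].append(pos)   # record position for phrases
    return index

docs = {1: "web search engines crawl the web", 2: "search the web index"}
index = build_index(docs)
print(dict(index["web"]))   # {1: [0, 5], 2: [2]}: positions enable phrase queries
```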
Google’s Solutions
[Diagram: Google’s indexing architecture]
• URL queue/list
• Cached source pages (compressed)
• Inverted index using many features, e.g., font, layout, …
• Hypertext structure
Component III: Retriever
• Standard IR models apply but aren’t sufficient
  • Different information needs (home page finding vs. topic-driven search)
  • Documents have additional information (hyperlinks, markup, URLs)
  • Information is often redundant, and quality varies a lot
  • Server-side feedback is often not feasible
• Major extensions
  • Exploiting links (anchor text, link-based scoring)
  • Exploiting layout/markup (font, title field, etc.)
  • Spelling correction
  • Spam filtering
  • Redundancy elimination
Effective Web Retrieval Heuristics
• High accuracy in home page finding can be achieved by
  • Matching the query against the title
  • Matching the query against the anchor text
  • Plus URL-based or link-based scoring (e.g., PageRank)
• Imposing a conjunctive (“and”) interpretation of the query is often appropriate
  • Queries are generally very short (all words are necessary)
  • The size of the Web makes it likely that at least one page will match all the query words
(A minimal scoring sketch combining these heuristics follows.)
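A hypothetical Python sketch of how these heuristics might combine: conjunctive filtering on the body, extra credit for title and anchor matches, multiplied by a link-based prior. The field weights and the link_prior value are illustrative assumptions, not tuned values from any real engine.

```python
# Sketch: conjunctive ("and") filter + title/anchor matching + link prior.
def score_page(query_terms, page):
    body = set(page["body"])
    if not all(t in body for t in query_terms):   # conjunctive interpretation
        return 0.0
    title_hits = sum(t in page["title"] for t in query_terms)
    anchor_hits = sum(t in page["anchor"] for t in query_terms)
    # Title and anchor matches count more than plain body matches (weights
    # 2.0 and 1.5 are assumptions), combined with a link prior like PageRank.
    return (1.0 + 2.0 * title_hits + 1.5 * anchor_hits) * page["link_prior"]

page = {"title": {"haas", "school"}, "anchor": {"haas", "business"},
        "body": {"haas", "business", "school", "berkeley"}, "link_prior": 0.8}
print(score_page(["haas", "business", "school"], page))   # 6.4
```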
Home/Entry Page Finding Evaluation Results (TREC 2001)

                                                     MRR   %top10  %fail
  Unigram query likelihood + link/URL prior,        0.774   88.3    4.8
  i.e., p(Q|D) p(D) [Kraaij et al. SIGIR 2002]      0.772   87.6    4.8
  (top runs exploit anchor text, structure, or links)
  Other runs                                        0.382   62.1   11.7
                                                    0.340   60.7   15.9

Query example: Haas Business School
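Spelled out, the top runs score documents by query likelihood with a document prior. A sketch in standard notation (the concrete prior in Kraaij et al. is based on URL/link evidence):

```latex
% Query likelihood with a link/URL-based document prior
\operatorname{score}(Q, D) \;\propto\; p(Q \mid D)\, p(D)
  \;=\; p(D) \prod_{i=1}^{m} p(q_i \mid D)
% p(D): prior from URL form or link evidence;
% p(q_i \mid D): smoothed unigram document language model.
```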
Named Page Finding Evaluation Results (TREC 2002)
[Chart comparing MRR of: Okapi/BM25 + anchor text; Dirichlet prior + title and anchor text (Lemur) [Ogilvie & Callan SIGIR 2003]; the best content-only run]
Query example: America’s century farms
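For reference, a sketch of the standard Dirichlet-prior smoothed estimate behind such language-model runs (this spelled-out form is added here; it is not on the slide itself):

```latex
% Dirichlet prior smoothing of a document (or part) language model
p_{\mu}(w \mid D) \;=\; \frac{c(w, D) + \mu\, p(w \mid C)}{|D| + \mu}
% c(w,D): count of w in D;  |D|: length of D;
% p(w \mid C): collection language model;  \mu: smoothing parameter.
```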
Many heuristics have been proposed for Web retrieval, but is there a model for Web retrieval? How do we extend regular text retrieval models to the Web? Mostly an open question!
“Free text” vs. “Structured text”
• So far, we’ve assumed “free text”
  • Document = word sequence
  • Query = word sequence
  • Collection = a set of documents
  • Minimal structure…
• But we may have structure in text (e.g., titles, hyperlinks)
  • Can we exploit the structure in retrieval?
  • Sometimes structure may be more important than the text (moving closer to relational databases…)
Examples of Document Structures
• Intra-document structures (= relations among components)
  • Natural components: title, author, abstract, sections, references, …
  • Annotations: named entities, subtopics, markup, …
• Inter-document structures (= relations between documents)
  • Topic hierarchy
  • Hyperlinks/citations (hypertext)
Structured Text Collection
[Diagram: documents organized under a general topic with Subtopic 1 … Subtopic k]
General question: how do we search such a collection?
Challenges in Structured Text Retrieval (STR)
• Challenge 1 (Information/Doc Unit): What is an appropriate information unit?
  • The document may no longer be the most natural unit
  • Components within a document, or a whole Web site, may be more appropriate
• Challenge 2 (Query): What is an appropriate query language?
  • A keyword (free-text) query is no longer the only choice
  • Constraints on the structure can be posed (e.g., searching in a particular field or matching a URL domain)
Challenges in STR (cont.)
• Challenge 3 (Retrieval Model): What is an appropriate model for STR?
  • Ranking vs. selection
    • The Boolean model may be more powerful with structure (e.g., domain constraints in Web search)
    • But ranking may still be desirable
  • Multiple (potentially conflicting) preferences
    • Text-based ranking preferences (traditional IR)
    • Structure-based ranking preferences (e.g., PageRank)
    • Structure-based constraints (traditional DB)
    • Text-based constraints
  • How to combine them?
Challenges in STR (cont.)
• Challenge 4 (Learning): How can we learn a user’s information need over time or from examples?
  • How to support relevance feedback?
  • What about pseudo feedback?
  • How to exploit structure in text mining (unsupervised learning)?
• Challenge 5 (Efficiency): How can we index and search a structured text collection efficiently?
  • Time and space
  • Extensibility
These challenges are mostly still open questions… We now look at some retrieval models that exploit intra-document and inter-document structures…
Exploiting Intra-document Structures [Ogilvie & Callan 2003]
[Diagram: document D split into parts D1 … Dk: title, abstract, body part 1, body part 2, …]
• Think of the query-likelihood model: select a part Dj, then generate a query word from Dj
• The “part selection” probability serves as the weight of Dj and can be trained using EM
• Anchor text can be treated as a “part” of a document
• Intuitively, we want to combine all the parts but give more weight to some parts
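In formulas, a sketch of this mixture-of-parts model (the standard spelled-out form; λj is the part-selection probability, trainable with EM):

```latex
% Mixture of part language models (title, abstract, body parts, anchor text)
p(Q \mid D) \;=\; \prod_{i=1}^{m} \sum_{j=1}^{k} \lambda_j\, p(q_i \mid D_j),
\qquad \sum_{j=1}^{k} \lambda_j = 1
% \lambda_j: "part selection" probability, i.e., the weight of part D_j;
% p(q_i \mid D_j): (smoothed) language model of part D_j.
```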
Exploiting Inter-document Structures
• The document collection has links (e.g., the Web, literature citations)
• Query: a text query
• Results: a ranked list of documents
• Challenge: how to exploit links to improve ranking?
Exploiting Inter-Document Links
[Diagram: what does a link tell us?]
• Description (“anchor text”): “extra text”/summary for a doc
• Links indicate the utility of a doc (authority and hub roles)
PageRank: Capturing Page “Popularity” [Page & Brin 98]
• Intuitions
  • Links are like citations in the literature
  • A page that is cited often can be expected to be more useful in general
• PageRank is essentially “citation counting”, but improves over simple counting
  • Considers “indirect citations” (being cited by a highly cited paper counts a lot…)
  • Smooths citation counts (every page is assumed to have a non-zero citation count)
• PageRank can also be interpreted as random surfing (thus capturing popularity)
The PageRank Algorithm (Page et al. 98)
Random surfing model: at any page,
• with probability α, randomly jump to a page;
• with probability 1 - α, randomly pick a link to follow.

[Diagram: example link graph over pages d1, d2, d3, d4]

Let M be the “transition matrix” (M_ij = probability of following a link from d_i to d_j), N = # pages, and I_ij = 1/N. We want the stationary (“stable”) distribution, so we ignore time:

  p(d_j) = Σ_i [ α (1/N) + (1 - α) M_ij ] p(d_i),   i.e.,   p = (α I + (1 - α) M)^T p

Initial value p(d) = 1/N; iterate until convergence.
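A minimal Python sketch of this power iteration. The 4-page example graph and α = 0.15 (the value discussed on the next slide) are illustrative assumptions; dangling pages are handled by always jumping.

```python
# PageRank by power iteration: p <- (alpha * I + (1 - alpha) * M)^T p.
def pagerank(links, alpha=0.15, iters=100, tol=1e-10):
    pages = sorted(links)
    n = len(pages)
    p = {d: 1.0 / n for d in pages}         # initial value p(d) = 1/N
    for _ in range(iters):
        new_p = {}
        for d in pages:
            rank = alpha / n                # random-jump component
            for s in pages:
                out = links[s]
                if out:
                    if d in out:            # follow a random outlink of s
                        rank += (1 - alpha) * p[s] / len(out)
                else:                       # dangling page: jump anywhere
                    rank += (1 - alpha) * p[s] / n
            new_p[d] = rank
        if max(abs(new_p[d] - p[d]) for d in pages) < tol:
            p = new_p
            break                           # converged
        p = new_p
    return p

links = {"d1": {"d2", "d3"}, "d2": {"d4"}, "d3": {"d4"}, "d4": {"d1"}}
print(pagerank(links))   # d4 accumulates rank from both d2 and d3
```

Spreading a dangling page’s mass uniformly keeps the scores summing to 1, which is one way to handle the problem raised on the next slide.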
PageRank in Practice
• Interpretation of the damping factor (0.15)
  • Probability of a random jump
  • Smooths the transition matrix (avoids zeros)
• Normalization doesn’t affect ranking, which leads to some variants
• The zero-outlink problem: the p(d_i)’s don’t sum to 1
  • One possible solution: a page-specific damping factor (= 1.0 for a page with no outlink)
HITS: Capturing Authorities & Hubs [Kleinberg 98]
• Intuitions
  • Pages that are widely cited are good authorities
  • Pages that cite many other pages are good hubs
• The key idea of HITS
  • Good authorities are cited by good hubs
  • Good hubs point to good authorities
  • Iterative reinforcement…
The HITS Algorithm [Kleinberg 98]
[Diagram: example link graph over pages d1, d2, d3, d4]

Let A be the “adjacency matrix” (A_ij = 1 if d_i links to d_j, 0 otherwise); a = authority scores, h = hub scores.

Initial values: a = h = 1
Iterate:   a = A^T h;   h = A a
Normalize: Σ_i a(d_i)² = 1,   Σ_i h(d_i)² = 1
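A minimal Python sketch of this iterative reinforcement; the small example graph is an illustrative assumption.

```python
# HITS: a <- A^T h, h <- A a, normalizing after each iteration.
def hits(links, iters=50):
    pages = sorted(links)
    a = {d: 1.0 for d in pages}     # initial authority scores a = 1
    h = {d: 1.0 for d in pages}     # initial hub scores h = 1
    for _ in range(iters):
        # authority(d) = sum of hub scores of pages linking to d  (a = A^T h)
        a = {d: sum(h[s] for s in pages if d in links[s]) for d in pages}
        # hub(d) = sum of authority scores of pages d links to    (h = A a)
        h = {d: sum(a[t] for t in links[d]) for d in pages}
        # normalize so the scores stay bounded
        na = sum(v * v for v in a.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in h.values()) ** 0.5 or 1.0
        a = {d: v / na for d, v in a.items()}
        h = {d: v / nh for d, v in h.items()}
    return a, h

links = {"d1": {"d2", "d3"}, "d2": {"d4"}, "d3": {"d4"}, "d4": set()}
auth, hub = hits(links)
print(max(auth, key=auth.get))   # d4: cited by the good hubs d2 and d3
```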
The problem of how to combine such link-based scores with content-based scores is still unsolved.
Next Generation Search Engines
• More specialized/customized
  • Special groups of users (community engines, e.g., Citeseer)
  • Personalized
  • Special genres/domains
• Learning over time (evolving)
• Integration of search, navigation, and recommendation/filtering (full-fledged information management)
What You Should Know
• Special characteristics of Web information
• Major challenges in Web information management
• Basic search engine technology (crawling, indexing, ranking)
• How PageRank and HITS work