Slide 1:Web Searching
Web searching paradigms. Search engine: page discovery, indexing, ranking. Metasearcher. Measuring search.
Slide 2:Basic Search Paradigms
Querying: User formulates a query and sends it to a search engine (server). The search engine processes the query and sends back a set of pages as the response. Browsing: User visits successive web pages and “navigates” through the Web. Interleaved: User sends an “approximate” query, obtains a set of pages from the search engine, and browses from the page that seems most relevant. While browsing, the user may formulate a “more accurate” query.
Slide 3:Basic Search Paradigms (Cont’d)
Web Directories: A web page organized in a “table of contents” style as a directory of main topics. An attempt to classify a portion of the Web. An alternative to querying/browsing: the user scans the directory and may also issue a query within a directory. Retrieved information is usually of higher relevance (quality), because limiting the domain sharpens searching.
Slide 4:Search Engine
A server that processes users’ queries and returns links to relevant pages. Main functional components: Web crawler (or spider): crawls the Web to discover pages. Indexer: extracts key-terms (keywords or phrases) from a page and associates them with the link to the page; produces a data structure for efficient search. Query processor: processes user queries against the index. User interface: solicits users’ queries and presents responses.
Slide 5:Search Engine Architecture
[Diagram: search engine architecture. Components shown: Users, User Interface, Query Processor, Index, Indexer, Web Crawler, Web.]
Slide 6:Web Crawler
Recursive crawl: Start with an initial URL, get the page, and extract the other links within the page. Different traversal modes: breadth-first, depth-first. Multiple crawlers: Partition the Web (say, using domain names) and assign crawlers to different partitions. Crawling may load down Web servers; some servers may restrict access from crawlers.
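The breadth-first traversal mentioned on this slide can be sketched as follows. This is a minimal illustration, not how any particular search engine crawls: the `TOY_WEB` link graph and the `toy_fetch` helper are hypothetical stand-ins for fetching a page and extracting its links, and no network access is involved.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB = {
    "a.example/": ["a.example/1", "b.example/"],
    "a.example/1": ["a.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def toy_fetch(url):
    """Stand-in for 'get page and extract links' (assumption for this sketch)."""
    return TOY_WEB.get(url, [])


def crawl_breadth_first(start_url, fetch_links, max_pages=100):
    """Return URLs in the order a breadth-first crawl visits them."""
    seen = {start_url}          # URLs already discovered, to avoid revisiting
    queue = deque([start_url])  # FIFO queue gives breadth-first order
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl_breadth_first("a.example/", toy_fetch))
```

Taking URLs from the right end of the queue instead of the left would give a depth-first-style traversal. The `max_pages` cap is one simple way to bound the load a crawl places on servers.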
Slide 7:Web Crawler (cont’d)
Revisit frequency: Pages may get updated or deleted. Page owner submission: A page is submitted to the search engine site. The crawler may start an exhaustive crawl from that page, or may search only to a limited depth.
Slide 8:Indexer
Scans page to extract keywords (or key-terms). Builds an inverted list (or inverted file): for each keyword, a set of pointers to the pages (actually, links to the pages) where the word appears. The pointer to a page also includes a weight: an indication of the relevance of the page with respect to the keyword. A commonly used weight factor is the frequency count. Also included is some description of the page: title, size, a few lines containing the keyword, etc. Typical storage for 100 million pages: 50 Gbyte for page (URL) descriptions (at 500 bytes each); 150 Gbyte for the inverted list.
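As a rough sketch of the inverted list described here, the following builds a keyword-to-pages mapping with the frequency count as the weight. The page texts and URLs are made up for illustration, and splitting on whitespace is a simplifying assumption; a real indexer would also store page descriptions.

```python
from collections import Counter, defaultdict


def build_inverted_list(pages):
    """pages: dict of URL -> page text. Returns keyword -> [(weight, url), ...]."""
    inverted = defaultdict(list)
    for url, text in pages.items():
        counts = Counter(text.lower().split())  # frequency count per word
        for word, freq in counts.items():
            inverted[word].append((freq, url))
    # Order each keyword's entries by descending weight (ties broken by URL).
    for postings in inverted.values():
        postings.sort(key=lambda entry: (-entry[0], entry[1]))
    return dict(inverted)


# Hypothetical pages, for illustration only.
pages = {
    "fish.example/a": "catfish is a good fish and catfish farms",
    "fish.example/b": "planning a fish dinner",
}
index = build_inverted_list(pages)
print(index["catfish"])  # [(2, 'fish.example/a')]
print(index["fish"])     # [(1, 'fish.example/a'), (1, 'fish.example/b')]

# Storage estimate from the slide: 100 million pages at 500 bytes each.
print(100_000_000 * 500 / 1e9, "Gbyte")  # 50.0 Gbyte
```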
Slide 9:Inverted List
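[Figure: inverted list. Each lexicon (or vocabulary) entry points to a chain of page entries, each holding a weight, a link (URL), a page description, and a pointer to the next page. Example entries: weight 8, www.catf.., “Catfish Institute ..”; weight 5, www.plann...., “..is a good fish ..”.]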
Slide 10:Inverted List (Cont’d)
The inverted list is organized to optimize searching. The entries are sorted to allow: Binary search: search time ~ log2 N; log2(1 billion) ~ 30. Interpolation search: search time ~ log2(log2 N); log2(log2(1 billion)) ~ 5; substantial processing overhead. The set of pages (links) corresponding to a keyword can be ordered by the weight. One commonly used weight is the frequency count: the number of occurrences of the word in the page.
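The binary search over sorted entries, and the two search-time estimates on this slide, can be checked with a short sketch. The lexicon contents and the `lookup` helper are hypothetical; the search itself uses Python's standard `bisect` module.

```python
import math
from bisect import bisect_left


def lookup(sorted_lexicon, keyword):
    """Return the position of keyword in the sorted lexicon, or -1 if absent."""
    i = bisect_left(sorted_lexicon, keyword)  # binary search: ~log2(N) steps
    if i < len(sorted_lexicon) and sorted_lexicon[i] == keyword:
        return i
    return -1


# Hypothetical lexicon, kept in sorted order.
lexicon = sorted(["catfish", "dinner", "fish", "good", "planning"])
print(lookup(lexicon, "fish"))   # 2
print(lookup(lexicon, "shark"))  # -1

# Search-time estimates for N = 1 billion entries.
n = 1_000_000_000
print(round(math.log2(n)))             # 30
print(round(math.log2(math.log2(n))))  # 5
```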
Slide 11:Indexing a Page
Scan page to extract keywords (and/or key-terms). Ignore stopwords (the, a, an, and, or, I, you, etc.): the 100 most frequent words account for ~50% of a document. Stemming: Replace all variants of a word with the single stem of the word. Communicate: communicates, communicating, communicated, communication, … Stopword elimination and stemming reduce inverted list size and improve search speed. There are various possibilities for deciding the weight of indexed words or terms: frequency count, appearance in title, etc., and combinations.
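Stopword elimination and stemming can be illustrated with a deliberately naive sketch. The stopword set and the suffix list below are assumptions chosen so that the slide's "communicate" example works; production systems use far more careful stemming algorithms.

```python
import re

# Small illustrative stopword set (assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "and", "or", "i", "you"}

# Naive suffix list, checked in order (assumption for this sketch only).
SUFFIXES = ("ion", "ing", "es", "ed", "e", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 leading letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_keywords(text):
    """Lowercase, tokenize, drop stopwords, and stem the remaining words."""
    words = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(w) for w in words if w not in STOPWORDS]


variants = ["communicate", "communicates", "communicating",
            "communicated", "communication"]
print({naive_stem(v) for v in variants})  # {'communicat'}
print(extract_keywords("The server communicates, communicating and communicated"))
```

All five variants collapse to one stem, so the inverted list needs a single entry instead of five.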
Slide 12:User Interface - Query Specification
Basic specification: keywords. Disjunction (OR), e.g., AltaVista; conjunction (AND), e.g., Google. Advanced query interface: Boolean operators (AND, OR, etc.); phrase match (phrase in quotes, e.g., “tender heart”); proximity, wild cards; filtering by date, internet domain, etc. In AltaVista, terms can be specified by importance (separate from the query specification). Content: multimedia, .PDF, .PPT files.
Slide 13:Query Processing
Searching the inverted list: List entries are in sorted order, which significantly reduces search time. Query term processing: Boolean, e.g., OR corresponds to the union of search results, AND to the intersection, etc. May require additional information stored for each page (link): e.g., PROXIMITY requires storing positions of keywords. Ranking: Determining the order (or priority) for presentation to the user.
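The Boolean processing described here maps directly onto set operations. The sketch below assumes a hypothetical index that stores, for each keyword, the set of page identifiers containing it; the names `toy_index`, `query_or`, and `query_and` are illustrative only.

```python
# Hypothetical index: keyword -> set of page identifiers containing it.
toy_index = {
    "catfish": {"p1", "p3"},
    "fish": {"p1", "p2", "p3"},
    "dinner": {"p2"},
}


def query_or(index, terms):
    """OR: union of the page sets of all terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result


def query_and(index, terms):
    """AND: intersection of the page sets of all terms."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    return set.intersection(*postings)


print(sorted(query_or(toy_index, ["catfish", "dinner"])))   # ['p1', 'p2', 'p3']
print(sorted(query_and(toy_index, ["catfish", "fish"])))    # ['p1', 'p3']
print(sorted(query_and(toy_index, ["catfish", "dinner"])))  # []
```

A PROXIMITY operator could not be answered from this structure, since it keeps no keyword positions.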
Slide 14:Ranking
Basic model: Page (document) modeled as a vector of keyword-weight pairs: P = {(kw1, w1), (kw2, w2), …, (kwt, wt)}. Query modeled as a “specification” for the desired page(s) (ideal answer to query): Q = {(kw1, u1), (kw2, u2), …, (kwt, ut)}. The ranking algorithm calculates a rank value R(P, Q). An example R: R = Σi wi ui. The weight is used to rank a page and can be made to depend on: presence of the keyword in the title of the document; frequency/count of the keyword in the document; link popularity (how many other pages point to this one); and/or combinations.
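The example rank function, a sum over keywords of page weight times query weight, can be sketched as a dot product over dictionaries. The pages, weights, and query below are invented for illustration.

```python
def rank_value(page_weights, query_weights):
    """R(P, Q): sum over keywords of page weight times query weight."""
    return sum(w * query_weights.get(kw, 0.0) for kw, w in page_weights.items())


# Hypothetical keyword weights (e.g., frequency counts) for two pages.
toy_pages = {
    "fish.example/a": {"catfish": 8, "fish": 5, "dinner": 1},
    "fish.example/b": {"fish": 2, "dinner": 4},
}
toy_query = {"catfish": 1.0, "fish": 0.5}

scores = {url: rank_value(weights, toy_query) for url, weights in toy_pages.items()}
ranked = sorted(scores, key=lambda url: -scores[url])
print(scores)  # {'fish.example/a': 10.5, 'fish.example/b': 1.0}
print(ranked)  # ['fish.example/a', 'fish.example/b']
```

Keywords absent from the query contribute zero, so only the terms the user asked about affect the order.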
Slide 15:Ranking (Cont’d)
Example: PageRank algorithm (used by Google). Link popularity is used to help rank a page. A link from page A to page B is interpreted as a vote (by A) for B. A vote cast by a page that is more “important” carries a higher rank (or weight) value and makes the voted-for page more “important”. Hence, the rank value of a page is based on the rank values of the pages that reference it. The rank also takes into consideration other, more traditional factors such as keyword frequency counts, etc.
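The voting idea can be sketched with a simplified iterative computation. This is a textbook-style approximation, not Google's production algorithm: the damping factor of 0.85, the iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for this illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links: dict of page -> list of pages it links to (all pages are keys)."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each page splits its vote equally among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: C is referenced by A, B, and D.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
for page in sorted(ranks, key=lambda p: -ranks[p]):
    print(page, round(ranks[page], 3))
```

Page C ends up with the highest value because it collects votes from three pages, and A ranks above B because its single vote comes from the highly ranked C.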
Slide 16:Search Engine Differences
Coverage (number of documents). Web crawler algorithms. Frequency and depth of visits. Indexing algorithms. Search interfaces. Ranking algorithms.
Slide 17:Search Engines Sizes
[Chart: search engine sizes and searches per day (millions). Searches/day values shown: 150, 12, 50, 80. Legend: AV = AltaVista, EX = Excite, FAST = FAST, GG = Google, Go = Go (Infoseek), INK = Inktomi, NL = Northern Light, WT = WebTop.com. Shaded data for Google and Inktomi includes pages indexed but not visited. Source: SearchEngineWatch.com, Dec. 11, 2001.]
Slide 18:Metasearcher
A search server that submits the same query to several search engines and collects the answers. Exploits the efforts of many different search engines. Saves the user’s effort of sending queries to multiple servers. A page that is retrieved by multiple search engines is likely to be more relevant. Improved coverage. Examples: MetaCrawler, SavvySearch.
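A metasearcher's merge step can be sketched as vote counting: pages returned by more engines are listed first. The three engine functions below are hypothetical stand-ins that return fixed results; a real metasearcher would send the query to remote servers and might also use each engine's own ranking.

```python
from collections import Counter


# Hypothetical engines: each takes a query and returns a list of URLs.
def engine_a(query):
    return ["x.example", "y.example"]


def engine_b(query):
    return ["y.example", "z.example"]


def engine_c(query):
    return ["y.example", "x.example"]


def metasearch(query, engines):
    """Send the same query to every engine and merge the answers by vote count."""
    votes = Counter()
    for engine in engines:
        for url in set(engine(query)):  # count each engine at most once per URL
            votes[url] += 1
    # Most-retrieved pages first; ties broken alphabetically for a stable order.
    return sorted(votes, key=lambda url: (-votes[url], url))


print(metasearch("catfish", [engine_a, engine_b, engine_c]))
# ['y.example', 'x.example', 'z.example']
```

The merged list also shows the coverage benefit: `z.example` appears even though only one engine found it.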
Slide 19:Measuring Retrieval
Precision and Recall: For each query, the page collection is partitioned by the answer of the search (see diagram), where A = relevant retrieved, B = relevant not retrieved, C = not relevant retrieved, D = not relevant not retrieved. Precision = A / (A + C). Recall = A / (A + B). Precision can be estimated. Recall is difficult to estimate for a large collection such as the Web, where the complete set (which also comprises the pages not retrieved) may not be known.
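Using the A, B, and C counts defined on this slide, precision and recall can be computed from two sets: the pages retrieved and the pages that are actually relevant. The page identifiers below are invented, and the sketch assumes the full relevant set is known, which is exactly what is hard to obtain for the Web.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant pages."""
    a = len(retrieved & relevant)  # A: relevant and retrieved
    b = len(relevant - retrieved)  # B: relevant but not retrieved
    c = len(retrieved - relevant)  # C: retrieved but not relevant
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    return precision, recall


# Hypothetical query result and ground truth.
retrieved_pages = {"p1", "p2", "p3", "p4"}
relevant_pages = {"p1", "p2", "p5"}

precision, recall = precision_recall(retrieved_pages, relevant_pages)
print(precision)         # 0.5
print(round(recall, 3))  # 0.667
```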
Slide 20:Observations
Search engines are effective tools for eCommerce: they enable buyers and sellers to find each other, and allow your visitors to search your site. You may submit your pages to Web directories. Search engine algorithms (especially ranking algorithms) are often proprietary. A search engine usually does not cover the Web completely. New paradigms should be developed in addition to keywords: link popularity; extension to images, audio, video, etc.
Slide 21:Web Searching
List of search engines: http://www.searchenginecolossus.com/
Search engine resources: http://www.pandia.com/resources/index.html
Submitting a page/site to a Web directory: http://dmoz.org/add.html
Adding a search engine to your web site: using KSearch