Lecture 16: Web Search Engines
CS 502: Computing Methods for Digital Libraries
Administration • Modem cards for laptops: collect from Upson 311 • Assignment 3: due April 4 at 10 p.m.
Web Crawlers • A web crawler builds an index of web pages by repeating a few basic steps (sketched in code below): • Maintain a list of known URLs, whether or not the corresponding pages have yet been indexed. • Select the URL of an HTML page that has not yet been indexed. • Retrieve the page and bring it back to a central computer. • An automatic indexing program creates an index record, which is added to the overall index. • Hyperlinks from the page to other pages are added to the list of URLs for future exploration.
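A minimal sketch of this loop in Python, assuming single-threaded fetching and a hypothetical index_page() placeholder for the automatic indexing program; real crawlers add politeness delays, robots.txt checks, and parallel fetching:

```python
# Minimal single-threaded crawler loop (illustrative sketch only).
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags on a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def index_page(url, page):
    # Hypothetical stand-in for the automatic indexing program.
    print(f"indexed {url} ({len(page)} characters)")

def crawl(seed_urls, max_pages=100):
    known = list(seed_urls)          # list of known URLs
    indexed = set()                  # URLs whose pages have been indexed
    while known and len(indexed) < max_pages:
        url = known.pop(0)           # select a URL that has not been indexed
        if url in indexed:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                page = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # skip pages that cannot be retrieved
        index_page(url, page)        # create and store an index record
        indexed.add(url)
        parser = LinkExtractor(url)
        parser.feed(page)
        known.extend(parser.links)   # hyperlinks queued for future exploration

# Example use: crawl(["http://www.example.com/"])
```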
Web Crawlers • Design questions: • What to collect: complex web sites, dynamic pages • How fast to collect: frequency of sweep, how often to try • How to manage parallel crawlers
Robots Exclusion • Example file: /robots.txt
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
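As a sketch, a crawler could apply a file like this with Python's standard urllib.robotparser before fetching a page; the example.com URLs below are just the ones from the file above:

```python
# Apply the exclusion file above before fetching a page (illustrative sketch).
from urllib.robotparser import RobotFileParser

robots_txt = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# The generic robot is excluded from /cyberworld/map/ ...
print(rp.can_fetch("*", "http://www.example.com/cyberworld/map/index.html"))           # False
# ... but cybermapper may go anywhere.
print(rp.can_fetch("cybermapper", "http://www.example.com/cyberworld/map/index.html"))  # True
```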
Automatic Indexing • Automatic indexing at its most basic must deal with: • millions of pages • created by thousands of people • with different concepts of how information should be structured. • Typical web pages provide meager clues for automatic indexing. • Some creators and publishers are even deliberately misleading; they fill their pages with terms that are likely to be requested by users.
An Example: AltaVista 1997 • Digital library concepts • Key Concepts in the Architecture of the Digital • Library. William Y. Arms Corporation for • National Research Initiatives Reston, Virginia... • http://www.dlib.org/dlib/July95/07arms.html - • size 16K - 7-Oct-96 - English • Repository References • Notice: HyperNews at union.ncsa.uiuc.edu will • be moving to a new machine and domain very • soon. Expect interruptions. Repository • References. This is a page. • http://union.ncsa.uiuc.edu/HyperNews/get/www/repo/references.html • - size 5K - 12-May-95 - English
Meta Tags • Elements within the HTML <head> • <meta name="publisher" content="OCLC"> • <meta name="creator" content="Weibel, Stuart L."> • <meta name="creator" content="Miller, Eric J."> • <meta name="title" content="Dublin Core Reference Page"> • <meta name="date" content="1996-05-28"> • <meta name="form" content="text/html"> • <meta name="language" content="en">
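A sketch of how an indexing program might pull such tags out of a page with Python's standard html.parser; the sample tags are taken from the list above:

```python
# Extract <meta name="..." content="..."> pairs from an HTML page (sketch).
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content:
                self.fields.setdefault(name, []).append(content)

extractor = MetaTagExtractor()
extractor.feed('<head><meta name="creator" content="Weibel, Stuart L.">'
               '<meta name="title" content="Dublin Core Reference Page"></head>')
print(extractor.fields)
# {'creator': ['Weibel, Stuart L.'], 'title': ['Dublin Core Reference Page']}
```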
Searching the Web Index • Web search programs use standard methods of information retrieval, but: • Index records are of low quality. • Users are untrained. • -> Search programs identify all records that vaguely match the query and supply them to the user in ranked order (see the sketch below). • Indexes are organized for efficient searching by large numbers of simultaneous users.
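A toy sketch of "vague match plus ranked order": an inverted index in which any record containing a query term matches, scored by shared term occurrences. The three documents are invented for illustration, and the scoring is far simpler than what real engines use:

```python
# Toy inverted index with ranked retrieval (illustrative only).
from collections import Counter, defaultdict

documents = {
    "d1": "digital library architecture and repositories",
    "d2": "web search engines index the web",
    "d3": "ranking algorithms for web search",
}

# Build the inverted index: term -> {document: term frequency}
index = defaultdict(Counter)
for doc_id, text in documents.items():
    for term in text.lower().split():
        index[term][doc_id] += 1

def search(query):
    """Return every document matching any query term, best score first."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return scores.most_common()

print(search("web search ranking"))
# [('d2', 3), ('d3', 3)] -- all vague matches, supplied in ranked order
```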
Searching the Web Index • Difficulties: • User interface • Duplicate elimination • Ranking algorithms
Page Ranks (Google)
The link matrix: rows are the cited pages P1-P6, columns are the citing pages P1-P6, and an entry is 1 where the citing page links to the cited page. In this example P1 is cited by 3 pages, P2 by 1, P3 by 1, P4 by 4, P5 by 1, and P6 by 2. The bottom row gives the number of links from each citing page: 2, 1, 4, 1, 2, 2.
Normalize by Number of Links from Page
Divide each column of the link matrix by the number of links from the corresponding citing page (2, 1, 4, 1, 2, 2), so that every column sums to 1. Call the resulting matrix B: B_ij = 1/n_j if page j links to page i, where n_j is the number of links from page j, and 0 otherwise.
Weighting of Pages
Initially all pages have weight 1: w1 = (1, 1, 1, 1, 1, 1).
Recalculate the weights: w2 = B w1.
Iterate until the weights converge to a fixed point: w = B w (sketched in code below).
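A sketch of this iteration (power iteration) in Python; the four-page link structure is invented for illustration and is not the matrix from the slide:

```python
# Power iteration on the normalized link matrix B (illustrative example).
import numpy as np

# links[j] lists the pages that page j cites (hypothetical 4-page web).
links = {0: [1, 2], 1: [2], 2: [0], 3: [0, 2]}
n = 4

# B[i, j] = 1 / (number of links from j) if page j cites page i, else 0.
B = np.zeros((n, n))
for j, cited in links.items():
    for i in cited:
        B[i, j] = 1.0 / len(cited)

w = np.ones(n)                    # initially all pages have weight 1
for _ in range(100):              # iterate w <- B w until the weights settle
    w_next = B @ w
    w_next *= n / w_next.sum()    # keep the weights summing to n
    if np.allclose(w_next, w):
        break
    w = w_next

print(np.argsort(-w))             # pages listed in rank order, highest weight first
```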
Google Ranks • w is the principal eigenvector of B (the eigenvector with the largest eigenvalue). • It ranks the pages by the links to them, normalized by the number of links from each citing page and weighted by the rank of the citing pages. • Google: • calculates the ranks for all pages (about 450 million) • lists hits in rank order
Computer Science Research • Academic research • Industrial R&D • Entrepreneurs
Example: Web Search Engines • Lycos (Mauldin, Carnegie Mellon) • Technical basis: • Research in text-skimming (Ph.D. thesis) • Pursuit free text retrieval engine (TREC) • Robot exclusion research (private interest) • Organizational basis: • Center for Machine Translation • Grant flexibility (DARPA)
Example: Web Search Engines • Google (Page and Brin, Stanford) • Technical basis: • Research in ranking hyperlinks (Ph.D. research) • Organizational basis: • Grant flexibility (NSF Digital Libraries Initiative) • Equipment grant (Hewlett Packard)
The Internet Graph • Theoretical research in graph theory • Six degrees of separation • Pareto distributions • Algorithms • Hubs and authorities (Kleinberg, Cornell) • Empirical data • Commercial (Yahoo!, Google, Alexa, AltaVista, Lycos) • Not-for-profit (Internet Archive)
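A minimal sketch of the hubs-and-authorities iteration (HITS) mentioned above: a good hub points to good authorities, and a good authority is pointed to by good hubs. The link data is invented for illustration:

```python
# Iterative hub/authority scoring (HITS), illustrative sketch only.
import numpy as np

# A[i, j] = 1 if page i links to page j (hypothetical 4-page web).
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 1, 0]], dtype=float)

hubs = np.ones(4)
auths = np.ones(4)
for _ in range(50):
    auths = A.T @ hubs              # authority: pointed to by good hubs
    hubs = A @ auths                # hub: points to good authorities
    auths /= np.linalg.norm(auths)  # normalize to keep the scores bounded
    hubs /= np.linalg.norm(hubs)

print("authorities:", np.argsort(-auths))
print("hubs:", np.argsort(-hubs))
```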
Google Statistics • The central system handles 5.5 million searches daily, increasing 20% per month. • 2,500 PCs running Linux; 80 terabytes of spinning disk; an average of 30 new machines per day. • The cache holds about 200 million HTML pages. • The aim is to crawl the web once per month. • 85 people; half are technical; 14 have a Ph.D. in computer science. • Comparison: Yahoo! has 100,000,000 registered users and dispatches half a billion pages to users per day.