
CS 502: Computing Methods for Digital Libraries

This lecture covers web crawlers, robots exclusion, automatic indexing, and the ranking algorithms behind web search engines.

Presentation Transcript


  1. Lecture 16: Web Search Engines
  CS 502: Computing Methods for Digital Libraries

  2. Administration
  • Modem cards for laptops: collect from Upson 311
  • Assignment 3: due April 4 at 10 p.m.

  3. Web Crawlers
  A web crawler builds an index of web pages by repeating a few basic steps:
  • Maintain a list of known URLs, whether or not the corresponding pages have yet been indexed.
  • Select the URL of an HTML page that has not been indexed.
  • Retrieve the page and bring it back to a central computer.
  • An automatic indexing program creates an index record, which is added to the overall index.
  • Hyperlinks from the page to other pages are added to the list of URLs for future exploration.
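
A minimal Python sketch of that loop, for illustration only: the seed URL, the page limit, and the use of raw page text as the "index record" are assumptions, and it ignores robots.txt, politeness delays, and dynamic pages.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        """Collects the href targets of <a> tags on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seed_url, max_pages=100):
        frontier = deque([seed_url])   # known URLs whose pages are not yet indexed
        indexed = set()                # URLs whose pages have been indexed
        index = {}                     # url -> page text (stand-in for a real index record)
        while frontier and len(indexed) < max_pages:
            url = frontier.popleft()
            if url in indexed:
                continue
            try:
                html = urlopen(url).read().decode("utf-8", errors="replace")
            except OSError:
                continue               # skip unreachable pages; a real crawler would retry later
            index[url] = html          # automatic indexing would build an index record here
            indexed.add(url)
            extractor = LinkExtractor()
            extractor.feed(html)
            for link in extractor.links:
                frontier.append(urljoin(url, link))   # hyperlinks queued for future exploration
        return index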

  4. Web Crawlers
  Design questions:
  • What to collect: complex web sites, dynamic pages
  • How fast to collect: frequency of sweep, how often to try
  • How to manage parallel crawlers
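
One way to make "how fast to collect" and "how to manage parallel crawlers" concrete is a per-host politeness delay. The sketch below is an illustrative assumption (including the 30-second default), not the policy of any particular engine.

    import time
    from urllib.parse import urlparse

    class PoliteScheduler:
        """Tracks when each host may next be fetched, so that parallel
        crawlers do not overload a single web site."""

        def __init__(self, delay_seconds=30.0):
            self.delay = delay_seconds
            self.next_allowed = {}     # hostname -> earliest permitted fetch time

        def ready(self, url):
            """True if enough time has passed since the last fetch from this host."""
            host = urlparse(url).netloc
            return time.time() >= self.next_allowed.get(host, 0.0)

        def record_fetch(self, url):
            """Note a fetch, pushing back the next permitted fetch for this host."""
            host = urlparse(url).netloc
            self.next_allowed[host] = time.time() + self.delay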

  5. Robots Exclusion
  Example file: /robots.txt

    # robots.txt for http://www.example.com/
    User-agent: *
    Disallow: /cyberworld/map/
    Disallow: /tmp/      # these will soon disappear
    Disallow: /foo.html

    # Cybermapper knows where to go.
    User-agent: cybermapper
    Disallow:
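
Python's standard library includes a parser for this format; a crawler might check it before each fetch. The sketch below feeds in the rules from the slide directly; the URLs passed to can_fetch are illustrative.

    from urllib.robotparser import RobotFileParser

    robots_txt = """\
    # robots.txt for http://www.example.com/
    User-agent: *
    Disallow: /cyberworld/map/
    Disallow: /tmp/
    Disallow: /foo.html

    # Cybermapper knows where to go.
    User-agent: cybermapper
    Disallow:
    """

    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())   # lines are stripped, so the indentation is harmless

    # The general-purpose rules exclude /tmp/ ...
    print(rp.can_fetch("*", "http://www.example.com/tmp/index.html"))             # False
    # ... but the cybermapper agent may fetch anything.
    print(rp.can_fetch("cybermapper", "http://www.example.com/cyberworld/map/"))  # True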

  6. Automatic Indexing
  • Automatic indexing at its most basic: millions of pages, created by thousands of people, with different concepts of how information should be structured.
  • Typical web pages provide meager clues for automatic indexing.
  • Some creators and publishers are even deliberately misleading: they fill their pages with terms that are likely to be requested by users.
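
At its simplest, automatic indexing reduces to building an inverted index from terms to the pages that contain them. The sketch below is a minimal illustration; the tokenizer and the tiny stop-word list are assumptions, not the methods discussed in the lecture.

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}   # assumed list

    def tokenize(text):
        """Lower-case the text and split it into alphanumeric terms."""
        return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

    def build_index(pages):
        """pages: dict mapping URL -> page text.
        Returns an inverted index: term -> {url: term frequency}."""
        index = defaultdict(lambda: defaultdict(int))
        for url, text in pages.items():
            for term in tokenize(text):
                index[term][url] += 1
        return index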

  7. An Example: AltaVista 1997
  Digital library concepts
    Key Concepts in the Architecture of the Digital Library. William Y. Arms,
    Corporation for National Research Initiatives, Reston, Virginia...
    http://www.dlib.org/dlib/July95/07arms.html - size 16K - 7-Oct-96 - English
  Repository References
    Notice: HyperNews at union.ncsa.uiuc.edu will be moving to a new machine
    and domain very soon. Expect interruptions. Repository References. This is a page.
    http://union.ncsa.uiuc.edu/HyperNews/get/www/repo/references.html - size 5K - 12-May-95 - English

  8. Meta Tags
  Elements within the HTML <head>:

    <meta name="publisher" content="OCLC">
    <meta name="creator" content="Weibel, Stuart L.">
    <meta name="creator" content="Miller, Eric J.">
    <meta name="title" content="Dublin Core Reference Page">
    <meta name="date" content="1996-05-28">
    <meta name="form" content="text/html">
    <meta name="language" content="en">
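
A crawler can read these elements with an ordinary HTML parser. A small sketch using Python's html.parser; the sample tag is taken from the slide, everything else is illustrative.

    from html.parser import HTMLParser

    class MetaTagReader(HTMLParser):
        """Collects <meta name="..." content="..."> pairs from a page."""
        def __init__(self):
            super().__init__()
            self.fields = {}

        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                attrs = dict(attrs)
                if "name" in attrs and "content" in attrs:
                    self.fields.setdefault(attrs["name"], []).append(attrs["content"])

    reader = MetaTagReader()
    reader.feed('<head><meta name="creator" content="Weibel, Stuart L."></head>')
    print(reader.fields)   # {'creator': ['Weibel, Stuart L.']}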

  9. Searching the Web Index
  Web search programs use standard methods of information retrieval, but:
  • Index records are of low quality.
  • Users are untrained.
  Therefore search programs identify all records that match the query even vaguely and supply them to the user in ranked order.
  Indexes are organized for efficient searching by large numbers of simultaneous users.
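
Continuing the inverted-index sketch after slide 6, a ranked search can return every page that matches any query term, ordered by a crude score (here, summed term frequencies). This is an illustration only; real engines use far more elaborate ranking.

    import re

    def search(index, query):
        """index: term -> {url: term frequency}, as built earlier.
        Returns (url, score) pairs, best match first."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        scores = {}
        for term in terms:
            for url, freq in index.get(term, {}).items():
                scores[url] = scores.get(url, 0) + freq   # even vague matches get some score
        return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)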

  10. Searching the Web Index
  Difficulties:
  • User interface
  • Duplicate elimination
  • Ranking algorithms
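
Of these, duplicate elimination is the most mechanical. One simple approach (an assumption here, not necessarily what any of these engines did) is to fingerprint the normalized page text and keep one URL per fingerprint; this catches exact duplicates but not near-duplicates.

    import hashlib

    def fingerprint(text):
        """Hash of the whitespace-normalized, lower-cased page text."""
        normalized = " ".join(text.split()).lower()
        return hashlib.sha1(normalized.encode("utf-8")).hexdigest()

    def drop_duplicates(pages):
        """pages: dict mapping URL -> page text.
        Keeps one URL for each distinct fingerprint."""
        seen = {}
        for url, text in pages.items():
            seen.setdefault(fingerprint(text), url)
        return set(seen.values())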

  11. Page Ranks (Google)
  [Link matrix over six pages: the columns are the citing pages P1-P6, the rows are the cited pages P1-P6, and an entry of 1 means that the citing page links to the cited page. The bottom row, "Number", gives the number of links from each citing page: 2, 1, 4, 1, 2, 2.]

  12. Normalize by Number of Links from Page
  [Each column of the link matrix is divided by the number of links from that citing page (2, 1, 4, 1, 2, 2), so the non-zero entries become fractions and every column sums to 1. The resulting matrix is called B.]

  13. Weighting of Pages
  Initially all pages have weight 1:
    w1 = (1, 1, 1, 1, 1, 1)
  Recalculate the weights:
    w2 = B w1
  Iterate, computing wk+1 = B wk, until the weights converge, that is, until w = Bw.
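
The whole calculation of slides 11-13 can be sketched directly from an adjacency list: divide each page's weight among its outgoing links and iterate w <- Bw a fixed number of times. The six-page graph below is an assumption chosen to match the link counts on the slides (2, 1, 4, 1, 2, 2 links from P1-P6); the exact links cannot be recovered from the transcript.

    def page_ranks(links, iterations=50):
        """links: dict mapping each citing page to the list of pages it cites.
        Returns a weight for every page; a higher weight means a higher rank."""
        pages = sorted(set(links) | {p for cited in links.values() for p in cited})
        out_count = {p: len(links.get(p, [])) or 1 for p in pages}   # links from each page
        w = {p: 1.0 for p in pages}                                  # initially all weights are 1
        for _ in range(iterations):
            new_w = {p: 0.0 for p in pages}
            for citing, cited_pages in links.items():
                share = w[citing] / out_count[citing]    # this citing page's column of B
                for cited in cited_pages:
                    new_w[cited] += share                # accumulate B w
            w = new_w
        return w

    example = {
        "P1": ["P4", "P6"],
        "P2": ["P4"],
        "P3": ["P1", "P4", "P5", "P6"],
        "P4": ["P1"],
        "P5": ["P1", "P4"],
        "P6": ["P2", "P3"],
    }
    print(page_ranks(example))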

  14. Google Ranks
  • w is the principal (highest-eigenvalue) eigenvector of B.
  • It ranks the pages by the links to them, normalized by the number of citations from each page and weighted by the ranking of the citing pages.
  • Google calculates the ranks for all pages (about 450 million) and lists hits in rank order.

  15. Computer Science Research
  • Academic research
  • Industrial R&D
  • Entrepreneurs

  16. Example: Web Search Engines
  Lycos (Mauldin, Carnegie Mellon)
  • Technical basis:
    • Research in text-skimming (Ph.D. thesis)
    • The Pursuit free-text retrieval engine (TREC)
    • Robot exclusion research (private interest)
  • Organizational basis:
    • Center for Machine Translation
    • Grant flexibility (DARPA)

  17. Example: Web Search Engines
  Google (Page and Brin, Stanford)
  • Technical basis:
    • Research in ranking hyperlinks (Ph.D. research)
  • Organizational basis:
    • Grant flexibility (NSF Digital Libraries Initiative)
    • Equipment grant (Hewlett-Packard)

  18. The Internet Graph
  • Theoretical research in graph theory:
    • Six degrees of separation
    • Pareto distributions
  • Algorithms:
    • Hubs and authorities (Kleinberg, Cornell)
  • Empirical data:
    • Commercial (Yahoo!, Google, Alexa, AltaVista, Lycos)
    • Not-for-profit (Internet Archive)

  19. Google Statistics
  • The central system handles 5.5 million searches daily, increasing 20% per month.
  • 2,500 PCs running Linux; 80 terabytes of spinning disk; an average of 30 new machines added per day.
  • The cache holds about 200 million HTML pages.
  • The aim is to crawl the web once per month.
  • 85 people; half are technical; 14 have a Ph.D. in computer science.
  • For comparison, Yahoo! has 100,000,000 registered users and dispatches half a billion pages to users per day.
