Characterization of Search Engine Caches
Frank McCown & Michael L. Nelson
Old Dominion University, Norfolk, Virginia, USA
Arlington, Virginia, May 22, 2007
Outline • Preserving and caching the Web • Lazy preservation • Search engine sampling experiment
Image credits:
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
Preservation: Fortress Model 5 easy steps for preservation: • Get a lot of $ • Buy a lot of disks, machines, tapes, etc. • Hire an army of staff • Load a small amount of data • “Look upon my archive ye Mighty, and despair!” Slide from: http://www.cs.odu.edu/~mln/pubs/differently.ppt Image from: http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
Internet Archive? How much of the Web is indexed? Estimates from “The Indexable Web is More than 11.5 billion pages” by Gulli and Signorini (WWW’05)
Alternative Models of Preservation • Lazy Preservation • Let Google, IA, et al. preserve your website • Just-In-Time Preservation • Wait for it to disappear first, then recover a "good enough" version • Shared Infrastructure Preservation • Push your content to sites that might preserve it • Web Server Enhanced Preservation • Use Apache modules to create archival-ready resources
Cached PDF • http://www.fda.gov/cder/about/whatwedo/testtube.pdf [figure: the canonical PDF shown alongside the MSN, Yahoo, and Google cached versions]
• Frank McCown, Amine Benjelloun, and Michael L. Nelson. Brass: A Queueing Manager for Warrick. 7th International Web Archiving Workshop (IWAW 2007). To appear.
• Frank McCown, Norou Diawara, and Michael L. Nelson. Factors Affecting Website Reconstruction from the Web Infrastructure. ACM/IEEE Joint Conference on Digital Libraries (JCDL 2007). To appear.
• Frank McCown and Michael L. Nelson. Evaluation of Crawling Policies for a Web-Repository Crawler. 17th ACM Conference on Hypertext and Hypermedia (HYPERTEXT 2006).
• Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers. 8th ACM International Workshop on Web Information and Data Management (WIDM 2006).
Available for download at http://www.cs.odu.edu/~fmccown/warrick/
Experiment: Sample Search Engine Caches • Feb 2006 • Submitted 5,200 one-term queries to Ask, Google, MSN, and Yahoo • Randomly selected 1 result from the first 100 • Downloaded each resource and its cached page • Checked for overlap with the Internet Archive
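A minimal sketch (not the authors' code) of the per-query sampling step described above. The helpers search_results() and fetch() are hypothetical stand-ins for an engine-specific query interface and a plain HTTP download; the study itself queried Ask, Google, MSN, and Yahoo.

import random

def search_results(engine, query, max_results=100):
    """Hypothetical: return up to max_results hits, each with .url and .cached_url."""
    raise NotImplementedError("engine-specific query interface goes here")

def fetch(url):
    """Hypothetical: download the resource at url and return its contents."""
    raise NotImplementedError("plain HTTP GET goes here")

def sample_one(engine, query):
    hits = search_results(engine, query)          # at most the first 100 results
    if not hits:
        return None
    hit = random.choice(hits)                     # randomly select 1 result
    return fetch(hit.url), fetch(hit.cached_url)  # live resource and its cached page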
Cached Resource Size Distributions [figure: distributions of cached resource sizes, with apparent cutoffs near 215 KB, 976 KB, 977 KB, and 1 MB]
Cache Freshness [figure: timeline showing a cached resource that is fresh when crawled and cached, becomes stale once it changes on the web server, and is fresh again after it is re-crawled and cached]
Staleness = max(0, Last-Modified HTTP header - cached date)
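As a concrete illustration, here is a minimal Python sketch (not from the talk) of that staleness computation, assuming the Last-Modified header and the cache date have already been extracted for a resource:

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def staleness(last_modified_header, cached_date):
    """How far the cached copy lags the live copy; never negative."""
    last_modified = parsedate_to_datetime(last_modified_header)
    return max(timedelta(0), last_modified - cached_date)

# Example: the resource changed on the server one day after it was cached,
# so the cached copy is one day stale.
print(staleness("Wed, 22 Feb 2006 10:00:00 GMT",
                datetime(2006, 2, 21, 10, 0, tzinfo=timezone.utc)))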
Cache Staleness • 46% of resources had a Last-Modified header • 71% of those also had a cached date • 16% were at least 1 day stale
Similarity • Compared each live web resource with its cached counterpart using shingling • Shingling measures document similarity as the ratio of shared, contiguous subsequences of tokens (shingles) to all unique shingles in the two documents • 19% of all resources had identical shingles • 21% of HTML resources had identical shingles • Resources shared 72% of their shingles on average
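A minimal sketch of that shingle comparison, assuming a window of 10 tokens (the window size is an assumption, not necessarily the one used in the study):

def shingles(text, w=10):
    """Return the set of contiguous w-token subsequences (shingles) in text."""
    tokens = text.split()
    if not tokens:
        return set()
    if len(tokens) <= w:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def shingle_similarity(live, cached, w=10):
    """Ratio of shared shingles to all unique shingles in the two documents."""
    a, b = shingles(live, w), shingles(cached, w)
    if not a and not b:
        return 1.0  # two empty documents are trivially identical
    return len(a & b) / len(a | b)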
Conclusions • Ask is not useful (only 9% of its indexed resources were cached) • Approximately 85% of indexed content is available in SE caches • All search engines appear to cache the various TLDs and MIME types at roughly the same rate • IA contains only 46% of the resources available in SE caches • Approximately 7% of indexed resources are missing from both SE caches and IA
Thank You Frank McCown fmccown@cs.odu.edu http://www.cs.odu.edu/~fmccown/