
Web Characterization

Explore the early days of the web, web crawl challenges, duplicate content detection, and the Deep Web. Dive into the evolution of web content, challenges of web crawling, and the vast world beyond surface web pages.

bmary

Presentation Transcript


  1. Web Characterization Week 11 LBSC 690 Information Technology

  2. The Why of the Web (in 1995) • Affordable storage • 300,000 words/$ • Adequate backbone capacity • 25,000 simultaneous transfers • Adequate “last mile” bandwidth • 1 second/screen • Display capability • 10% of US population • Effective search capabilities • Lycos, Yahoo

  3. Defining the Web • HTTP, HTML, or URL? • Static, dynamic or streaming? • Public, protected, or internal?

  4. Number of Web Sites

  5. Discussion Topic: What’s a Web “Site”? • OCLC counted any server at port 80 • Misses many servers at other ports • Some servers host unrelated content • Geocities • Some content requires specialized servers • rtsp

  6. Crawling the Web

  7. Web Crawl Challenges • Discovering “islands” and “peninsulas” • Duplicate and near-duplicate content • 30-40% of total content • Server and network loads • Dynamic content generation • Link rot • Changes at 1% per week • Temporary server interruptions

  8. Link Structure of the Web

  9. Duplicate Detection • Structural • Identical directory structure (e.g., mirrors, aliases) • Syntactic • Identical bytes • Identical markup (HTML, XML, …) • Semantic • Identical content • Similar content (e.g., with a different banner ad) • Related content (e.g., translated)
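The syntactic and semantic cases on the duplicate detection slide can be sketched in a few lines. This is a minimal illustration, not the method the slides have in mind: it assumes a SHA-256 digest for the "identical bytes" case and word shingles compared by Jaccard similarity for the "similar content" case. The function names, the shingle size of 3, and the two sample pages are all hypothetical.

```python
import hashlib
import re


def byte_fingerprint(data: bytes) -> str:
    """Syntactic check: identical bytes always give identical digests."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical pages that differ only by a banner-ad prefix.
page_a = "Breaking news: the web grew again this year according to a new survey"
page_b = "Buy now! " + page_a

same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
similarity = jaccard(shingles(page_a), shingles(page_b))

print("identical bytes:", same_bytes)
print("shingle similarity:", round(similarity, 3))
```

The byte comparison fails as soon as a banner is added, while the shingle similarity stays high, which is why near-duplicate detection needs something looser than a hash of the raw bytes.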

  10. Robots Exclusion Protocol • Requires voluntary compliance by crawlers • Exclusion by site • Create a robots.txt file at the server’s top level • Indicate which directories not to crawl • Exclusion by document (in HTML head) • Not implemented by all crawlers • <meta name="robots" content="noindex,nofollow">
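A compliant crawler checks the site-level exclusion rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on an in-memory robots.txt; the file contents, the `ExampleCrawler` user-agent string, and the example.com URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt that excludes two directories for all crawlers.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())  # parse from text instead of fetching

allowed_public = parser.can_fetch("ExampleCrawler", "http://example.com/index.html")
allowed_private = parser.can_fetch("ExampleCrawler", "http://example.com/private/data.html")

print("index.html allowed:", allowed_public)
print("private/data.html allowed:", allowed_private)
```

Nothing in the protocol enforces the answer: a crawler that never calls `can_fetch` can still request the excluded pages, which is what "voluntary compliance" means in practice.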

  11. Hands on: The Internet Archive • alexa.com Web crawls since 1997 • http://archive.org • Check out Maryland’s Web site in 1997 • Check out the history of your favorite site

  12. Discussion Point • Can we save everything? • Should we? • Do people have a right to remove things?

  13. The “Deep Web” • Dynamic pages, generated from databases • Not easily discovered using crawling • Perhaps 400-500 times larger than surface Web • Fastest growing source of new information

  14. Content of the Deep Web

  15. Deep Web • 60 Deep Sites Exceed Surface Web by 40 Times
      Name | Type | URL | Web Size (GBs)
      National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
      NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
      National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
      Alexa | Public (partial) | http://www.alexa.com/ | 15,860
      Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
      MP3.com | Public | http://www.mp3.com/ |

  16. Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

  17. Global Internet Users Native speakers, Global Reach projection for 2004 (as of Sept, 2003)

  18. Global Internet Users Web Pages Native speakers, Global Reach projection for 2004 (as of Sept, 2003)

  19. World Trade in 2001 Source: World Trade Organization

  20. European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

  21. Blogs Doubling • 18.9 Million Weblogs Tracked • Doubling in size approx. every 5 months • Consistent doubling over the last 36 months
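To get a feel for what "doubling every 5 months for 36 months" implies, the figures on the blog slide can be run through the usual exponential-growth formula. This is only a back-of-the-envelope sketch: it assumes the doubling period was exactly 5 months throughout, which the slide states only approximately, and the variable names are illustrative.

```python
tracked_blogs = 18.9e6     # weblogs tracked, from the slide
doubling_months = 5        # approximate doubling period, from the slide
window_months = 36         # period of consistent doubling, from the slide

# Number of doublings in the window and the resulting growth factor.
doublings = window_months / doubling_months
growth_factor = 2 ** doublings

# Implied count at the start of the window, if the doubling were exact.
implied_start = tracked_blogs / growth_factor

print(f"doublings: {doublings:.1f}")
print(f"growth factor: {growth_factor:.0f}x")
print(f"implied blogs 36 months earlier: {implied_start:,.0f}")
```

Under that assumption the blogosphere grew by roughly two orders of magnitude over the three years.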

  22. Challenge: Fight, or Embrace? • Blue = Mainstream Media • Red = Blog

  23. Daily Posting Volume • 1.2 Million legitimate Posts/Day • Spam posts marked in red • On average, additional 5.8% are spam posts • Some spam spikes as high as 18% • Events labeled on the chart: Katrina, London Bombings, Justice O’Connor, Live 8 Concerts, Deepthroat Revealed, Kryptonite Lock Controversy, Newsweek Koran, Schiavo Dies, US Election Day, Superbowl, Indian Ocean Tsunami

  24. A Web of Speech?

  25. Rethinking the Spoken Word • Speech is better for some things than writing • Spoken bits are as persistent as written bits • Storage cost is 80 times that of text • Disk cost falls by a factor of 80 in ~16 years • If speech is searchable, we will keep lots of it
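The claim that disk cost falls by a factor of 80 in about 16 years can be turned into a halving time and an annual rate of decline. The sketch assumes a constant exponential decline over the whole period, which the slide does not state; the variable names are illustrative.

```python
import math

cost_factor = 80   # speech costs 80x text to store; disk cost falls 80x
years = 16         # approximate period for an 80x fall, from the slide

# With constant exponential decline, cost(t) = cost(0) * cost_factor ** (-t / years).
halving_time_years = years * math.log(2) / math.log(cost_factor)
annual_decline = 1 - cost_factor ** (-1 / years)

print(f"cost halves about every {halving_time_years:.2f} years")
print(f"annual decline is about {annual_decline:.1%}")
```

Read this way, storing speech in a given year costs about what storing the same words as text cost 16 years earlier.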

  26. A Little Math • Collectable spoken words ≈ 10 Tw/day • 1 billion users * 100 words/min * 200 min/day / 2 • Compressed speech ≈ 2 words/kiloByte • (100/60 w/sec) / (6.5 kb/sec / 8 b/B) • Required storage ≈ 5 PetaBytes/day

  27. A Little Math • Collectable spoken words ≈ 10 Tw/day • 1 billion users * 100 words/min * 200 min/day / 2 • Compressed speech ≈ 2 words/kiloByte • (100/60 w/sec) / (6.5 kb/sec / 8 b/B) • Required storage ≈ 5 PetaBytes/day • Storage array sales > 5 PB/day • 457 PB in 2Q 2005 (increasing 59% per year) • $22/person/year (decreasing at 31%/year) Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
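The arithmetic on the "A Little Math" slides can be checked step by step. The sketch below uses only the numbers given on the slides; it assumes 1 kilobyte = 1000 bytes and a 91-day second quarter, and it obtains words per kilobyte by dividing the speaking rate by the byte rate of the 6.5 kb/sec codec. The factor of 2 in the first step is taken from the slide as given.

```python
users = 1_000_000_000      # 1 billion users
words_per_min = 100        # speaking rate
talk_min_per_day = 200     # minutes of speech per user per day

# Collectable spoken words per day (the slide divides by 2).
words_per_day = users * words_per_min * talk_min_per_day / 2

# Compressed speech density: words per second over kilobytes per second.
words_per_sec = words_per_min / 60
kilobytes_per_sec = 6.5 / 8            # 6.5 kilobits/sec at 8 bits per byte
words_per_kilobyte = words_per_sec / kilobytes_per_sec

# Required storage per day, assuming 1 kB = 1000 bytes and 1 PB = 1e15 bytes.
petabytes_per_day = words_per_day / words_per_kilobyte * 1000 / 1e15

# Storage array sales: 457 PB in 2Q 2005, assuming a 91-day quarter.
sales_pb_per_day = 457 / 91

print(f"words/day: {words_per_day:.1e}")
print(f"words/kB: {words_per_kilobyte:.2f}")
print(f"required storage: {petabytes_per_day:.2f} PB/day")
print(f"storage array sales: {sales_pb_per_day:.2f} PB/day")
```

The unrounded requirement comes out slightly under 5 PB/day because the density is a little over 2 words per kilobyte, and the 2Q 2005 sales figure works out to just over 5 PB/day, consistent with the slide's comparison.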

  28. Human History: Oral Tradition, Writing • Human Future: Writing and Speech

  29. Hands On: Speech on the Web • singingfish.com • blinkx.com • ocw.mit.edu • podcasts.yahoo.com
