CS 349: WebBase 1. What the WebBase can and can’t do?
Summary • What is in the WebBase • Performance Considerations • Reliability Considerations • Other sources of data
WebBase Repository • 25 million web pages • 150 GB (50 GB compressed) • spread across roughly 30 disks
What kind of web pages? • Everything you can imagine • Infinite web pages (truncated at 100 KB) • 404 error pages • Very little correct HTML
Duplicate Web Pages • Duplicate sites: crawl root pages first, find duplicate roots, and assume the rest of the site is duplicated too; includes duplicate hierarchies off of the main page and mirror sites • Duplicate pages • Near-duplicate pages
Shiva’s Test Results • 36% Duplicates • 48% Near Duplicates • Largest Sets of Duplicates: • TUCOWS (100) • MS IE Server Manuals (90) • Unix Help Pages (75) • RedHat Linux Manual (55) • Java API Doc (50)
Order of Web Pages • First half million are root pages • After that, pages in PageRank order • Roughly by importance
Structure of Data • Each record: magic number (4 bytes), packet length (4 bytes), packet (~2 KB) • The packet is compressed • The packet contains: docID, URL, HTTP headers, HTML data
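As a rough illustration, a reader for this framing might look like the sketch below; the big-endian byte order and zlib compression are assumptions for the example, and the internal layout of the decompressed packet (docID, URL, headers, HTML) is left opaque since the slides do not specify it.

```python
import struct
import zlib

def read_records(path):
    """Iterate over (magic, packet) records in one repository file.

    The 4-byte magic / 4-byte length framing is from the slides; the
    byte order (big-endian) and compression scheme (zlib) are assumed
    here purely for illustration.
    """
    with open(path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break                      # end of file
            magic, length = struct.unpack(">II", header)
            packet = zlib.decompress(f.read(length))
            # The packet holds docID, URL, HTTP headers, and HTML data;
            # its internal layout is not specified in the slides.
            yield magic, packet
```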
Performance Issues: An Example • One disk seek per document: • 10 ms seek latency + 10 ms rotational latency • x ms read latency + x ms OS overhead • y ms processing • Realistically ~50 ms per document = 20 docs per second • 25 million / 20 docs per second = 1,250,000 seconds ≈ 2 weeks (too slow)
How fast does it have to be? • Answer: 4 ms per doc = 250 docs per second • 25 million / 250 = 100,000 seconds ≈ 1.2 days • Reading + uncompressing + parsing ≈ 3 to 4 ms per document • So there is not much room left for processing
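The same arithmetic, written out as a quick back-of-the-envelope calculation:

```python
DOCS = 25_000_000           # pages in the repository

def crawl_days(ms_per_doc):
    """Total processing time in days at a given per-document cost."""
    return DOCS * ms_per_doc / 1000 / 86_400   # ms -> s -> days

print(crawl_days(50))       # seek-bound: ~14.5 days (about 2 weeks)
print(crawl_days(4))        # streaming at 250 docs/s: ~1.2 days
```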
How can you do something complicated? • Really fast processing to generate some intermediate smaller results. • Run complex processing over smaller results. • Example: Duplicate Detection • Compute shingles from all documents • Find pairs of documents that share shingles
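A toy version of that two-phase pattern for duplicate detection is sketched below; the shingle length, hash, and similarity threshold are illustrative choices, not the parameters actually used on WebBase.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def shingles(text, k=4):
    """Phase 1 (fast pass): hash every k-word window of a document."""
    words = text.split()
    return {
        hashlib.md5(" ".join(words[i:i + k]).encode()).hexdigest()[:16]
        for i in range(max(len(words) - k + 1, 1))
    }

def near_duplicate_pairs(docs, threshold=0.9):
    """Phase 2: work only on the much smaller shingle sets.

    docs is a dict of docID -> text; returns pairs of docIDs whose
    shingle sets have Jaccard similarity >= threshold.
    """
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    index = defaultdict(set)              # shingle -> docIDs containing it
    for doc_id, s in sets.items():
        for sh in s:
            index[sh].add(doc_id)
    candidates = set()
    for ids in index.values():            # only compare docs sharing a shingle
        candidates.update(combinations(sorted(ids), 2))
    return [(a, b) for a, b in candidates
            if len(sets[a] & sets[b]) / len(sets[a] | sets[b]) >= threshold]
```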
Bulk Processing of Large Result Sets • Example: resolving anchors • Resolve URLs and save (from, to) pairs in ASCII • Compute a 64-bit checksum of the “to” URLs • Bulk-merge against a checksum-to-docID table
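A minimal sketch of that checksum join, with an in-memory dict standing in for the sorted on-disk checksum-to-docID table that the real bulk merge would stream through (the actual 64-bit checksum function is not specified in the slides, so one is improvised here):

```python
import zlib

def checksum64(url):
    """A 64-bit URL fingerprint built from two CRC32s (an improvised
    stand-in for whatever checksum WebBase actually used)."""
    return (zlib.crc32(url.encode()) << 32) | zlib.crc32(url[::-1].encode())

def resolve_anchors(from_to_pairs, url_to_docid):
    """Join anchor targets against a checksum -> docID table.

    from_to_pairs: iterable of (from_docID, to_URL) pairs, already
                   resolved to absolute URLs and saved in ASCII.
    url_to_docid:  mapping of known URL -> docID.
    """
    table = {checksum64(u): d for u, d in url_to_docid.items()}
    for from_docid, to_url in from_to_pairs:
        to_docid = table.get(checksum64(to_url))
        if to_docid is not None:
            yield from_docid, to_docid     # resolved link edge
```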
Reliability - Potential Sources of Problems • Source code bugs • Hardware failures • OS failures • Running out of resources
Software Engineering Guidelines • Number of bugs seen ~ log(size of dataset) • Not just your bugs • OS bugs • Disk OS bugs • Generate incremental results
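For "generate incremental results", one simple pattern is to append output in batches and record progress, so a crash partway through a multi-hour run costs at most one batch; everything below (the file names, the `process` callback, the batch size) is hypothetical.

```python
import json
import os

def run_with_checkpoints(doc_ids, process, out_path, batch=100_000):
    """Append one result line per document, checkpointing every `batch` docs.

    On restart, documents before the last checkpoint are skipped; at most
    one partial batch may need to be reprocessed (and deduplicated later).
    """
    ckpt_path = out_path + ".ckpt"
    done = int(open(ckpt_path).read()) if os.path.exists(ckpt_path) else 0
    with open(out_path, "a") as out:
        for i, doc_id in enumerate(doc_ids):
            if i < done:
                continue                       # already finished before a crash
            out.write(json.dumps(process(doc_id)) + "\n")
            if (i + 1) % batch == 0:
                out.flush()
                with open(ckpt_path, "w") as c:
                    c.write(str(i + 1))        # record progress
```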
Other Available Data • Link Graph of the Web • List of PageRanks • List of URLs
Link Graph of the Web • Stored as (from docID, to docID) pairs • Try red bars on Google to find backlinks • Interesting information
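Given the edge list as (from docID, to docID) pairs, backlinks are just the graph inverted; a toy sketch:

```python
from collections import defaultdict

def backlinks(edges):
    """Invert (from docID, to docID) edges into to docID -> [from docIDs]."""
    incoming = defaultdict(list)
    for src, dst in edges:
        incoming[dst].append(src)
    return incoming

# backlinks([(1, 2), (3, 2), (2, 1)])[2] == [1, 3]
```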
What is PageRank • Measure of “importance” • You are important if important things point to you • Random surfer model
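A standard power-iteration sketch of the random surfer model (the damping factor, iteration count, and dangling-page handling here are conventional textbook choices, not necessarily what WebBase used):

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Compute PageRank for a graph given as docID -> list of out-link docIDs.

    Random surfer model: with probability `damping` the surfer follows a
    random out-link, otherwise jumps to a random page.
    """
    nodes = set(out_links) | {d for links in out_links.values() for d in links}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1.0 - damping) / n for node in nodes}
        for src, links in out_links.items():
            if not links:
                continue              # dangling pages' mass dropped in this sketch
            share = damping * rank[src] / len(links)
            for dst in links:
                new[dst] += share
        rank = new
    return rank

# Pages pointed to by "important" pages end up with high rank, e.g.
# pagerank({"a": ["b"], "b": ["c"], "c": ["b"]})
```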
Uncrawled URLs • Image links • MailTo links • CGI links • Plain uncrawled HTML links
Summary • WebBase has lots of web pages • very heterogeneous and weird • Performance Considerations • code should be very very fast • use bulk processing • Reliability Considerations • write out intermediate results • Auxiliary data