WebBase: Building a Web Warehouse
Hector Garcia-Molina, Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
The Web • A universal information resource • Model weak, strong agreement • How to exploit it?
WebBase [Diagram: WebBase as a repository of Web pages]
WebBase Goals • Manage very large collections of Web pages • Today: 1500 GB of HTML, 200M pages • Enable large-scale Web-related research • Locally provide a significant portion of the Web • Efficient wide-area Web data distribution
WebBase Remote Users • Berkeley • Columbia • U. Washington • Harvey Mudd • Università degli Studi di Milano • U. of Arizona • California Digital Library • Cornell • U. of Houston • Learning Lab Lower Saxony (L3S) • France Telecom • U. Texas
Outline • Technical Challenges • WebBase Use • The Future
Challenges
• Archiving: “units”, coordination
• IP Management: copy access, link access, access control
• Hidden Web
• Topic-Specific Collection Building
• Scalability: crawling, archive distribution, index construction, storage
• Consistency: freshness, versions
• Dissemination
What is a Crawler?
[Diagram: initial URLs → init → to-visit URLs → get next URL → get page (from web) → extract URLs back into the to-visit queue, tracking visited URLs and storing web pages]
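The crawl loop in the diagram above can be sketched as follows. This is a minimal sketch: `fetch` and `extract_links` are hypothetical stand-ins for a real HTTP client and HTML parser, which a production crawler would supply.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """Basic crawl loop: init with seed URLs, then repeatedly get the
    next URL, get the page, and extract its URLs into the queue.

    fetch(url) -> page content; extract_links(page) -> iterable of hrefs.
    Both are placeholders, not part of the actual WebBase crawler.
    """
    to_visit = deque(seed_urls)   # "to visit urls" queue from the diagram
    visited = set()               # "visited urls"
    pages = {}                    # "web pages" store
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()  # get next URL
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)         # get page from the web
        pages[url] = page
        for href in extract_links(page):   # extract URLs
            link = urljoin(url, href)
            if link not in visited:
                to_visit.append(link)
    return pages
```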
Parallel Crawling [Diagram: multiple crawler processes (C) working against the web]
Independent Crawlers [Diagram: two crawler processes, each covering its own site; site 1 with pages a, b, c, d, e and site 2 with pages f, g, h, i]
Partition: Firewall [Diagram: sites 1 and 2 split between crawlers; links crossing the partition are not followed] • URL hash • Site hash • Hierarchical
Partition: Cross-Over [Diagram: crawlers may follow links that cross the partition, crawling pages outside their own partition]
Partition: Exchange [Diagram: crawlers do not follow cross-partition links themselves; they exchange the discovered URLs with the responsible peer]
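Of the partition functions listed above, site hashing keeps all pages of one host with one crawler process. A minimal sketch, assuming a fixed number of crawler processes; the function name and MD5 choice are illustrative, not WebBase's actual scheme:

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Site-hash partitioning: hash the host name, not the full URL,
    so every page on one site maps to the same crawler process and
    cross-partition links between pages of one site never occur."""
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers
```

With URL hashing instead (hashing the whole URL), pages of one site scatter across processes, producing many more cross-partition links to exchange.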
Coverage vs Overlap [Figure: coverage against overlap for a cross-over crawler; 5 random seeds per C-proc]
WebBase Parallel Crawling [Diagram: on each computer, a coordinator manages crawler processes, each with its own site queues; coordinators communicate with other computers and fetch from the web]
WebBase Parallel Crawling 2 [Figure: CPU utilization (0% to 200%) against number of processes]
Challenges
• Archiving: “units”, coordination
• IP Management: copy access, link access, access control
• Hidden Web
• Topic-Specific Collection Building
• Scalability: crawling, archive distribution, index construction, storage (done)
• Consistency: freshness, versions (next)
• Dissemination
How to Refresh?
[Diagram: page a changes daily, page b changes once a week; the crawler can visit one page per week to refresh the repository]
• How should we visit pages?
• a a a a a a a ...
• b b b b b b b ...
• a b a b a b a b ... [uniform]
• a a a a a a b a a a ... [proportional]
• ?
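The trade-off between visit schedules can be explored with a toy simulation. This is a hypothetical model, not the paper's analysis: each time step, each page goes stale with some probability, the crawler refreshes one page according to a cyclic schedule, and we measure the fraction of samples where local copies are up to date.

```python
import random

def avg_freshness(visit_order, change_rates, steps=10000, seed=0):
    """Average freshness of a small repository under a cyclic
    refresh schedule. change_rates[p] = per-step probability that
    page p changes (toy values); visit_order = cyclic schedule of
    which page to refresh each step."""
    rng = random.Random(seed)
    fresh = {p: True for p in change_rates}
    total = hits = 0
    for t in range(steps):
        for p, rate in change_rates.items():   # pages change at random
            if rng.random() < rate:
                fresh[p] = False
        fresh[visit_order[t % len(visit_order)]] = True  # one visit per step
        for p in change_rates:                 # sample freshness of all copies
            total += 1
            hits += fresh[p]
    return hits / total
```

Running it with different schedules (e.g. `["a", "b"]` for uniform vs `["a", "a", "a", "b"]` for roughly proportional) shows how the schedule shifts freshness between the fast- and slow-changing pages.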
Using WebBase • Fast Page Rank • Complex Queries
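For context on the first item, PageRank is normally computed by power iteration over the link graph. A minimal sketch of plain power iteration (not the accelerated "Fast PageRank" techniques the slide refers to); the dict-based representation is illustrative only:

```python
def pagerank(links, d=0.85, iters=50):
    """Plain power-iteration PageRank.

    links: node -> list of out-link targets (a closed graph).
    d: damping factor. Dangling nodes spread their rank uniformly.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}       # teleport mass
        for u in nodes:
            out = links[u]
            if out:
                share = d * rank[u] / len(out)      # split rank over out-links
                for v in out:
                    new[v] += share
            else:
                for v in nodes:                     # dangling node
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```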
Structure of the Web • Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
Structure of the Web [Figure: Web graph with nodes colored by domain: berkeley.edu, stanford.edu, mit.edu]
Nested Block Structure of the Web [Figure: link adjacency matrix, rows = from, columns = to, with visible blocks for Stanford and Berkeley]
Stanford WebBase Repository
• Bulk/streaming access: large-scale mining & indexing, e.g., compute PageRank, extract communities
• Complex queries: declarative analysis interface
• Text search: e.g., search for “SARS Symptoms”
Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking (over the entire Web)
• Compute S = stanford.edu pages containing the phrase “Mobile networking”
• Compute R = set of all “.edu” domains pointed to by pages in S
• Rank pages in S by PageRank
• Rank domains in R by sum of incoming ranks
• List top 10 domains in R
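The steps of this query can be sketched over an in-memory repository. All three inputs (`pages`, `links`, `pagerank` dicts) are hypothetical stand-ins for WebBase's actual repository and declarative interface:

```python
def top_collaborating_domains(pages, links, pagerank, phrase, k=10):
    """Sketch of the slide's complex query.

    pages: url -> page text; links: url -> list of out-link URLs;
    pagerank: url -> precomputed score. Illustrative dicts only.
    """
    # S = stanford.edu pages containing the phrase
    S = [u for u, text in pages.items()
         if u.startswith("http://stanford.edu") and phrase.lower() in text.lower()]
    # R = ".edu" domains pointed to by pages in S,
    # ranked by the sum of incoming PageRank from S
    scores = {}
    for u in S:
        for v in links.get(u, []):
            domain = v.split("/")[2]            # crude host extraction
            if domain.endswith(".edu"):
                scores[domain] = scores.get(domain, 0.0) + pagerank.get(u, 0.0)
    # top-k domains in R
    return sorted(scores, key=scores.get, reverse=True)[:k]
```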
Supernode Graph
[Diagram: Web graph pages P1–P5 grouped into supernodes N1, N2, N3; each supernode stores an intranode edge list (IntraNode1, IntraNode3) plus superedges Ei,j between supernodes with positive and negative edge lists (SEdgePos1,3, SEdgeNeg3,2)]
Growth of Supernode Graph [Figure: size of supernode graph (MB) against number of pages (millions); 82 MB for 115M pages (830 GB of raw HTML)]
Query Execution Times [Figure: time for navigation operation (secs) on Queries 1 through 6, comparing the S-Node representation, a relational DB, the Connectivity Server, and files of adjacency lists]
Query Optimization [Diagram]
Impact of Cluster-Based Optimization • 35-million page dataset • 600 million links • 300 GB of HTML • 40-45% reduction in query execution times
Conclusion (So Far)
• Web is a universal information resource
• WebBase exploits this resource
• WebBase challenges: scalability, consistency, complex queries...
• The future for WebBase (and clones)??
Will WebBase Scale? [Figure: over time, indexable web content grows; WebBase capacity is projected under optimistic and pessimistic assumptions, starting from today]
Pessimistic Scenario • Specialized WebBases: sports, shopping, ... [Figure: indexable web content outgrows pessimistic WebBase capacity]
Optimistic Scenario • Web in a Box: web delivered on “CD” monthly; search engine handles updates [Figure: optimistic WebBase capacity keeps pace with indexable web content]
Legal Issues? • Is WebBase legal? • copies • links, deep linking • International regulations
Biasing Results • How long will Google, AltaVista, etc. resist “temptations”? • Biasing crawler • Link and content spam
Access Data • WebBase does not capture access patterns [Diagram: web → WebBase; access data → ?]
Semantic Web? • Will semantic tags be generated? • By whom? • Agreement? [Diagram: web with semantic tags → WebBase?]
Future Technical Challenges • Incremental Updates • Query Optimization • Crawling Deep Web
Final Conclusion • Many challenges ahead... • Additional information: Google “Stanford WebBase”