WebBase: Building a Web Warehouse
Hector Garcia-Molina, Stanford University
Work with: Sergey Brin, Junghoo Cho, Taher Haveliwala, Jun Hirai, Glen Jeh, Andy Kacsmar, Sep Kamvar, Wang Lam, Larry Page, Andreas Paepcke, Sriram Raghavan, Gary Wesley
The Web • A universal information resource • Model weak, strong agreement • How to exploit it?
WebBase [Diagram: WebBase as a repository of Web pages]
WebBase Goals • Manage very large collections of Web pages • Today: 1500 GB of HTML, 200M pages • Enable large-scale Web-related research • Locally provide a significant portion of the Web • Efficient wide-area Web data distribution
WebBase Remote Users • Berkeley • Columbia • U. Washington • Harvey Mudd • Università degli Studi di Milano • U. of Arizona • California Digital Library • Cornell • U. of Houston • Learning Lab Lower Saxony (L3S) • France Telecom • U. Texas
Outline • Technical Challenges • WebBase Use • The Future
Challenges
• Archiving: “units”, coordination
• IP Management: copy access, link access, access control
• Hidden Web
• Topic-Specific Collection Building
• Scalability: crawling, archive distribution, index construction, storage
• Consistency: freshness, versions
• Dissemination
What is a Crawler?
[Diagram: initial URLs → init → to-visit URLs → get next URL → get page (from web) → extract URLs back into the to-visit queue, tracking visited URLs and storing web pages]
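The crawl loop in the diagram above can be sketched as follows. This is a minimal sketch: `fetch` and `extract_links` are hypothetical stand-ins for a real HTTP client and HTML parser, which a production crawler would supply.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """Basic crawl loop: init with seed URLs, then repeatedly get the
    next URL, get the page, and extract its URLs into the queue.

    fetch(url) -> page content; extract_links(page) -> iterable of hrefs.
    Both are placeholders, not part of the actual WebBase crawler.
    """
    to_visit = deque(seed_urls)   # "to visit urls" queue from the diagram
    visited = set()               # "visited urls"
    pages = {}                    # "web pages" store
    while to_visit and len(pages) < max_pages:
        url = to_visit.popleft()  # get next URL
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)         # get page from the web
        pages[url] = page
        for href in extract_links(page):   # extract URLs
            link = urljoin(url, href)
            if link not in visited:
                to_visit.append(link)
    return pages
```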
Parallel Crawling [Diagram: multiple crawler processes (C) working against the web]
Independent Crawlers [Diagram: two crawler processes, each covering its own site; site 1 with pages a, b, c, d, e and site 2 with pages f, g, h, i]
Partition: Firewall [Diagram: sites 1 and 2 split between crawlers; links crossing the partition are not followed] • URL hash • Site hash • Hierarchical
Partition: Cross-Over [Diagram: crawlers may follow links that cross the partition, crawling pages outside their own partition]
Partition: Exchange [Diagram: crawlers do not follow cross-partition links themselves; they exchange the discovered URLs with the responsible peer]
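Of the partition functions listed above, site hashing keeps all pages of one host with one crawler process. A minimal sketch, assuming a fixed number of crawler processes; the function name and MD5 choice are illustrative, not WebBase's actual scheme:

```python
import hashlib
from urllib.parse import urlsplit

def assign_crawler(url, num_crawlers):
    """Site-hash partitioning: hash the host name, not the full URL,
    so every page on one site maps to the same crawler process and
    cross-partition links between pages of one site never occur."""
    host = urlsplit(url).netloc
    digest = hashlib.md5(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_crawlers
```

With URL hashing instead (hashing the whole URL), pages of one site scatter across processes, producing many more cross-partition links to exchange.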
Coverage vs Overlap [Figure: coverage against overlap for a cross-over crawler; 5 random seeds per C-proc]
WebBase Parallel Crawling [Diagram: on each computer, a coordinator manages crawler processes, each with its own site queues; coordinators communicate with other computers and fetch from the web]
WebBase Parallel Crawling 2 [Figure: CPU utilization (0% to 200%) against number of processes]
Challenges
• Archiving: “units”, coordination
• IP Management: copy access, link access, access control
• Hidden Web
• Topic-Specific Collection Building
• Scalability: crawling, archive distribution, index construction, storage (done)
• Consistency: freshness, versions (next)
• Dissemination
How to Refresh?
[Diagram: page a changes daily, page b changes once a week; the crawler can visit one page per week to refresh the repository]
• How should we visit pages?
• a a a a a a a ...
• b b b b b b b ...
• a b a b a b a b ... [uniform]
• a a a a a a b a a a ... [proportional]
• ?
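The trade-off between visit schedules can be explored with a toy simulation. This is a hypothetical model, not the paper's analysis: each time step, each page goes stale with some probability, the crawler refreshes one page according to a cyclic schedule, and we measure the fraction of samples where local copies are up to date.

```python
import random

def avg_freshness(visit_order, change_rates, steps=10000, seed=0):
    """Average freshness of a small repository under a cyclic
    refresh schedule. change_rates[p] = per-step probability that
    page p changes (toy values); visit_order = cyclic schedule of
    which page to refresh each step."""
    rng = random.Random(seed)
    fresh = {p: True for p in change_rates}
    total = hits = 0
    for t in range(steps):
        for p, rate in change_rates.items():   # pages change at random
            if rng.random() < rate:
                fresh[p] = False
        fresh[visit_order[t % len(visit_order)]] = True  # one visit per step
        for p in change_rates:                 # sample freshness of all copies
            total += 1
            hits += fresh[p]
    return hits / total
```

Running it with different schedules (e.g. `["a", "b"]` for uniform vs `["a", "a", "a", "b"]` for roughly proportional) shows how the schedule shifts freshness between the fast- and slow-changing pages.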
Using WebBase • Fast Page Rank • Complex Queries
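For context on the first item, PageRank is normally computed by power iteration over the link graph. A minimal sketch of plain power iteration (not the accelerated "Fast PageRank" techniques the slide refers to); the dict-based representation is illustrative only:

```python
def pagerank(links, d=0.85, iters=50):
    """Plain power-iteration PageRank.

    links: node -> list of out-link targets (a closed graph).
    d: damping factor. Dangling nodes spread their rank uniformly.
    """
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}       # teleport mass
        for u in nodes:
            out = links[u]
            if out:
                share = d * rank[u] / len(out)      # split rank over out-links
                for v in out:
                    new[v] += share
            else:
                for v in nodes:                     # dangling node
                    new[v] += d * rank[u] / n
        rank = new
    return rank
```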
Structure of the Web • Color the nodes by their domain: red = stanford.edu, green = berkeley.edu, blue = mit.edu
Structure of the Web [Figure: Web graph with nodes colored by domain: berkeley.edu, stanford.edu, mit.edu]
Nested Block Structure of the Web [Figure: link adjacency matrix, rows = from, columns = to, with visible blocks for Stanford and Berkeley]
Stanford WebBase Repository
• Bulk/streaming access: large-scale mining & indexing, e.g., compute PageRank, extract communities
• Complex queries: declarative analysis interface
• Text search: e.g., search for “SARS Symptoms”
Example of a Complex Query
Goal: find universities collaborating with Stanford on mobile networking (over the entire Web)
• Compute S = stanford.edu pages containing the phrase “Mobile networking”
• Compute R = set of all “.edu” domains pointed to by pages in S
• Rank pages in S by PageRank
• Rank domains in R by sum of incoming ranks
• List top 10 domains in R
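The steps of this query can be sketched over an in-memory repository. All three inputs (`pages`, `links`, `pagerank` dicts) are hypothetical stand-ins for WebBase's actual repository and declarative interface:

```python
def top_collaborating_domains(pages, links, pagerank, phrase, k=10):
    """Sketch of the slide's complex query.

    pages: url -> page text; links: url -> list of out-link URLs;
    pagerank: url -> precomputed score. Illustrative dicts only.
    """
    # S = stanford.edu pages containing the phrase
    S = [u for u, text in pages.items()
         if u.startswith("http://stanford.edu") and phrase.lower() in text.lower()]
    # R = ".edu" domains pointed to by pages in S,
    # ranked by the sum of incoming PageRank from S
    scores = {}
    for u in S:
        for v in links.get(u, []):
            domain = v.split("/")[2]            # crude host extraction
            if domain.endswith(".edu"):
                scores[domain] = scores.get(domain, 0.0) + pagerank.get(u, 0.0)
    # top-k domains in R
    return sorted(scores, key=scores.get, reverse=True)[:k]
```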
Supernode Graph
[Diagram: Web graph pages P1–P5 grouped into supernodes N1, N2, N3; each supernode stores an intranode edge list (IntraNode1, IntraNode3) plus superedges Ei,j between supernodes with positive and negative edge lists (SEdgePos1,3, SEdgeNeg3,2)]
Growth of Supernode Graph [Figure: size of supernode graph (MB) against number of pages (millions); 82 MB for 115M pages (830 GB of raw HTML)]
Query Execution Times [Figure: time for navigation operation (secs) on Queries 1 through 6, comparing the S-Node representation, a relational DB, the Connectivity Server, and files of adjacency lists]
Query Optimization [Diagram]
Impact of Cluster-Based Optimization • 35-million page dataset • 600 million links • 300 GB of HTML • 40-45% reduction in query execution times
Conclusion (So Far)
• Web is a universal information resource
• WebBase exploits this resource
• WebBase challenges: scalability, consistency, complex queries...
• The future for WebBase (and clones)??
Will WebBase Scale? [Figure: over time, indexable web content grows; WebBase capacity is projected under optimistic and pessimistic assumptions, starting from today]
Pessimistic Scenario • Specialized WebBases: sports, shopping, ... [Figure: indexable web content outgrows pessimistic WebBase capacity]
Optimistic Scenario • Web in a Box: web delivered on “CD” monthly; search engine handles updates [Figure: optimistic WebBase capacity keeps pace with indexable web content]
Legal Issues? • Is WebBase legal? • copies • links, deep linking • International regulations
Biasing Results • How long will Google, AltaVista, etc. resist “temptations”? • Biasing crawler • Link and content spam
Access Data • WebBase does not capture access patterns [Diagram: web → WebBase; access data → ?]
Semantic Web? • Will semantic tags be generated? • By whom? • Agreement? [Diagram: web with semantic tags → WebBase?]
Future Technical Challenges • Incremental Updates • Query Optimization • Crawling Deep Web
Final Conclusion • Many challenges ahead... • Additional information: Google “Stanford WebBase”