Clustering Web Content for Efficient Replication

Yan Chen, Lili Qiu*, Weiyu Chen, Luan Nguyen, Randy H. Katz • EECS Department, UC Berkeley • *Microsoft Research

Motivation • Amazing growth in WWW traffic • Daily growth of roughly 7M Web pages • Annual growth of 200% predicted for the next 4 years • Content Distribution Networks (CDNs) commercialized to improve Web performance • Un-cooperative pull-based replication • Paradigm shift: cooperative push is more cost-effective • Pushing replicas with greedy algorithms can achieve close-to-optimal performance [JJKRS01, QPV01] • Improves availability during flash crowds and disasters • Orthogonal issue: scalability • Per Website? Per URL? -> Clustering! • Clustering based on aggregated clients’ access patterns • Adapt to users’ dynamic access patterns • Incremental clustering (online and offline)

Outline • Motivation • Architecture • Related Work • Problem Formulation • Simulation Methodology • Granularity of Replication • Dynamic Clustering and Replication • Conclusions

Conventional CDN: Un-cooperative Pull • 1. GET request • 2. Request for hostname resolution • 3. Reply: local CDN server IP address • 4. Local CDN server IP address • 5. GET request • 6. GET request if cache miss • 7. Response • 8. Response • Big waste of replication! • [Figure: Client 1 and Client 2 in ISP 1 and ISP 2, each with a local DNS server and a local CDN server; CDN name server; Web content server]

Cooperative Push-based CDN • 0. Push replicas • 1. GET request • 2. Request for hostname resolution • 3. Reply: nearby replica server or Web server IP address • 4. Redirected server IP address • 5. GET request / GET request if no replica yet • 6. Response • Significantly reduces the # of replicas and, consequently, the update cost (only 4% of un-coop pull) • [Figure: Client 1 and Client 2 in ISP 1 and ISP 2, each with a local DNS server and a local CDN server; CDN name server; Web content server]

Related Work • Much existing work models replica placement as an NP-hard problem and proposes greedy algorithms • Ignores the scalability problem • Clustering of Web content based on individuals’ access patterns for • Pre-fetching, Web organization, etc. • Little work on the dynamics of replica placement / clustering

Problem Formulation • Subject to the total replication cost (e.g., # of URL replicas) • Find a scalable, adaptive replication strategy to reduce avg access cost
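The formulation above can be made concrete with a small greedy placement sketch. This is an illustration only, not the authors' algorithm: the data layout (`dist[client][server]`, `demand[client][url]`), the single `origin` server, and the function names are all assumptions made for the example. Each step adds the one (URL, server) replica that lowers the total access cost the most, until the replica budget is used up.

```python
def total_cost(dist, demand, placement, origin):
    """Total access cost: each request pays the distance to the closest copy.

    dist[c][s]   -- assumed cost from client c to server s
    demand[c][u] -- assumed number of requests from client c for URL u
    placement[u] -- set of servers holding a replica of URL u
    origin       -- server that always holds every URL
    """
    cost = 0.0
    for c, row in enumerate(demand):
        for u, reqs in enumerate(row):
            copies = placement[u] | {origin}
            cost += reqs * min(dist[c][s] for s in copies)
    return cost


def greedy_placement(dist, demand, origin, budget):
    """Place at most `budget` URL replicas, one greedy step at a time."""
    n_urls = len(demand[0])
    n_servers = len(dist[0])
    placement = [set() for _ in range(n_urls)]
    for _ in range(budget):
        base = total_cost(dist, demand, placement, origin)
        best_gain, best = 0.0, None
        for u in range(n_urls):
            for s in range(n_servers):
                if s == origin or s in placement[u]:
                    continue
                # Tentatively add the replica and measure the cost reduction.
                placement[u].add(s)
                gain = base - total_cost(dist, demand, placement, origin)
                placement[u].remove(s)
                if gain > best_gain:
                    best_gain, best = gain, (u, s)
        if best is None:  # no remaining replica reduces the cost
            break
        placement[best[0]].add(best[1])
    return placement


# Toy instance: 3 nodes that act as both clients and servers, 2 URLs.
dist = [[0, 1, 4], [1, 0, 3], [4, 3, 0]]
demand = [[0, 0], [1, 0], [10, 2]]
print(greedy_placement(dist, demand, origin=0, budget=2))
```

Replacing URLs with clusters of URLs in `demand` turns the same loop into cluster-granularity placement, which is where the scalability gain discussed in the talk would come from.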

Simulation Methodology • Network Topology • Pure-random, Waxman & transit-stub models from GT-ITM • A real AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients by BGP prefix • BGP tables from a BBNPlanet router • 10K groups remain; choose the top 10%, which cover >70% of requests • Aggregate NASA Web clients by domain name • Map the client groups onto the topology • Performance Metric: average retrieval cost • Sum of edge costs from a client to its closest replica
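As a rough sketch of the performance metric, the code below computes the sum of edge costs from each client to its closest replica with Dijkstra's algorithm and averages it over requests. The adjacency-dict graph format, the per-request weighting, and the function names are assumptions for illustration; the slides do not specify how the simulator computes the metric.

```python
import heapq


def shortest_costs(graph, source):
    """Dijkstra: cheapest path cost from `source` to every reachable node.

    graph is an adjacency dict: {node: {neighbor: edge_cost, ...}, ...}
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def average_retrieval_cost(graph, requests, replicas):
    """Request-weighted mean cost from each client to its closest replica.

    requests: {client_node: number_of_requests}
    replicas: set of nodes that hold a copy of the content
    """
    total_cost = 0.0
    total_reqs = 0
    for client, n in requests.items():
        d = shortest_costs(graph, client)
        total_cost += n * min(d.get(r, float("inf")) for r in replicas)
        total_reqs += n
    return total_cost / total_reqs


graph = {"a": {"b": 1, "c": 4}, "b": {"a": 1, "c": 2}, "c": {"a": 4, "b": 2}}
requests = {"a": 3, "c": 1}
print(average_retrieval_cost(graph, requests, {"c"}))  # replica at c
print(average_retrieval_cost(graph, requests, {"b"}))  # replica at b
```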

[Figure: two panels, "Per URL" and "Per Web site", each showing replica placement for items 1-4]

Replica Placement: Per Website vs. Per URL • 60-70% average retrieval cost reduction for the Per URL scheme • Per URL is too expensive for management! • Notation: R: # of replicas/URL; K: # of clusters; M: # of URLs (M >> K); C: # of clients; S: # of CDN servers; f: placement adaptation frequency

Clustering Web Content • General clustering framework • Define the correlation distance between URLs • Cluster diameter: the max distance between any two members • Worst correlation in a cluster • Generic clustering: minimize the max diameter of all clusters • Correlation distance definition based on • Spatial locality • Temporal locality • Popularity • Semantics (e.g., directory)
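One way to approach the "minimize the max diameter" objective is the classic farthest-point heuristic, sketched below. The slides do not name the clustering algorithm the authors used, so this is a swapped-in technique for illustration; `points`, `dist`, and the function names are hypothetical.

```python
import math


def farthest_point_clustering(points, k, dist):
    """Pick k centers by repeatedly taking the point farthest from all
    centers chosen so far, then assign every point to its nearest center."""
    centers = [0]  # start from an arbitrary point (index 0)
    nearest = [dist(points[0], p) for p in points]
    while len(centers) < min(k, len(points)):
        nxt = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(nxt)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], dist(points[nxt], p))
    clusters = [[] for _ in centers]
    for i, p in enumerate(points):
        j = min(range(len(centers)), key=lambda c: dist(points[centers[c]], p))
        clusters[j].append(i)
    return clusters


def diameter(cluster, points, dist):
    """Cluster diameter: the max distance between any two members."""
    return max(
        (dist(points[a], points[b]) for a in cluster for b in cluster),
        default=0.0,
    )


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
clusters = farthest_point_clustering(points, 2, math.dist)
print(clusters, [diameter(c, points, math.dist) for c in clusters])
```

Any of the correlation distances listed above (spatial, temporal, popularity, semantic) could be passed as `dist`, as long as it takes two items and returns a non-negative number.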

Spatial Clustering • URL spatial access vector • Example: the blue URL in the figure • Correlation distance between two URLs defined as • Euclidean distance • Vector similarity • [Figure: items 1-4]
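The slide's formulas are not preserved in this transcript, so the following is only a sketch of the two distance notions it names. It assumes that a URL's spatial access vector holds one request count per client group, and that "vector similarity" means cosine similarity; both are assumptions, as are the function names.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two spatial access vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two access vectors (assumed meaning of
    'vector similarity'); returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical access vectors: requests from client groups 0, 1, 2.
url_a = [3, 4, 0]
url_b = [0, 0, 5]   # accessed by a disjoint set of clients
url_c = [6, 8, 0]   # same access pattern as url_a, twice the volume

print(euclidean_distance(url_a, url_c), euclidean_distance(url_a, url_b))
print(cosine_similarity(url_a, url_c), cosine_similarity(url_a, url_b))
```

The example shows the main difference between the two: Euclidean distance is sensitive to request volume, while cosine similarity only compares the shape of the access pattern. To use the similarity as a correlation distance, one option (again an assumption) is `1 - cosine_similarity(a, b)`.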

Clustering Web Content (cont’d) • Temporal clustering • Divide traces into multiple individuals’ access sessions [ABQ01] • In each session, • Average over multiple sessions in one day • Popularity-based clustering • Or, even simpler, sort the URLs by popularity and put the first N/K elements into the first cluster, etc. (binary correlation)
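The "even simpler" popularity-based scheme can be sketched in a few lines: sort URLs by access count and cut the ranked list into K consecutive groups of roughly N/K URLs. The input format (`access_counts` as a dict of URL to request count) and the function name are assumptions for illustration.

```python
def popularity_clusters(access_counts, k):
    """Sort URLs by popularity (most popular first) and split the ranked
    list into k consecutive clusters of roughly N/k URLs each."""
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)
    n = len(ranked)
    clusters = [ranked[i * n // k:(i + 1) * n // k] for i in range(k)]
    return [c for c in clusters if c]  # drop empty clusters when n < k


counts = {"/a": 50, "/b": 5, "/c": 40, "/d": 1, "/e": 7}
print(popularity_clusters(counts, 2))
# → [['/a', '/c'], ['/e', '/b', '/d']]
```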

Performance of Cluster-based Replication • Tested over various topologies and traces • Spatial clustering with Euclidean distance and popularity-based clustering perform the best • Even a small # of clusters (only 1-2% of the # of URLs) can achieve close to per-URL performance, with much less overhead • [Figure: MSNBC, 8/2/1999, 5 replicas/URL]

Effects of the Non-Uniform Size of URLs • Replication cost constraint: bytes • Similar trends exist • Per-URL replication outperforms per-Website replication dramatically • Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective • [Figure: items 1-4]

Outline • Motivation • Architecture • Related Work • Problem Formulation • Simulation Methodology • Granularity of Replication • Dynamic Clustering and Replication • Static clustering • Incremental clustering • Conclusions

Static Clustering and Replication • Two daily traces: old trace and new trace • Static clustering performs poorly beyond a week • Average retrieval cost almost doubles

Incremental Clustering • Generic framework • If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c • Else create new clusters and replicate them • Online incremental clustering • Push before accessed -> high availability • Predict access patterns based on semantics • Simplify to popularity prediction • Groups of URLs with similar popularity? Use hyperlink structures! • Groups of siblings • Groups of the same hyperlink depth: smallest # of links from the root

Online Popularity Prediction • Measure the divergence of URL popularity within a group: access freq span = [formula not preserved] • Experiments • Use WebReaper to crawl http://www.msnbc.com on 5/3/2002 with hyperlink depth 4, then group the URLs • Use the corresponding access logs to analyze the correlation • Groups of siblings have the best correlation

Online Incremental Clustering • Semantics-based incremental clustering • Put a new URL into the existing cluster with the largest # of its siblings • When there is a tie, choose the cluster with more replicas • Simulation on the 5/3/2002 MSNBC trace • 8-10am trace: static popularity clustering + replication • At 10am: 16 new URLs emerged; apply online incremental clustering + replication • Evaluation with the 10am-12pm trace: the 16 new URLs receive 33,262 requests • [Figure: clusters of URLs 1-6 and a new URL "?"]
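The sibling rule above can be sketched as follows. The data structures are assumptions for illustration: `parent_of` maps each URL to its parent page in the hyperlink tree, `clusters` maps a cluster id to its member URLs, and `replica_count` gives the number of replicas per cluster. The slides do not say what happens when no cluster contains a sibling; in this sketch the tie-break simply picks the cluster with the most replicas.

```python
def assign_by_siblings(new_url, parent_of, clusters, replica_count):
    """Add new_url to the cluster holding the most of its siblings
    (URLs with the same hyperlink parent); break ties by replica count."""
    parent = parent_of[new_url]

    def score(cid):
        siblings = sum(
            1 for u in clusters[cid]
            if u != new_url and parent_of.get(u) == parent
        )
        # Tuples compare element by element: siblings first, then replicas.
        return (siblings, replica_count[cid])

    best = max(clusters, key=score)
    clusters[best].add(new_url)
    return best


# Hypothetical site structure and clusters.
parent_of = {
    "/news/1": "/news", "/news/2": "/news", "/news/3": "/news",
    "/sports/1": "/sports", "/news/new": "/news", "/sports/new": "/sports",
}
clusters = {"A": {"/news/1", "/news/2"}, "B": {"/news/3", "/sports/1"}}
replica_count = {"A": 1, "B": 9}

print(assign_by_siblings("/news/new", parent_of, clusters, replica_count))
print(assign_by_siblings("/sports/new", parent_of, clusters, replica_count))
```

Once a cluster is chosen, the new URL would be pushed to that cluster's existing replica locations, matching the generic incremental framework described earlier in the talk.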

Online Incremental Clustering & Replication Results

Offline Incremental Clustering • Assume access history as input • Study spatial clustering and popularity-based clustering • For instance, spatial clustering with Euclidean distance • Find the closest cluster c for a new URL u • Match if (s < r) • More than 98% of new URLs match old clusters • Cluster orphan URLs with a diameter of dmax • Replicate them with the average # of replicas/URL • [Figure: diagram labeled u, s, c, r]
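A minimal sketch of the matching step, under stated assumptions: the slide does not define s and r in text, so the code takes s to be the Euclidean distance from the new URL's access vector to the center of its closest cluster, and r to be that cluster's radius. The names `centers`, `radii`, and `assign_offline` are hypothetical. URLs that match no cluster are returned as orphans, to be clustered separately.

```python
import math


def assign_offline(new_vectors, centers, radii):
    """Match each new URL to its closest existing cluster if it falls
    inside that cluster's radius; otherwise report it as an orphan.

    new_vectors: {url: access_vector}
    centers:     {cluster_id: center_vector}   (assumed representation)
    radii:       {cluster_id: radius r}        (assumed meaning of r)
    """
    matched, orphans = {}, []
    for url, vec in new_vectors.items():
        # Closest cluster c for the new URL u.
        cid = min(centers, key=lambda c: math.dist(vec, centers[c]))
        s = math.dist(vec, centers[cid])
        if s < radii[cid]:
            matched[url] = cid
        else:
            orphans.append(url)
    return matched, orphans


centers = {"c1": (0, 0), "c2": (10, 10)}
radii = {"c1": 2.0, "c2": 1.0}
new_vectors = {"/u1": (1, 1), "/u2": (10, 12), "/u3": (5, 5)}
print(assign_offline(new_vectors, centers, radii))
```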

Offline Incremental Clustering Results • Performance close to optimal • With only 25-45% replication cost

Conclusions • CDN operators: cooperative, clustering-based replication • Cooperative: big savings on replica management and update cost • Per-URL replication outperforms the per-Website scheme by 60-70% • Clustering solves the scalability issues and gives the full spectrum of flexibility • Spatial clustering and popularity-based clustering recommended • To adapt to users’ access patterns: incremental clustering • Hyperlink-based online incremental clustering for • High availability • Performance improvement • Offline incremental clustering performs close to optimal

Performance of Cluster-based Replication • Tested over various topologies and traces • Spatial clustering with Euclidean distance and popularity-based clustering perform the best • Even a small # of clusters (only 1-2% of the # of URLs) can achieve close to per-URL performance • [Figures: MSNBC, 8/2/1999, 5 replicas/URL; NASA, 7/1/1995, 3 replicas/URL]
