Clustering of Web Content for Efficient Replication
Yan Chen, Lili Qiu, Wei Chen, Luan Nguyen and Randy H. Katz
{yanchen, wychen, luann, randy}@eecs.berkeley.edu, liliq@microsoft.com
Introduction
• CDNs (Content Distribution Networks) improve Web performance by replicating content close to clients. The greedy algorithm has been shown to be efficient and effective for static replica placement, reducing the response latency of end users.
• Problem: what content should be replicated? All previous work assumes replication of the whole Website. A per-URL scheme yields a 60-70% reduction in clients' latency, but is too expensive.
• Goal: exploit this tradeoff so that performance improves significantly without high overhead.
• Our solution:
  - Hot-data analysis to filter out infrequently used data
  - Cluster URLs based on access pattern, and replicate in units of clusters
  - Incremental clustering and redistribution to adapt to emerging URLs and changes in clients' access patterns
Related Work
• Qiu et al. and Jamin et al. independently reported that a greedy algorithm is close to optimal for static replica placement
• There is much prior work on clustering Web content, but it focuses on analyzing individual clients' access patterns
• In contrast, we are more interested in aggregated clients
• Among the first to use stability and performance as figures of merit for Web content clustering

Problem formulation: minimize the total latency of clients, subject to the constraint that the total replication cost, Σ_u |u|, is bounded by R, where |u| denotes the number of replicas of URL u.
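One way to write the optimization out in full (a reconstruction; the per-cluster request rates f_{c,u} and the nearest-replica distance term are assumptions, since the original formula did not survive extraction):

\min \;\; \sum_{c \in C} \sum_{u \in U} f_{c,u} \, d\bigl(c, r_u(c)\bigr)
\quad \text{subject to} \quad \sum_{u \in U} |u| \le R

Here f_{c,u} is the request rate of client cluster c for URL u, r_u(c) is the replica of u nearest to c, d(·,·) is the network latency between them, and |u| is the number of replicas of URL u.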
Problem Setup
• Network topology:
  - Pure-random and Transit-Stub models from GT-ITM
  - A real AS-level topology from 7 widely-dispersed BGP peers
• Real-world traces:
  - Cluster MSNBC Web clients by BGP prefix
    - BGP tables from a BBNPlanet router on 01/24/2001
    - 10K clusters left; choose the top 10%, covering >70% of requests
  - Cluster NASA Web clients by domain names
  - Map the client clusters randomly onto the topology
Hot Data Stability Analysis
• The top 10% of URLs cover over 85% of requests
• Hot data remain stable for a reasonably long time
  - The top 10% of URLs on a given day cover over 80% of requests for at least the subsequent week
• Conclusion: only hot data need to be considered for replication

[Figures (MSNBC trace): stability of request coverage; stability of popularity ranking]
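A minimal sketch of the hot-data filter (the log format is hypothetical; requests maps each URL to its request count):

from collections import Counter

def hot_urls(requests: Counter, fraction: float = 0.10):
    """Return the top `fraction` of URLs by request count,
    together with the share of total requests they cover."""
    ranked = [url for url, _ in requests.most_common()]
    n_hot = max(1, int(len(ranked) * fraction))
    hot = ranked[:n_hot]
    coverage = sum(requests[u] for u in hot) / sum(requests.values())
    return hot, coverage

# On the MSNBC trace, the top 10% of URLs covered over 85% of requests.
requests = Counter({"/index.html": 900, "/news/a": 60, "/news/b": 25,
                    "/old/x": 10, "/old/y": 5})
hot, cov = hot_urls(requests)
print(hot, f"{cov:.0%}")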
Replica Placement Strategy
• Replication unit: per-Website, per-URL, or cluster of URLs
• Symbols used in the cost comparison:
  - M: number of hot objects
  - R: number of replicas/URL
  - K: number of clusters
  - C: number of clients
  - S: number of CDN servers
  - fp: placement adaptation frequency
  - fc: clustering frequency
• Big performance gap between per-Website and per-URL replication
• Clustering enables a smooth tradeoff between cost and performance
• Directory-based clustering provides only marginal improvement

[Figure: MSNBC trace]
Replica Placement Algorithm
• Greedy search: iteratively choose the <object, location> pair that gives the largest performance gain, and replicate
  - An object can be an individual URL or a URL cluster
  - (Pseudocode is in the backup slides)

General Clustering Framework
• Two steps:
  - Define a correlation distance between each pair of URLs
  - Apply one of the generic clustering methods below
• Generic clustering algorithms:
  - Algorithm 1: limit the diameter (the max distance between any two URLs) of each cluster, and minimize the number of clusters
  - Algorithm 2: limit the number of clusters, and minimize the max diameter over all clusters
Correlation Distance
• Spatial clustering:
  - Represent the access distribution of a URL by a spatial access vector of K dimensions (K = number of client clusters)
  - Correlation distance defined as either:
    1. The Euclidean distance between two spatial access vectors in the K-dimensional space
    2. One minus the vector (cosine) similarity of two spatial access vectors A and B: 1 − (A·B) / (‖A‖ ‖B‖)
• Temporal clustering:
  - Divide user requests into sessions, and analyze the access patterns in each session
  - Correlation distance based on how often the two URLs are accessed within the same session
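A minimal sketch of the two spatial distance measures (NumPy arrays stand in for access vectors; reading "vector similarity" as cosine similarity is an assumption):

import numpy as np

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance between two K-dimensional spatial access vectors."""
    return float(np.linalg.norm(a - b))

def similarity_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity: 0 when the access distributions are
    identical up to scale, approaching 1 when they are orthogonal."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Two URLs requested mostly by the same client clusters are "close".
u1 = np.array([90.0, 5.0, 5.0])   # requests per client cluster (K = 3)
u2 = np.array([80.0, 10.0, 10.0])
print(euclidean_distance(u1, u2), similarity_distance(u1, u2))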
Performance of Cluster-based Replication
• Performance: spatial clustering > spatial clustering with similarity > temporal clustering
• With only 1-2% of the cost of the per-URL scheme, cluster-based replication achieves performance close to per-URL replication

[Figure: performance of various clustering approaches for the MSNBC 8/1/99 trace; (a) with 5 replicas/URL, (b) with up to 50 replicas/URL]
Stability of Cluster-based Replication
• Question: how often do we need to re-cluster and re-replicate?
• Static clustering, three configurations compared:
  1. Both clusters and replica locations based on old traces
  2. Clusters based on old traces, replica locations based on new traces
  3. Both clusters and replica locations based on new traces
  - The performance gap is mostly due to emerging URLs
• Incremental clustering (sketched below):
  - Reclaim the space of cold URLs/clusters
  - Assign new URLs to existing clusters when the correlation matches, and replicate
  - Generate new clusters for the remaining new URLs, and replicate

[Figure: MSNBC trace]
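A minimal sketch of the incremental step (the centroid representation and distance threshold are assumptions; distance is one of the correlation distances above, and cluster ids are assumed to be 0..n-1):

import numpy as np

def incremental_update(clusters, new_urls, vectors, distance, max_dist):
    """Assign each new URL to the nearest existing cluster if it lies
    within max_dist of that cluster's centroid; otherwise start a new
    cluster. `clusters` maps cluster id -> list of URLs; `vectors`
    maps URL -> spatial access vector."""
    centroids = {cid: np.mean([vectors[u] for u in urls], axis=0)
                 for cid, urls in clusters.items()}
    for url in new_urls:
        cid, d = min(((c, distance(vectors[url], ctr))
                      for c, ctr in centroids.items()),
                     key=lambda x: x[1], default=(None, float("inf")))
        if cid is not None and d <= max_dist:
            clusters[cid].append(url)        # correlation matches: join
        else:
            new_id = len(clusters)           # otherwise open a new cluster
            clusters[new_id] = [url]
            centroids[new_id] = vectors[url]
    return clusters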
Backup Slides

Greedy replica placement:
• (1) Initially, all URLs reside at the origin Web servers
• (2) currReplicationCost = totalURLs
• (3) For each URL, find its best replication location, and the amount of reduction in cost if the URL were replicated to that location
• (4) While (currReplicationCost < maxReplicationCost)
• {
•   Choose the URL that has the largest reduction in cost, and replicate it to the designated node
•   For that URL, find its next best replication location, and the amount of reduction in cost if it were replicated to that location
•   currReplicationCost++
• }
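A runnable Python sketch of the greedy loop above (the gain function is a stand-in for the latency reduction computed from the trace; URL and node names are hypothetical):

def greedy_placement(urls, nodes, gain, max_cost):
    """Greedily pick <URL, node> replications with the largest latency
    reduction until the replica budget max_cost is exhausted.
    gain(url, placed_nodes, node) returns the latency reduction from
    adding a replica of url at node, given its existing replicas."""
    placed = {u: set() for u in urls}   # origin copies are implicit
    cost = len(urls)                    # one copy of each URL at its origin
    while cost < max_cost:
        best = max(((gain(u, placed[u], n), u, n)
                    for u in urls for n in nodes if n not in placed[u]),
                   default=None)
        if best is None or best[0] <= 0:
            break                       # no beneficial placement remains
        _, u, n = best
        placed[u].add(n)                # replicate URL u to node n
        cost += 1
    return placed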
LimitDiameterClustering-Greedy(uncovered URLs N, radius W)
• While (N is not empty)
• {
•   Choose s ∈ N such that the K-dimensional ball centered at s with radius W covers the largest number of URLs in N
•   Output the new cluster N_s, which consists of all URLs covered by the K-dimensional ball centered at s with radius W
•   N = N − N_s
• }
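A runnable Python sketch of this greedy covering loop (Euclidean correlation distance as above; the quadratic search is kept simple for clarity):

import numpy as np

def limit_diameter_clustering(vectors, W):
    """Greedy ball cover: repeatedly pick the URL whose radius-W ball
    covers the most uncovered URLs, emit that ball as a cluster, and
    remove it. `vectors` maps URL -> spatial access vector."""
    uncovered = set(vectors)
    clusters = []
    while uncovered:
        def ball(center):
            c = vectors[center]
            return {u for u in uncovered
                    if np.linalg.norm(vectors[u] - c) <= W}
        # s = the center whose ball covers the largest number of URLs in N
        s = max(uncovered, key=lambda u: len(ball(u)))
        cluster = ball(s)               # Ns
        clusters.append(cluster)
        uncovered -= cluster            # N = N - Ns
    return clusters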