Efficient and Adaptive Replication using Content Clustering
Yan Chen, EECS Department, UC Berkeley
Motivation • The Internet has evolved into a commercial infrastructure for service delivery • Web delivery, VoIP, streaming media … • Challenges for Internet-scale services • Scalability: 600M users, 35M Web sites, 2.1 Tb/s of traffic • Efficiency: bandwidth, storage, management • Agility: dynamic clients/networks/servers • Security, etc. • Focus on content delivery: Content Distribution Networks (CDNs) • 4 billion Web pages in total, daily growth of 7M pages • Annual traffic growth of 200% for the next 4 years
New Challenges for CDN • Large multimedia files ― Efficient replication • Dynamic content ― Coherence support • Network congestion/failures ― Scalable network monitoring
Existing CDNs Fail to Address These Challenges
• No coherence support for dynamic content
• Unscalable network monitoring: O(M × N), where M = # of client groups and N = # of server farms
• Non-cooperative replication is inefficient
Design Space: Existing CDNs vs. SCAN (Scalable Content Access Network)
• Provisioning (replica placement): non-cooperative pull (existing CDNs) vs. cooperative push (SCAN)
• Granularity: per object, per cluster, or per Web site
• Network monitoring: ad hoc pair-wise monitoring, O(M×N), vs. tomography-based monitoring, O(M+N)
• Coherence support: unicast, IP multicast, or application-level multicast on a P2P DHT
SCAN
• Coherence for dynamic content (replicas s1, s4, s5)
• Cooperative clustering-based replication
• Scalable network monitoring: O(M + N), where M = # of client groups and N = # of server farms
Evaluation of Internet-scale Systems
• Iterate between algorithm design and realistic simulation
• Realistic simulation driven by
  • Network topology
  • Web workload
  • Network end-to-end latency measurements
• Analytical evaluation
• Real evaluation?
Network Topology and Web Workload • Network Topology • Pure-random, Waxman & transit-stub synthetic topology • An AS-level topology from 7 widely-dispersed BGP peers • Web Workload • Aggregate MSNBC Web clients with BGP prefix • BGP tables from a BBNPlanet router • Aggregate NASA Web clients with domain names • Map the client groups onto the topology
Network E2E Latency Measurement • NLANR Active Measurement Project data set • 111 sites in America, Asia, Australia and Europe • Round-trip time (RTT) between every pair of hosts every minute • 17M daily measurements • Raw data: Jun. – Dec. 2001, Nov. 2002 • Keynote measurement data • Measures TCP performance from about 100 worldwide agents • Heterogeneous core network: various ISPs • Heterogeneous access networks: dial-up 56K, DSL and high-bandwidth business connections • Targets: 40 most popular Web servers + 27 Internet Data Centers • Raw data: Nov. – Dec. 2001, Mar. – May 2002
Overview • CDNs use non-cooperative replication, which is inefficient • Paradigm shift: cooperative push • Where to push: greedy algorithms achieve close-to-optimal performance [JJKRS01, QPV01] • But what content should be pushed, and at what granularity? • Clustering of objects for replication • Close-to-optimal performance with small overhead • Incremental clustering • Push before objects are accessed: improves availability during flash crowds
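The greedy placement cited above ([JJKRS01, QPV01]) can be sketched as: repeatedly add the replica site that most reduces the total demand-weighted retrieval cost. This is a minimal illustration, not the papers' exact algorithm; the function and variable names are mine.

```python
def greedy_placement(cost, demand, budget):
    """Greedy replica placement sketch.
    cost[i][j]: retrieval cost from client group i to candidate site j;
    demand[i]: request rate of client group i;
    budget: max # of replica sites to choose."""
    BIG = 1e9  # stands in for the cost of having no replica at all
    n_clients, n_sites = len(cost), len(cost[0])
    best = [BIG] * n_clients  # best retrieval cost so far, per client group
    chosen = []
    for _ in range(budget):
        pick, pick_gain = None, 0.0
        for j in range(n_sites):
            if j in chosen:
                continue
            # Total cost reduction if site j is added.
            gain = sum(demand[i] * max(0.0, best[i] - cost[i][j])
                       for i in range(n_clients))
            if gain > pick_gain:
                pick, pick_gain = j, gain
        if pick is None:  # no remaining site improves anything
            break
        chosen.append(pick)
        best = [min(best[i], cost[i][pick]) for i in range(n_clients)]
    return chosen
```

Each iteration is a local best choice; the slide's point is that this simple loop gets close to the optimal placement in practice.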
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Conventional CDN: Non-cooperative Pull
1. Client requests hostname resolution from its local DNS server
2. CDN name server replies with the IP address of a local CDN server
3. Client sends a GET request to that local CDN server
4. On a cache miss, the CDN server forwards the GET request to the Web content server
5. Web content server sends the response to the CDN server
6. CDN server returns the response to the client
• Inefficient replication
SCAN: Cooperative Push
0. Replicas are pushed to CDN servers ahead of requests
1. Client requests hostname resolution from its local DNS server
2. CDN name server replies with the IP address of a nearby replica server (or of the Web server)
3. Client sends a GET request; the Web content server serves it only if no replica exists yet
4. Response is returned to the client
• Significantly reduces the # of replicas and the update cost
Problem Formulation • How to use cooperative push for replication to reduce • Clients’ average retrieval cost • Replica location computation cost • Amount of replica directory state to maintain • Subject to certain total replication cost (e.g., # of object replicas)
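The formulation above can be written as a constrained minimization (the notation below is mine, not from the slide): choose a replica set R_o for each object o so that the demand-weighted retrieval cost is minimized under a replication budget B.

```latex
\min_{\{R_o\}_{o \in O}} \;
\sum_{o \in O} \sum_{i \in C} \lambda_{i,o}\,
  d\bigl(i,\; \mathrm{nearest}(i,\, R_o \cup \{w\})\bigr)
\qquad \text{s.t.} \quad \sum_{o \in O} \lvert R_o \rvert \le B
```

Here C is the set of client groups, O the set of objects, \lambda_{i,o} the request rate of group i for object o, d the network distance, and w the origin Web server. The secondary goals on the slide (computation cost, replica directory state) constrain how this minimization may be computed, not the objective itself.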
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Granularity of Replication
(figure: placement of objects 1–4 under the per-object scheme vs. the per-Web-site scheme)
Replica Placement: Per Site vs. Per Object • 60 – 70% average retrieval cost reduction for the per-object scheme • But per-object replication is too expensive to manage!
Overhead Comparison
• Per-object replication: overhead scales with R and M, where R = # of replicas per object and M = total # of objects in the Web site
  • Computing an average of 10 replicas/object for just the top 1000 objects takes several days on a normal server!
• Cluster-based replication: overhead scales with R and K instead, where K = # of clusters (M >> K)
Clustering Web Content • General clustering framework • Define the correlation distance between objects • Cluster diameter: the max distance between any two members • Worst correlation in a cluster • Generic clustering: minimize the max diameter of all clusters • Correlation distance definition based on • Spatial locality • Temporal locality • Popularity
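The generic framework above (bound the cluster diameter, i.e. the worst correlation between any two members) can be sketched greedily. Exact min-max-diameter clustering is hard; this illustrative sketch just refuses to grow a cluster past a diameter threshold.

```python
def diameter_cluster(objects, dist, max_diam):
    """Greedy diameter-bounded clustering sketch.
    dist(a, b): correlation distance between two objects;
    max_diam: largest allowed distance between any two cluster members."""
    clusters = []
    for o in objects:
        for c in clusters:
            # Adding o keeps the diameter bounded only if o is within
            # max_diam of every current member (the worst correlation).
            if all(dist(o, m) <= max_diam for m in c):
                c.append(o)
                break
        else:
            clusters.append([o])  # no cluster fits: start a new one
    return clusters
```

Any of the three correlation distances on the slide (spatial, temporal, popularity) can be plugged in as `dist`.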
Spatial Clustering
• Each object has a spatial access vector: its # of accesses from each client group
  (figure: access vector of the blue object over client groups 1–4)
• Correlation distance between two objects defined as
  • Euclidean distance
  • Vector similarity
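The spatial access vector and the Euclidean variant of the correlation distance can be sketched directly (the log format here is illustrative: a list of (object, client-group) request pairs):

```python
import math

def spatial_vectors(access_log, n_groups):
    """Build per-object spatial access vectors: entry g counts requests
    for the object coming from client group g."""
    vecs = {}
    for obj, grp in access_log:
        vecs.setdefault(obj, [0] * n_groups)[grp] += 1
    return vecs

def euclidean(u, v):
    """Correlation distance as the Euclidean distance between two
    spatial access vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Objects requested by the same client groups in similar proportions end up close together, which is exactly what makes them good candidates for the same replica set.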
Clustering Web Content (cont’d)
• Temporal clustering
  • Divide traces into multiple individual access sessions [ABQ01]
  • Compute the correlation distance within each session, then average over multiple sessions in one day
• Popularity-based clustering
  • Or even simpler: sort objects by popularity and put the first N/K elements into the first cluster, etc.
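The "even simpler" popularity variant above is just a sort-and-cut:

```python
def popularity_clusters(freq, k):
    """Rank objects by access frequency and cut the ranked list into k
    equal-size clusters (the simple variant described on the slide).
    freq: {object: access count}."""
    ranked = sorted(freq, key=freq.get, reverse=True)
    size = -(-len(ranked) // k)  # ceiling division: N/K objects per cluster
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Despite its simplicity, this is one of the two schemes the evaluation later recommends (together with spatial clustering using Euclidean distance).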
Performance of Cluster-based Replication • Use greedy algorithm for replication • Spatial clustering with Euclidean distance and popularity-based clustering perform the best • Small # of clusters (with only 1-2% of # of objects) can achieve close to per-object performance, with much less overhead
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Static Clustering and Replication • Two daily traces: a training trace and a new trace • Static clustering performs poorly beyond a week • Retrieval cost of static clustering almost doubles the optimal!
Incremental Clustering • Generic framework • If a new object o matches an existing cluster c, add o to c and replicate o to the existing replicas of c • Else create a new cluster for o and replicate it • Two types of incremental clustering • Online: without any access logs – high availability • Offline: with access logs – close-to-optimal performance
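The generic framework above can be sketched as one function; the cluster representation and the `matches`/`replicate` callbacks are illustrative, standing in for whichever match predicate (online or offline) and replication mechanism are used.

```python
def incremental_add(obj, clusters, matches, replicate):
    """Generic incremental clustering sketch: join the first matching
    cluster and copy the new object to that cluster's existing replica
    sites; otherwise start a fresh cluster.
    clusters: list of {"objects": [...], "replicas": [...]} dicts."""
    for c in clusters:
        if matches(obj, c):
            c["objects"].append(obj)
            for site in c["replicas"]:
                replicate(obj, site)  # push to where the cluster already lives
            return c
    fresh = {"objects": [obj], "replicas": []}
    clusters.append(fresh)
    return fresh
```

Pushing the new object to the cluster's existing replicas before any client asks for it is what gives the availability benefit during flash crowds.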
Online Incremental Clustering
• Predict access patterns based on semantics
• Simplify to popularity prediction
• Which groups of objects have similar popularity? Use hyperlink structures!
  • Groups of siblings
  • Groups of the same hyperlink depth (smallest # of links from the root)
(figure: hyperlink graph in which object 1 links to objects 2, 3 and 4; object 2 links to objects 5 and 6; object 4 links to objects 3 and 7)
Online Popularity Prediction
• Measure the divergence of object popularity within a group: the access frequency span
• Experiments
  • Crawl http://www.msnbc.com to hyperlink depth 4, then group the objects
  • Use the corresponding access logs to analyze the correlation
• Groups of siblings have better correlation
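The slide's exact formula for the access frequency span is not reproduced here, so the sketch below assumes one plausible definition, the normalized spread of access frequencies within a group: 0 means all objects in the group are equally popular, values near 1 mean the group mixes hot and cold objects.

```python
def access_freq_span(freqs):
    """Divergence of object popularity within a group of objects.
    Assumed definition (not the slide's verbatim formula):
    (max - min) / max over the group's access frequencies."""
    hi, lo = max(freqs), min(freqs)
    return (hi - lo) / hi if hi else 0.0
```

Under this reading, "groups of siblings have better correlation" means sibling groups tend to have a smaller span than same-depth groups.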
Semantics-based Incremental Clustering
• Put each new object into the existing cluster with the largest number of its siblings
• In case of a tie, choose the cluster with more replicas
• Simulation on MSNBC daily traces
  • 8–10am trace: static popularity clustering + replication
  • At 10am: M new objects appear – online incremental clustering + replication
  • Evaluated with the 10–12am trace: each new object receives O(10^3) requests
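The sibling rule with its tie-break can be sketched as follows; the cluster representation is illustrative.

```python
def assign_by_siblings(new_obj, siblings, clusters):
    """Place a new object into the existing cluster holding the most of
    its hyperlink siblings; break ties toward the cluster with more
    replicas, as the slide describes.
    siblings: set of object ids sharing a parent page with new_obj;
    clusters: list of {"objects": [...], "replicas": [...]} dicts."""
    best_score, best = (-1, -1), None
    for c in clusters:
        # Lexicographic score: (# of siblings in cluster, # of replicas),
        # so replica count only decides ties on sibling count.
        score = (len(siblings & set(c["objects"])), len(c["replicas"]))
        if score > best_score:
            best_score, best = score, c
    best["objects"].append(new_obj)
    return best
```

The lexicographic tuple comparison encodes the tie-break for free: replica count is only consulted when two clusters hold the same number of siblings.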
Online Incremental Clustering and Replication Results
• Retrieval cost is 1/8 of that with no replication, and 1/5 of that with random replication
• Double the optimal retrieval cost, but with only 4% of the optimal’s replication cost
Conclusions • Cooperative, clustering-based replication • Cooperative push: only 4 – 5% of the replication/update cost of existing CDNs • Clustering reduces the management/computational overhead by two orders of magnitude • Spatial clustering and popularity-based clustering are recommended • Incremental clustering adapts to emerging objects • Hyperlink-based online incremental clustering gives high availability and performance improvement
Tie Back to SCAN • Self-organize replicas into app-level multicast tree for update dissemination • Scalable overlay network monitoring • O(M+N) instead of O(M×N), given M client groups and N servers • For more info: http://www.cs.berkeley.edu/~yanchen/resume.html#Publications
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Future Research (I) • Measurement-based Internet study and protocol/architecture design • Use inference techniques to develop Internet behavior models • Network operators are reluctant to reveal internal network configurations • Root cause analysis: large-scale, heterogeneous data mining • Leverage graphics/visualization for interactive mining • Apply a deeper understanding of Internet behavior to the reassessment/design of protocols/architectures • E.g., are Internet bottlenecks at peering links? How and why? Implications?
Future Research (II) • Network traffic anomaly characterization, identification and detection • Many unknown flow-level anomalies revealed from real router traffic analysis (AT&T) • Profile traffic patterns of new applications (e.g., P2P) → benign anomalies • Understand the causes, patterns and prevalence of other unknown anomalies • Apply malicious patterns for intrusion detection • E.g., fight against the Sapphire/Slammer worm • Leverage Forensix for auditing and querying
Tomography-based Network Monitoring
• End-to-end loss rate p of a path through links 0, 1, 2: 1 − p = (1 − l_0)(1 − l_1)(1 − l_2), where l_j is the loss rate of link j
• Given O(M+N) end hosts, a power-law degree topology implies O(M+N) links – versus M × N end-to-end paths
• Transform to the topology matrix
• Pick O(M+N) paths to compute the link loss rates
• Use the link loss rates to compute the loss rates of the other paths
Path Loss Rate Inference
• Ideal case: rank of the path matrix = # of links (k)
• Rank deficiency is solved through topology transformation, which merges real links into virtual links
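The inference step reduces to linear algebra: taking logs turns the multiplicative relation 1 − p = ∏(1 − l_j) into a linear system G x = b, where G[i][j] = 1 if path i crosses link j and x_j = log(1 − l_j). The sketch below solves a small square, full-rank instance by Gaussian elimination; a real deployment selects O(M+N) independent paths and applies the topology transformation to handle rank deficiency.

```python
import math

def infer_link_loss(paths, path_loss):
    """Solve for per-link loss rates from end-to-end path loss rates.
    paths: square 0/1 routing matrix (one row per measured path);
    path_loss: observed loss rate of each measured path."""
    n = len(paths[0])  # number of links; assumes len(paths) == n
    G = [list(map(float, row)) for row in paths]
    b = [math.log(1.0 - p) for p in path_loss]  # log-success per path
    # Forward elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(G[r][col]))
        G[col], G[piv] = G[piv], G[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = G[r][col] / G[col][col]
            G[r] = [a - f * c for a, c in zip(G[r], G[col])]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (b[r] - sum(G[r][c] * x[c] for c in range(r + 1, n))) / G[r][r]
    # Recover link loss rates: l_j = 1 - exp(x_j).
    return [1.0 - math.exp(v) for v in x]
```

Once the link loss rates are known, the loss rate of any unmeasured path is just 1 minus the product of its links' success rates, which is exactly why only O(M+N) of the M × N paths need active monitoring.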
Future Research (I) • Internet behavior modeling and protocol / architecture design • Use inference techniques to develop Internet behavior models • Root cause analysis: large, heterogeneous data mining • Leverage graphics/visualization for interactive mining • Leverage SciClone Cluster for parallel network tomography • Apply deeper understanding of Internet behaviors for reassessment/design of protocol/architecture • E.g., Internet bottleneck – peering links? How and Why? Implications?
Tomography-based Network Monitoring
• Observations
  • The # of lossy links is small, but they dominate E2E loss
  • Loss rates are stable (on the order of hours to days)
  • Routing is stable (on the order of days)
• Identify the lossy links and monitor only a few paths to examine them
• Make inferences for the other paths
SCAN: Scalable Content Access Network
• CDN applications (e.g., streaming media)
• Provisioning: cooperative clustering-based replication
• Coherence: update multicast tree construction
• Supported by network distance/congestion/failure estimation, user behavior/workload monitoring, and network performance monitoring
(red: my work, black: out of scope)