Efficient and Adaptive Replication using Content Clustering
Yan Chen, EECS Department, UC Berkeley
Motivation
• The Internet has evolved into a commercial infrastructure for service delivery
  • Web delivery, VoIP, streaming media …
• Challenges for Internet-scale services
  • Scalability: 600M users, 35M Web sites, 28Tb/s
  • Efficiency: bandwidth, storage, management
  • Agility: dynamic clients/network/servers
  • Security, etc.
• Focus on content delivery: Content Distribution Networks (CDNs)
  • 4 billion Web pages in total, with daily growth of 7M pages
  • 200% annual growth projected for the next 4 years
CDN and Its Challenges
[Figure: a conventional CDN, annotated with its problems]
• No coherence for dynamic content
• Inefficient replication
• Unscalable network monitoring: O(M*N)
SCAN: Scalable Content Access Network
[Diagram: SCAN components supporting CDN applications (e.g., streaming media)]
• Provision: cooperative clustering-based replication
• Coherence: update multicast tree construction
• Network performance monitoring: network distance/congestion/failure estimation
• User behavior/workload monitoring
(red: my work, black: out of scope)
SCAN
[Figure: SCAN addressing the CDN challenges; replicas at servers s1, s4, s5]
• Coherence for dynamic content
• Cooperative clustering-based replication
• Scalable network monitoring: O(M+N)
Internet-scale Simulation
• Network topology
  • Pure-random, Waxman & transit-stub synthetic topologies
  • An AS-level topology from 7 widely-dispersed BGP peers
• Web workload
  • Aggregate MSNBC Web clients by BGP prefix (BGP tables from a BBNPlanet router)
  • Aggregate NASA Web clients by domain name
  • Map the client groups onto the topology
Internet-scale Simulation – E2E Measurement
• NLANR Active Measurement Project data set
  • 111 sites in America, Asia, Australia and Europe
  • Round-trip time (RTT) between every pair of hosts, measured every minute
  • 17M measurements daily
  • Raw data: Jun. – Dec. 2001, Nov. 2002
• Keynote measurement data
  • Measures TCP performance from about 100 worldwide agents
  • Heterogeneous core network: various ISPs
  • Heterogeneous access networks: dial-up 56K, DSL, and high-bandwidth business connections
  • Targets: 40 most popular Web servers + 27 Internet Data Centers
  • Raw data: Nov. – Dec. 2001, Mar. – May 2002
Overview
• CDNs use non-cooperative replication – inefficient
• Paradigm shift: cooperative push
  • Where to push: greedy algorithms achieve close-to-optimal performance [JJKRS01, QPV01] (see the sketch below)
  • But which content should be pushed, and at what granularity?
• Clustering of objects for replication
  • Close-to-optimal performance with small overhead
  • Incremental clustering
  • Push content before it is accessed: improves availability during flash crowds
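The greedy placement referenced above can be illustrated as follows. This is a minimal sketch, not the cited algorithms themselves; the cost model and the inputs (distance matrix, per-client load) are assumptions for illustration.

```python
# Greedy replica placement sketch: repeatedly place the next replica at
# the node that most reduces clients' total retrieval cost.

def greedy_placement(num_replicas, nodes, clients, load, dist):
    """nodes/clients: lists of ids; load[c]: request volume of client c;
    dist[c][n]: retrieval cost from client c to node n."""
    replicas = []

    def total_cost(placed):
        # Each client fetches from its cheapest replica.
        return sum(load[c] * min(dist[c][n] for n in placed) for c in clients)

    for _ in range(num_replicas):
        candidates = [n for n in nodes if n not in replicas]
        # Pick the candidate that minimizes total retrieval cost so far.
        best = min(candidates, key=lambda n: total_cost(replicas + [n]))
        replicas.append(best)
    return replicas
```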
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Conventional CDN: Non-cooperative Pull
[Figure: Client 1 / Client 2 with local DNS and local CDN servers in ISP 1 / ISP 2, a CDN name server, and the Web content server]
1. GET request (client)
2. Request for hostname resolution (client → local DNS server)
3. Reply: local CDN server IP address (CDN name server → local DNS server)
4. Local CDN server IP address (local DNS server → client)
5. GET request (client → local CDN server)
6. GET request if cache miss (local CDN server → Web content server)
7. Response (Web content server → local CDN server)
8. Response (local CDN server → client)
Takeaway: inefficient replication
SCAN: Cooperative Push
[Figure: same setting, but replicas are pushed to CDN servers before requests arrive]
0. Push replicas (Web content server → CDN servers)
1. GET request (client)
2. Request for hostname resolution (client → local DNS server)
3. Reply: nearby replica server or Web server IP address (CDN name server → local DNS server)
4. Redirected server IP address (local DNS server → client)
5. GET request (client → replica server, or the Web server if no replica yet)
6. Response
Takeaway: significantly reduces the # of replicas and the update cost
Problem Formulation
• Find a scalable, adaptive replication strategy (formalized below) that reduces
  • Clients' average retrieval cost
  • Replica-location computation cost
  • Amount of replica directory state to maintain
• Subject to a total replication cost constraint (e.g., # of URL replicas)
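One way to formalize the objective above; this is a hedged reading of the slide, and the notation is mine, not the talk's:

```latex
% C: client groups, U: URLs, S: candidate servers
% \lambda_{c,u}: request rate of client group c for URL u
% d(c,s): retrieval cost from c to server s, R(u): replica set of URL u
% B: total replication budget (e.g., # of URL replicas)
\min_{R(u) \subseteq S,\ \forall u \in U}
  \sum_{c \in C} \sum_{u \in U} \lambda_{c,u} \, \min_{s \in R(u)} d(c, s)
\qquad \text{s.t.} \quad \sum_{u \in U} |R(u)| \le B
```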
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
[Figure: replica placement at two granularities – per URL vs. per Web site – on the same four-node example topology]
Replica Placement: Per Web Site vs. Per URL
• Per-URL placement reduces average retrieval cost by 60 – 70%
• But per-URL placement is too expensive to manage!
Overhead Comparison
• Per-URL replication: replica directory state on the order of M·R entries, and the placement algorithm runs once per URL
• Per-cluster replication: state on the order of K·R entries (plus a URL-to-cluster map), and the placement algorithm runs once per cluster
  • where R: # of replicas per URL, K: # of clusters, M: # of URLs (M >> K)
• Computing an average of 10 replicas/URL for just the top 1000 URLs takes several days on a normal server!
Clustering Web Content
• General clustering framework (a greedy sketch follows this slide)
  • Define a correlation distance between URLs
  • Cluster diameter: the max distance between any two members, i.e., the worst correlation within a cluster
  • Generic clustering: minimize the max diameter over all clusters
• Correlation distance can be defined based on
  • Spatial locality
  • Temporal locality
  • Popularity
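A minimal sketch of such a framework: greedy agglomeration under a diameter bound. The fixed-threshold stopping rule is an assumption; the slide only specifies the min-max-diameter objective.

```python
# Greedy clustering sketch under the "minimize max cluster diameter"
# objective: assign each URL to the first cluster whose diameter stays
# within a bound; otherwise start a new cluster.

def cluster_urls(urls, distance, max_diameter):
    """urls: list of URL ids; distance(u, v): correlation distance."""
    clusters = []  # each cluster is a list of URL ids
    for u in urls:
        for cluster in clusters:
            # Adding u keeps the diameter bounded only if u is close
            # to every current member (diameter = worst pair).
            if all(distance(u, v) <= max_diameter for v in cluster):
                cluster.append(u)
                break
        else:
            clusters.append([u])
    return clusters
```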
Spatial Clustering
[Figure: a URL spatial access vector – e.g., accesses to the blue URL from client groups 1–4]
• Correlation distance between two URLs defined as either
  • Euclidean distance, or
  • Vector similarity (both made concrete in the sketch below)
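The two distance choices can be made concrete as follows; the access vectors here are toy data, and "vector similarity" is read as cosine similarity (an assumption):

```python
import math

# Spatial access vectors: accesses to each URL from four client groups.
url_a = [120, 3, 45, 0]
url_b = [100, 5, 50, 2]

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_distance(x, y):
    # Vector similarity as cosine similarity, turned into a distance.
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm = math.sqrt(sum(xi * xi for xi in x)) * math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / norm

print(euclidean(url_a, url_b), cosine_distance(url_a, url_b))
```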
Clustering Web Content (cont'd)
• Temporal clustering
  • Divide traces into individual users' access sessions [ABQ01]
  • In each session, compute pairwise URL correlations; average over multiple sessions in one day
• Popularity-based clustering
  • Correlation distance based on similarity of access frequency
  • Or even simpler: sort the URLs by popularity and put the first N/K elements into the first cluster, etc. (binary correlation; see the sketch below)
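The "even simpler" scheme reads as sorting by popularity and splitting into K equal-size buckets; a sketch under that reading, with toy access counts:

```python
# Popularity-based clustering sketch: sort URLs by access frequency and
# split them into K equal-size clusters ("first N/K elements" variant).

def popularity_clusters(freq, k):
    """freq: dict url -> access count; k: number of clusters."""
    ranked = sorted(freq, key=freq.get, reverse=True)
    size = -(-len(ranked) // k)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

freq = {"/index": 900, "/news": 850, "/sports": 300, "/weather": 280,
        "/archive": 20, "/about": 10}
print(popularity_clusters(freq, 3))
# [['/index', '/news'], ['/sports', '/weather'], ['/archive', '/about']]
```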
Performance of Cluster-based Replication
[Chart: MSNBC trace of 8/2/1999, 5 replicas/URL]
• Spatial clustering with Euclidean distance and popularity-based clustering perform the best
• A small # of clusters (only 1 – 2% of the # of URLs) achieves close to per-URL performance, with much less overhead
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Static Clustering and Replication
• Two daily traces: a training trace and a new trace
• Static clustering performs poorly beyond a week
• The retrieval cost of static clustering almost doubles the optimal!
Incremental Clustering
• Generic framework (sketched below)
  • If a new URL u matches an existing cluster c, add u to c and replicate u to the existing replicas of c
  • Else create new clusters and replicate them
• Two types of incremental clustering
  • Online: without any access logs – high availability
  • Offline: with access logs – close-to-optimal performance
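A sketch of the generic framework; `matches` and `replicate` are placeholders for whatever match rule (semantics online, access logs offline) and push mechanism are plugged in:

```python
# Incremental clustering framework sketch: route each newly published URL
# either into a matching existing cluster (and onto that cluster's
# replicas) or into a fresh cluster awaiting placement.

def incremental_cluster(new_urls, clusters, matches, replicate):
    """clusters: list of {'urls': [...], 'replicas': [...]} dicts."""
    for u in new_urls:
        target = next((c for c in clusters if matches(u, c)), None)
        if target is not None:
            target["urls"].append(u)
            replicate(u, target["replicas"])  # push to existing replicas
        else:
            clusters.append({"urls": [u], "replicas": []})  # place later
    return clusters
```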
Online Incremental Clustering
[Figure: hyperlink structure – URL1 links to URL2, URL3, URL4; URL2 links to URL5, URL6; URL4 links to URL3, URL7]
• Predict access patterns based on semantics
• Simplify to popularity prediction
• Which groups of URLs have similar popularity? Use hyperlink structures!
  • Groups of siblings
  • Groups of the same hyperlink depth (smallest # of links from the root)
Online Popularity Prediction
• Measure the divergence of URL popularity within a group via its access frequency span (one reading is sketched below)
• Experiments
  • Crawl http://www.msnbc.com on 5/3/2002 to hyperlink depth 4, then group the URLs
  • Use the corresponding access logs to analyze the correlation
  • Groups of siblings have the best correlation
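The slide's span formula is truncated; one natural reading, used below purely as an assumption, is the ratio of the largest to the smallest access frequency within a group:

```python
# Access-frequency-span sketch. Span is taken here as max/min access
# frequency within a group (an assumption): a value near 1 means the
# group's URLs have similar popularity, so groups with a small span are
# good units for replication decisions.

def access_freq_span(group, freq):
    counts = [freq.get(u, 0) for u in group]
    lo = min(counts)
    return float("inf") if lo == 0 else max(counts) / lo

freq = {"/news/a": 500, "/news/b": 430, "/news/c": 460, "/misc/z": 3}
siblings = ["/news/a", "/news/b", "/news/c"]
print(access_freq_span(siblings, freq))  # ~1.16: similar popularity
```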
Semantics-based Incremental Clustering
[Figure: a new URL joining the existing cluster that holds most of its siblings]
• Put a new URL into the existing cluster with the largest # of its siblings (sketched below)
  • In case of a tie, choose the cluster with more replicas
• Simulation on the 5/3/2002 MSNBC trace
  • 8–10am trace: static popularity-based clustering + replication
  • At 10am: 16 new URLs – online incremental clustering + replication
  • Evaluation with the 10am–12pm trace: the 16 URLs received 33K requests
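A sketch of the assignment rule above; `siblings` would come from the crawled hyperlink structure, and the helper names are illustrative:

```python
# Sibling-based cluster assignment sketch: a new URL joins the cluster
# holding the largest number of its siblings; ties are broken by replica
# count, as on the slide.

def assign_new_url(url, siblings, clusters, replica_count):
    """siblings: set of the URL's sibling URLs;
    clusters: dict cluster_id -> set of URLs;
    replica_count: dict cluster_id -> # of replicas."""
    def score(cid):
        shared = len(clusters[cid] & siblings)
        return (shared, replica_count[cid])  # tie-break on replicas

    best = max(clusters, key=score)
    if clusters[best] & siblings:
        clusters[best].add(url)
        return best
    return None  # no sibling anywhere: start a new cluster instead
```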
Online Incremental Clustering and Replication: Results
• Retrieval cost is 1/8 of that with no replication, and 1/5 of that with random replication
• About double the optimal retrieval cost, but with only 4% of the optimal's replication cost
Conclusions
• Cooperative, clustering-based replication
  • Cooperative push: only 4 – 5% of the replication/update cost of existing CDNs
  • URL clustering reduces the management/computational overhead by two orders of magnitude
  • Spatial clustering and popularity-based clustering recommended
• Incremental clustering to adapt to emerging URLs
  • Hyperlink-based online incremental clustering for high availability and performance improvement
• Replicas self-organize into an application-level multicast tree for update dissemination
• Scalable overlay network monitoring
  • O(M+N) instead of O(M*N), given M client groups and N servers
Outline • Architecture • Problem formulation • Granularity of replication • Incremental clustering and replication • Conclusions • Future Research
Future Research (I)
• Measurement-based Internet study and protocol/architecture design
  • Use inference techniques to develop Internet behavior models (network operators are reluctant to reveal internal network configurations)
  • Root-cause analysis: mining large, heterogeneous data sets
  • Leverage graphics/visualization for interactive mining
  • Apply a deeper understanding of Internet behaviors to reassess and design protocols and architectures
    • E.g., Internet bottlenecks – peering links? How and why? Implications?
Future Research (II)
• Network traffic anomaly characterization, identification and detection
  • Many unknown flow-level anomalies revealed by real router traffic analysis (AT&T)
  • Profile traffic patterns of new applications (e.g., P2P) → benign anomalies
  • Understand the cause, pattern and prevalence of other unknown anomalies
  • Identify malicious patterns for intrusion detection
    • E.g., fighting the Sapphire/Slammer worm
Challenges for CDN
• Over-provisioning for replication
  • Provide good QoS to clients (e.g., latency bounds, coherence)
  • Small # of replicas, with small delay and bandwidth consumption for updates
• Replica management
  • Scalability: billions of replicas if replicating per URL – O(10^4) URLs/server × O(10^5) CDN edge servers in O(10^3) networks
  • Adaptation to the dynamics of content providers and customers
• Monitoring
  • User workload monitoring
  • End-to-end network distance/congestion/failure monitoring
  • Measurement scalability; inference accuracy and stability
SCAN Architecture
[Figure: data plane – Web server as data source, replica caches, clients, adaptive coherence with "always update" links – layered over SCAN servers on a Tapestry mesh in the network plane; SCAN handles request location and dynamic replication/update and content management]
• Leverage Decentralized Object Location and Routing (DOLR) – Tapestry – for
  • Distributed, scalable location with guaranteed success
  • Search with locality
• Soft-state maintenance of the dissemination tree (one per object)
Wide-area Network Measurement and Monitoring System (WNMMS)
[Figure: SCAN edge servers and clients grouped into clusters A, B, C on the network plane; distances are measured from each host to its monitor and among monitors]
• Select a subset of SCAN servers to be monitors
• E2E estimation of distance, congestion, and failures (see the sketch below)
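A hedged sketch of how monitor-based measurement keeps probing at O(M+N): each host probes only its monitor, monitors probe among themselves, and other paths are estimated by composition. The additive composition rule below is an illustrative assumption, not SCAN's actual estimator.

```python
# O(M+N) monitoring sketch: only host->monitor and monitor<->monitor
# paths are probed; any host-to-host distance is estimated from those.

def estimate_distance(a, b, monitor_of, d_host, d_mon):
    """monitor_of[h]: h's monitor; d_host[h]: RTT from h to its monitor;
    d_mon[(m1, m2)]: RTT between two distinct monitors."""
    ma, mb = monitor_of[a], monitor_of[b]
    if ma == mb:
        return d_host[a] + d_host[b]               # same cluster
    return d_host[a] + d_mon[(ma, mb)] + d_host[b]  # via both monitors

monitor_of = {"c1": "mA", "c2": "mB"}
d_host = {"c1": 5.0, "c2": 8.0}
d_mon = {("mA", "mB"): 40.0, ("mB", "mA"): 40.0}
print(estimate_distance("c1", "c2", monitor_of, d_host, d_mon))  # 53.0
```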
Dynamic Provisioning
• Dynamic replica placement
  • Meets clients' latency constraints and servers' capacity constraints
  • Close-to-minimal # of replicas
• Replicas self-organize into an application-level multicast tree (sketched below)
  • Small delay and bandwidth consumption for update multicast
  • Each node maintains state only for its parent and direct children
• Evaluated by simulation with
  • Synthetic traces, with various sensitivity analyses
  • Real traces from NASA and MSNBC
• Publications: IPTPS 2002, Pervasive Computing 2002
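A minimal sketch of the parent/children-only state that makes the update tree scalable; the class and method names are illustrative, not SCAN's actual API:

```python
# Application-level multicast tree sketch: each replica node keeps state
# only for its parent and direct children, so an update from the root
# reaches every replica without any node knowing the full tree.

class ReplicaNode:
    def __init__(self, name):
        self.name = name
        self.parent = None    # only the parent ...
        self.children = []    # ... and direct children are tracked

    def attach(self, child):
        child.parent = self
        self.children.append(child)

    def disseminate(self, update):
        print(f"{self.name}: applying {update!r}")
        for child in self.children:  # forward down the tree
            child.disseminate(update)

root = ReplicaNode("origin")
r1, r2, r3 = ReplicaNode("r1"), ReplicaNode("r2"), ReplicaNode("r3")
root.attach(r1); root.attach(r2); r1.attach(r3)
root.disseminate("object v2")
```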
Effects of the Non-Uniform Size of URLs
[Figure: the four-node example topology]
• Replication cost constraint measured in bytes
• Similar trends hold
  • Per-URL replication outperforms per-Web-site replication dramatically
  • Spatial clustering with Euclidean distance and popularity-based clustering are very cost-effective
Diagram of Internet Iso-bar
[Figure: end hosts grouped into clusters A, B, C around landmarks; one host per cluster serves as the monitor; distance probes go from each monitor to its hosts and among monitors]
Real Internet Measurement Data
• NLANR Active Measurement Project data set
  • 119 sites in the US (106 after filtering out mostly-offline sites)
  • Round-trip time (RTT) between every pair of hosts, measured every minute
  • Raw data: 6/24/00 – 12/3/01
• Keynote measurement data
  • Measures TCP performance from about 100 agents
  • Heterogeneous core network: various ISPs
  • Heterogeneous access networks: dial-up 56K, DSL, and high-bandwidth business connections
  • Targets
    • Web site perspective: 40 most popular Web servers
    • 27 Internet Data Centers (IDCs)