Supporting Content-Addressable Caching with CZIP Compression KyoungSoo Park, Sunghwan Ihm, Mic Bowman* and Vivek Pai Princeton University *Intel Research
Content-Based Naming (CBN) • Naming scheme based on its content • Name = one-way hash (content) • Hashing function: MD5, SHA-1, etc. • Rabin’s fingerprint for chunk detection • Redundancy elimination • Network-traffic/storage systems • Research/commercial systems • Special-purpose systems USENIX 2007
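The naming and chunk-detection bullets above can be sketched as follows. This is a minimal illustration, not the CZIP implementation: `content_name` and `split_chunks` are hypothetical names, and a simple polynomial rolling hash stands in for Rabin's fingerprint. The window, mask, and minimum chunk size are arbitrary assumptions.

```python
import hashlib

BASE = 257
MOD = (1 << 31) - 1


def content_name(data: bytes) -> str:
    """Name = one-way hash(content); SHA-1 is one of the functions the slide lists."""
    return hashlib.sha1(data).hexdigest()


def split_chunks(data: bytes, window: int = 16, mask: int = 0x3F, min_size: int = 32):
    """Content-defined chunking: cut where a hash of the last `window` bytes
    matches a bit pattern. The rolling hash is a stand-in for Rabin's fingerprint."""
    chunks = []
    start = 0
    h = 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # add the newest byte
        if i + 1 - start >= min_size and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


data = bytes(range(256)) * 8
chunks = split_chunks(data)
names = [content_name(c) for c in chunks]
```

Because a chunk's name depends only on its bytes, two files that share a chunk produce the same name for it, which is what lets a cache or store keep a single copy.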
Where Can CBN be Applied? • Similar file distribution • Linux distribution mirror • DVD ISO contains all CD ISOs • Virtual machine image migration • Base OS takes up majority of content • httpd VM vs. httpd+mysqld VM • Uncacheable Web content • Some dynamic content doesn’t change
Contribution of This Work • Generic CBN tool • Easy to build new systems • Easy to upgrade existing non-CBN systems • CZIP compression + CZIP-aware apps • Can be used on existing platforms • Provides benefit to non-CZIP apps • Demonstrate sample systems • Reduces FC6 mirror memory footprint by half • Comparable compression speed to GZIP’s • 2x throughput for CZIP-aware Apache • 4x origin server BW reduction for CZIP-aware CDN
CZIP Compression • Compression scheme like GZIP, BZIP2 • Export CBN information in the header [Figure: CZIP/UNCZIP diagram; labels: CZIP Header, Header, Global Fields, Chunk Index 1-5, chunks A, B, C]
CZIP Header • Header = global attributes + chunk info • Global attributes • One-way hash function (SHA-1/MD5) • Chunk data compression (GZIP/BZIP2) • Convergent encryption (on/off) • Header CRC, File Hash, etc. • Chunk information • Content hash, start offset, chunk size
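The header layout described above might be modeled roughly as below. This is a sketch under assumptions, not the actual CZIP on-disk format: the class and field names are hypothetical, `start_offset` is assumed to refer to the original file, fixed-size chunking is used for brevity, and the CRC coverage is illustrative only.

```python
import hashlib
import zlib
from dataclasses import dataclass
from typing import List


@dataclass
class ChunkInfo:
    content_hash: str  # content hash of the chunk (its CBN name)
    start_offset: int  # assumed: offset within the original file
    chunk_size: int    # chunk length in bytes


@dataclass
class CzipHeader:
    hash_function: str           # "SHA-1" or "MD5"
    chunk_compression: str       # "GZIP" or "BZIP2"
    convergent_encryption: bool  # on/off
    file_hash: str               # hash of the whole original file
    chunks: List[ChunkInfo]
    header_crc: int = 0          # filled in by build_header


def build_header(data: bytes, chunk_size: int = 4) -> CzipHeader:
    """Index fixed-size chunks of `data` (hypothetical helper)."""
    chunks = []
    for off in range(0, len(data), chunk_size):
        piece = data[off:off + chunk_size]
        chunks.append(ChunkInfo(hashlib.sha1(piece).hexdigest(), off, len(piece)))
    header = CzipHeader("SHA-1", "GZIP", False,
                        hashlib.sha1(data).hexdigest(), chunks)
    # Illustrative CRC over the chunk hashes only; the real coverage is not given.
    joined = "".join(c.content_hash for c in chunks).encode()
    header.header_crc = zlib.crc32(joined)
    return header


header = build_header(b"AAAABBBBAAAA")
unique = {c.content_hash for c in header.chunks}
print(len(header.chunks), "index entries,", len(unique), "unique chunks")
# prints: 3 index entries, 2 unique chunks
```

The index keeps one entry per position in the file, while repeated content shows up as repeated hashes, so a reader of the header alone can tell which chunks it already holds.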
Deployment Scenario • CZIP-aware server [Figure: Client A and Client B fetch file1.cz and file2.cz from a server with a CBN cache. file1.cz: header, Chunk A (xyzlo5g), Chunk B (asdfghk). file2.cz: header, Chunk A (xyzlo5g), Chunk B (asdfghk), Chunk C (qoiertty). Server steps: read header of file1.cz, read chunks; read header of file2.cz, read chunk C]
Deployment Scenario • CZIP-aware client-side proxy • Proxy request for the missing chunk: GET /file2.cz Range: bytes=1000-1999 X-SHA-1: qoiertty • 1. X-SHA-1 field helps CZIP-aware server • 2. Browser cache can support CBN too! [Figure: Client A and Client B behind a proxy with a CBN cache holding file1.cz's header, Chunk A (xyzlo5g), and Chunk B (asdfghk); the server reads only chunk C (qoiertty) of file2.cz]
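A rough sketch of the proxy-side decision in this scenario: consult the CBN cache by chunk name and issue a byte-range request only for chunks that are missing. The function names and the offsets of chunks A and B are hypothetical; only the chunk C range and the `X-SHA-1` header come from the slide.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[str, int, int]  # (content name, start offset, size)


def build_chunk_request(path: str, name: str, start: int, size: int) -> List[str]:
    """Request line and headers for one chunk, as shown on the slide."""
    end = start + size - 1  # HTTP byte ranges are inclusive
    return [f"GET {path}", f"Range: bytes={start}-{end}", f"X-SHA-1: {name}"]


def plan_fetch(cache: Dict[str, bytes], path: str,
               chunks: List[Chunk]) -> List[List[str]]:
    """Return requests only for chunks whose names are not in the CBN cache."""
    return [build_chunk_request(path, name, start, size)
            for name, start, size in chunks if name not in cache]


# Cache already holds chunks A and B from an earlier download of file1.cz.
cache = {"xyzlo5g": b"chunk A bytes", "asdfghk": b"chunk B bytes"}
# Chunk list as read from file2.cz's header (A and B offsets are made up).
file2 = [("xyzlo5g", 0, 500), ("asdfghk", 500, 500), ("qoiertty", 1000, 1000)]
requests = plan_fetch(cache, "/file2.cz", file2)
```

Here `requests` contains a single entry for chunk C, matching the request shown on the slide.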
Compressibility • Fedora Core 6 ISOs / All files / Wikipedia DB [Figure: bar chart of data compression ratio (0 to 1) for CZIP+plain, CZIP+gzip, CZIP+bzip2, GZIP, and BZIP2 on FC6_i386_ISOs.tar (6.7 GB), FC6_All_files.tar (49.7 GB), and Wikipedia_DB.tar (7.9 GB); bar labels as extracted: 7.9, 6.5, 6.5, 48.3, 48.5, 3.3, 3.2, 3.2, 20.3, 19.9, 19.6, 2.7, 2.5, 2.5, 1.9]
Compression speed • On Pentium D 2.8GHz with 4GB memory [Figure: compression time chart; labels as extracted: 29,004 secs, 3,151 secs, 3,964 secs]
Virtual Machine Images • Server consolidation/management • Much redundancy among similar VMs • Xen FC4 base image (X) • X + httpd (Y) / Y + mysqld (Z) • Investigating content overlap over • Chunk size • Chunking methods • Rabin’s fingerprint vs. fixed-sized • After extensive use
Chunk Size / Chunking Methods • Compare three VM images • Base = Xen FC4 image • Apache = Base + httpd • Both = Apache + mysqld [Figure: two panels, Rabin’s fingerprint and fixed-sized chunking]
Real VM Images • EC1 ~ EC5: VMs based on Xen FC-4 + standard tools • Used daily by five different engineers for three weeks
Dynamic Web Pages • Observed the front page of these sites • Google News • CNN • Slashdot • Digg.com • Fark.com • New York Times • All of them non-cacheable • “no-cache”, “no-store” or “private”
Average Content Overlap • Downloaded pages every 10 minutes for 18 days
Potential Data Savings via CZIP [Figure: savings chart; labels as extracted: 37%, 39%, 61%, 24%, 57%, 90%]
Summary So Far • CZIP is comparable to GZIP in speed and performance • CZIP is far better on files with much redundancy • Redundancy decreases as chunk size increases • Rabin’s fingerprint exposes a good deal of redundancy regardless of chunk size • Optimal chunk size varies by workload • Bigger chunk size is better for network transfer • Dynamic content also exposes redundancy • CZIP can save 24-90% of BW compared with GZIP
Server Performance • CZIP Apache Module • Test scenario (FC mirror simulation) • 1.5 GB from FC6 DVD • 1.5 GB is split into three 0.5 GB images • Each file is requested in round-robin fashion • 100-300 clients simulated by six machines in LAN • Server is 2.8GHz Pentium D w/ 2GB physical memory and 2 Gbps-NICs
CZIP Apache Module • Worst client in CZIP-aware Apache is faster than 91% of normal Apache clients [Figure: chart labels as extracted: 90%, 2.56 times; Median, 2.07 times]
CBN-Aware Content Distribution • CoBlitz large-file CDN [NSDI’06] • Serving 1-2 TB every day on PlanetLab • http://coblitz.codeen.org/URL • University channel – podcast/vodcast • Fedora Core mirror, Citeseer etc. • Chunk is basic caching unit • Parallel chunk requests/responses • Chunk request in HTTP byte-range query
Making CoBlitz CZIP-Aware • CoBlitz’s chunk request: GET /coblitz.codeen.org/www.cs.princeton.edu/bigfile.cz,start=1000,end=1999 HTTP/1.0 Host: coblitz.codeen.org • CZIP-aware CoBlitz (C-CoBlitz) request: GET /czip.codeen.org/Chunk_SHA-1_Hash HTTP/1.0 Host: czip.codeen.org X-URL: www.cs.princeton.edu/bigfile.cz X-Range: byte=1000-1999
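The two request shapes on this slide can be generated side by side, as in the sketch below. The function names are hypothetical, and the header spellings (including `X-Range: byte=`) are copied from the slide rather than from any CoBlitz documentation.

```python
from typing import List


def coblitz_request(url: str, start: int, end: int) -> List[str]:
    """Regular CoBlitz chunk request: the chunk is named by URL and byte range."""
    return [
        f"GET /coblitz.codeen.org/{url},start={start},end={end} HTTP/1.0",
        "Host: coblitz.codeen.org",
    ]


def c_coblitz_request(url: str, start: int, end: int, chunk_hash: str) -> List[str]:
    """C-CoBlitz chunk request: the chunk is named by its content hash, and
    the original URL and range travel in extra headers."""
    return [
        f"GET /czip.codeen.org/{chunk_hash} HTTP/1.0",
        "Host: czip.codeen.org",
        f"X-URL: {url}",
        f"X-Range: byte={start}-{end}",
    ]


url = "www.cs.princeton.edu/bigfile.cz"
regular = coblitz_request(url, 1000, 1999)
czip_aware = c_coblitz_request(url, 1000, 1999, "Chunk_SHA-1_Hash")
```

One plausible reading of this design is that putting the hash in the request path lets caches key on content, so the same chunk requested via two different files maps to one cache entry.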
CZIP-Aware CoBlitz Testing • Two content-overlapping files • Simultaneously fetch from 100 PlanetLab nodes • Origin server is at Princeton • Testing cases • Regular: Download original files by regular CoBlitz • File-CZIP: Download CZIP’ed files by regular CoBlitz • CZIP-CDN: Download CZIP’ed files by C-CoBlitz
100 MB File Downloading [Figure: bar chart comparing Regular, File-CZIP, and CZIP-CDN; labels as extracted: 388 MB; 273 MB, 29.6%; 191 MB, 29.7%]
50 MB File Downloading [Figure: bar chart comparing Regular, File-CZIP, and CZIP-CDN; labels as extracted: 183 MB; 92 MB, 49.7%; 24 MB, 73.9%]
Conclusion • CZIP is a generic compression tool providing CBN benefits • CZIP is comparable to GZIP in compression performance • CZIP helps greatly reduce memory footprint in serving similar files • It is very easy to support CZIP and the benefit is transparent
Thank you! More information can be found at http://codeen.cs.princeton.edu/czip/ CZIP code will be released soon!
200/300 Clients [Figure: two panels, 200 clients and 300 clients; labels as extracted: 90%, 2.27 times; 90%, 2.11 times; 80%; 65%; Median, 1.95 times; Median, 1.84 times]