340 likes | 496 Views
The Stretched Exponential Distribution of Internet Media Access Patterns. Lei Guo @ Yahoo! Inc. Enhua Tan @ Ohio State University Songqing Chen @ George Mason University Zhen Xiao @ Peking University Xiaodong Zhang @ Ohio State University. Presented in PODC 2008.
E N D
The Stretched Exponential Distribution of Internet Media Access Patterns Lei Guo @ Yahoo! Inc. Enhua Tan @ Ohio State University Songqing Chen @ George Mason University Zhen Xiao @ Peking University Xiaodong Zhang @ Ohio State University Presented in PODC 2008
Media content on the Internet • Video applications are mainstream • Video traffic is doubling every 3 to 4 months No. 3 • Yahoo • Google • YouTube
CDN ♫ wireless Seldom used in reality Existing media delivery systems P2P network • High CPU and bandwidth consumption, unsatisfactory QoS • Caching for performance improvement • A lot of research studies • Commercial products • Access patterns play an important role • Exploit temporal locality • High performance with cost effective hardware
log y y slope: -a heavy tail log i i Zipf model of Internet traffic patterns • Zipf distribution (power law) • Characterizes the property of scale invariance • Heavy tailed, scale free • 80-20 rule • Income distribution: 80% of social wealth owned by 20% people (Pareto law) • Web traffic: 80% Web requests access 20% pages (Breslau, INFOCOM’99) • System implications • Objectively caching the working set in proxy • Significantly reduce network traffic Reference rank distribution i : rank of objects yi : number of references a:0.6~0.8
Web media systems VoD media systems P2P media systems Live streaming and IPTV systems Does Internet media traffic follow Zipf’s law? Chesire, USITS’01: Zipf-like Cherkasova, NOSSDAV’02: non-Zipf Acharya, MMCN’00: non-Zipf Yu, EUROSYS’06: Zipf-like Veloso, IMW’02: Zipf-like Sripanidkulchai, IMC’04: non-Zipf Gummadi, SOSP’03: non-Zipf Iamnitchi, INFOCOM’04: Zipf-like
heuristic assumptions Inconsistent media access pattern models • Still based on the Zipf model • Zipf with exponential cutoff • Zipf-Mandelbrot distribution • Generalized Zipf-like distribution • Two-mode Zipf distribution • Fetch-at-most-once effect • Parabolic fractal distribution • … • All case studies • Based on one or two workloads • Different from or even conflict with each other • An insightful understanding is essential to • Content delivery system design • Internet resource provisioning • Performance optimization
Outline • Motivation and objectives • Stretched exponential of Internet media traffic • Dynamics of access patterns in media systems • Caching implications • Concluding remarks
Workload summary • 16 workloads in different media systems • Web, VoD, P2P, and live streaming • Both client side and server side • Different delivery techniques • Downloading, streaming, pseudo streaming • Overlay multicast, P2P exchange, P2P swarming • Data set characteristics • Workload duration: 5 days - two years • Number of users: 103 - 105 • Number of requests: 104 - 108 • Number of objects: 102 - 106 nearly all workloads available on the Internet all major delivery techniques data sets of different scales
log y yc fat head slope:-a b thin tail log i log i Stretched exponential distribution • Media reference rank follows stretched exponential distribution (verified Chi-square test) • Reference rank distribution: • fat head and thin tail in log-log scale • straight line in logx-yc scale (SE scale) i : rank of media objects (N objects) y : number of references c: stretch factor
Set 1: Web media systems (server logs) fat head thin tail ST-SVR-01 (15 MB) *HPLabs-99 (120 MB) *HPC-98 (14 MB) powered scaleyc log scale c = 0.22 R2 ~ 1 log scale in x axis x: rank of media object y: number of references to the object Y left: y^c scale R2: coefficient of determination (1 means a perfect fit) Y right: log scale HPC-98: enterprise streaming media server logs of HP corporation (29 months) HPLabs: logs of video streaming server for employees in HP Labs (21 months) ST-SVR-01: an enterprise streaming media server log workload like HPC-98 (4 months)
Set 2: Web media systems (req packets) fat head thin tail ST-CLT-05 (4.5 MB) ST-CLT-04 (2 MB) PS-CLT-04 (1.5 MB) powered scaleyc log scale log scale in x axis x: rank of media object y: number of references to the object PS-CLT-04: HTTP downloading/pseudo streaming requests, 9 days ST-CLT-04: RTSP/MMS on-demand streaming requests, 9 days ST-CLT-05: RTSP/MMS on-demand streaming requests, 11 days All collected from a big cable network hosted by a large ISP
mMoD-98: logs of a multicast Media-on-Demand video server, 194 days CTVoD-04: streaming serer logs of a large VoD system by China telecom, 219 days, reported as Zipf in EUROSYS’06 IFILM-06: number of web page clicks to video clips in IFILM site, 16 weeks (one week for the figure) YouTube-06: cumulative number of requests to YouTube video clips, by crawling on web pages publishing the data Set 3: VoD media systems fat head thin tail *CTVoD-04 (300 MB) *mMoD-98 (125 MB) powered scaleyc log scale log scale in x axis YouTube-06 (3.4 MB) IFILM-06 (2.25 MB)
Set 4: P2P media systems *KaZaa-03 (5 MB) BT-03 (636 MB) *KaZaa-02 (300 MB) KaZaa-02: large video file (> 100 MB. Files smaller than 100 MB are intensively removed) transferring in KaZaa network, collected in a campus network, 203 days. KaZaa-03: music files, movie clips, and movie files downloading in KaZaa network, 5 days, reported as Zipf in INFOCOM’04. BT-03: 48 days BitTorrent file downloading (large video and DVD images) recorded by two tracker sites
Set 5: Live streaming and movie pictures Movie-02 IMDB-06 Akamai-03 Akamai-03: server logs of live streaming media collected from akamai CDN, 3 months, reported as two-mode Zipf in IMC’04 Movie-02: US movie box office ticket sales of year 2002. IMDB-06: cumulative number of votes for top 250 movies in Internet Movie Database web site
Why Zipf observed before? ad server cache proxy media server • Media traffic is driven by user requests • Intermediate systems may affect traffic pattern • Effect of extraneous traffic (ads video insertion) • Filtering effect due to caching (through a proxy) • Biased measurements may cause Zipf observation
ads clip flag clip video prog ads clip flag clip video prog Extraneous media traffic ads server ads clip meta file link web server flag clip ad and flag video are pushed to clients mandatorily streaming media server video program
Reference rates without extraneous traffic SE 2004 SE 2005 prog ads flag Effects of extraneous traffic on reference rank distributions • Do not represent user access patterns • High request rate (high popularity) • High total number of requests • Not necessary Zipf with extraneous traffic • Extraneous traffic changes • Always SE without extraneous traffic • Small object sizes, small traffic volume with extraneous traffic 2004: 2 objects 2005: merged into 1 object Non-Zipf Zipf 2005 2004
Web workload: caching can cause a “flattened head” in log-log scale Stretched exponential is not caused by caching effect Local replay events can be traced by WM/RM streaming media protocols log y Zipf Filtered by Web cache log i log y Stretched exponential log i Caching effect Server logs Cache validation packet sniffer Play summary
“Rich-get-richer” and Zipf / power law Pareto law: income distribution Web access is Zipf Popular pages can attract more users Pages update to keep popular Yahoo ranks No.1 more than six years Zipf-like for long duration Media access is not Popularity decreases with time exponentially Media objects are immutable Rich-get-richer not present Non-Zipf in long duration CCDF of req (log) 103 ------ raw data ------ linear fit 102 BitTorrent media file 101 100 0 100 200 Time after object birth (day) Why media access pattern is not Zipf --- Web --- Video 16 1 Number of distinct weekly top N popular objects in 16 weeks Top 1 Web object never changes Top 1 video object changes every week
Outline • Motivation and objectives • Stretched exponential of Internet media traffic • Dynamics of access patterns in media systems • Caching implications • Concluding remarks
yc c: stretch factor # of references slope:-a rank log i Dynamics of access patterns in media systems • Media reference rank distribution in log-log scale • Different systems have different access patterns • The distribution changes over time in a system (NOSSDAV’02) • All follow stretched exponential distribution • Stretch factor c • Minus of slope a • Physical meanings • Media file sizes • Aging effects of media objects • Deviation from the Zipf model
Stretch factor and media file size • Other issues • Different encoding rates (file length in time is a better metric) • Different content type: entertainment, educational, business • Stretch factor c reflects the view time of a video object EDU file size vs. stretch factor c • 0 – 5 MB: c <= 0.2 • 5 – 100 MB: 0.2 ~ 0.3 • > 100 MB: c >= 0.3 BIZ c increases with file size
# requested obj. over time slope =obj yc c: stretch factor # of references slope:-a 0 rank log i Stretched exponential parameters over time • In a media system over time t • Constant request rate req • Constant object birth rate obj • Constant median file size • Stretch factor c is a time invariant constant • Parameter a increases with time t Objects created in (-, 0]: ~O(log t) Popularity decreases exponentially Objects created in [0, t)
Evolution of media reference rank distribution Web media P2P media Parametera Parameter a increase with time converge to a constant Caused by objects created before workload collection Reference rank distribution (slope = -a)
SE Zipf F E |EF||OE| Big files Deviation from the Zipf model reference # (log) EF OE O rank (log) • a increases with c(c < 2) • a increases with req/obj • a increases with t Big media files have large deviation Deviation increases with time
Example: YouTube Video Measurements in IMC’07 80 days in a campus network All downloads for all time Zipf non-Zipf Campus users: small request rate req Global users: large request rate req No old objects N’(t) = 0 Short time, old objects dominant: N’(t) >> objt Small a, small deviation Large a, large deviation
Outline • Motivation and objectives • Stretched exponential of Internet media traffic • Dynamics of access patterns in media systems • Caching implications • Conclusion
Analyze media caching for short term workload Distribution is stationary (evolution takes time) Requests are independent Object has unit size log y logy Zipf-like Stretched exponential log i log i Caching analysis methodology • Intuition from the distribution shape Highly concentrated requests Less concentrated requests
Modeling caching performance Asymptotic analysis for small cache size k (k << N) Zipf SE Web media Parameter selection Zipf: typical Web workload ( = 0.8) SE: typical streaming media workload (c = 0.2, a = 0.25, same as ST-CLT-05) Media caching is far less efficient than Web caching
Potential of long term media caching Short term: Requests dominated by old objects, dilute concentration Long term: Requests dominated by new objects Request concentration: significantly increased Request correlation: objects can be purged when unpopular To achieve maximal concentration Very long time (months to years) and huge amount of storage Peer-to-peer systems are scalable for this purpose CDF of requests in 11 days CDF of object ages old objects object popularity decreases exponentially new objects
Concluding Remarks • Media access patternsdo not fitZipf model • Media access patterns arestretched exponential • Our findings implies that • Client-server based proxy systemsare not effectiveto deliver media contents • P2P systemsare most suitable for this purpose • We provide ananalytical basisfor the effectiveness of P2P media content delivery infrastructure