YouTube Traffic Characterization: A View From the Edge. P. Gill, M. Arlitt, Z. Li, A. Mahanti. ACM Internet Measurement Conference (IMC), San Diego, CA, USA, October 2007.
Introduction • Web is metamorphosing • Easy for users to create, share and distribute content - social networks • Facebook, MySpace, Flickr, YouTube, LinkedIn • Shared video! • Often called Web 2.0 • Users don’t just consume content but post their own • Tag to indicate interest
Study of YouTube (1 of 2) • YouTube largest video sharing site on Internet [29] • 100 million views per day • Up to 60% of videos watched on the Internet • 65,000 uploads per day • This paper studies local and global view
Study of YouTube (2 of 2) • Provide measurement framework • Extended time study while considering privacy • Characterize Web 2.0 traffic • Examine implications of observed traffic • Usage patterns, file properties, popularity, etc. • Effectiveness of caching
Outline • Introduction • Background • Related Work • Data Collection Framework • Analysis • Conclusions
Background (1 of 2) • YouTube founded Feb 2005 • Initially, only video sharing • Added ability to “tag” with keywords or phrases to describe content • Time magazine’s 2006 Person of the Year was “You” • Acquired by Google for $1.65 billion in 2006
Background (2 of 2) • Uses Adobe Flash Video (.flv) • Doesn’t usually require user installation • 90% of users have Flash installed • Users upload many types (.wmv, .mpeg, .avi) • Converted to .flv • Delivery over HTTP (TCP), but starts playback before finishing • Many other requests (references to pages, video links, etc.)
Related Work • Many traditional Web workloads analyzed [6,7,18,20,24,31] • Caching should improve experience • Does this hold for Web 2.0? • Media stored on the Web [2,15,25,28,38] and media in corporations [3,4,14,16,27,43] • Sizes heavy-tailed • Arrival influenced by time of day • Popularity fits Zipf distribution (frequency of occurrence inversely proportional to rank) • Some other YouTube crawling [13, 26] but not also from the edge
Outline • Introduction • Background • Related Work • Data Collection Framework • Analysis • Conclusions
Data Collection Framework • Multiple levels • Monitor usage at University of Calgary • 28k students, 5.3k staff • Understand how YouTube used • Collect stats on most popular videos • Can compare and contrast with local popularity
Local Collection • Collect data on all YouTube usage at U of Calgary • For extended period of time • Challenging • Old hardware • Many servers (CDNs) • Lengthy traces • 300 Mb/s full duplex (WPI is 500 Mb/s)
Local Collection Methodology • Identify servers that provide YouTube content • Use bro [9] to collect info on HTTP transactions to those servers • bro is intrusion detection system • Restart bro daily, compress logs (more on each, next)
1. Identify Server • tcpdump to identify servers while browsing YouTube • whois to determine affiliation • Two networks: youtube and youtube2 • Gather traces to any IP on those two networks • Some content also came from a CDN (Limelight Networks), recognizable because requests carry YouTube in the HTTP Host: field
2. Extract Summaries (1 of 4) • bro extracts summaries of HTTP transactions in real-time • Record TCP connection (duration, RTT), HTTP request (method, host), HTTP response (status, length, data) • Privacy • Convert YouTube visitor ID to unique number (mapping not recorded) • Valid only for 24 hours (until bro restarted) • So, can look at some longevity (one day) but don’t know users
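The privacy step above (visitor ID replaced by a number, mapping never recorded, valid only until the daily restart) could be sketched as follows. The slides do not say how bro performed the conversion, so the keyed hash, the class name `DailyAnonymizer`, and the method `anonymize` are assumptions made for illustration only.

```python
import hashlib
import secrets


class DailyAnonymizer:
    """Hypothetical sketch: maps visitor IDs to opaque numbers.

    A random key is generated in memory when the object is created and is
    never written to disk, so the mapping cannot be reversed or reproduced
    after a restart (assumed here to happen once per day).
    """

    def __init__(self) -> None:
        # Fresh secret per run; it disappears when the process exits.
        self._key = secrets.token_bytes(32)

    def anonymize(self, visitor_id: str) -> int:
        # Keyed BLAKE2b digest, truncated to 8 bytes and read as an integer.
        digest = hashlib.blake2b(
            visitor_id.encode("utf-8"), key=self._key, digest_size=8
        ).digest()
        return int.from_bytes(digest, "big")


if __name__ == "__main__":
    today = DailyAnonymizer()
    # Same visitor, same number within one run, so one-day activity can be linked.
    print(today.anonymize("visitor-abc") == today.anonymize("visitor-abc"))
```

Within one run the same visitor always gets the same number, which allows the one-day longevity analysis; a new run draws a new key, so numbers from different days cannot be linked.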
2. Extract Summaries (2 of 4) • Complete – connection fully parsed • Interrupted – TCP connection reset • Gap – monitor missing packet • Failure – unable to parse (Only about 9% fail)
2. Extract Summaries (3 of 4) • Very few completed • Load higher for video and longer • Fortunately, most analysis from headers • Ex: can analyze interrupted too • Next … why videos interrupted?
2. Extract Summaries (4 of 4) • 10% of interrupted transactions had a download rate slower than the video bit rate
Global Collection Methodology • Crawling not permitted and not practical • Focus on top 100 most viewed videos • Pareto principle (80-20 rule): 80% of traffic goes to 20% of videos (sometimes 90-10) • Two steps: • Retrieve pages listing most viewed videos (day, week, month, all time) • Each scattered over 20 pages • Use APIs to get statistics on videos
Outline • Introduction • Background • Related Work • Data Collection Framework • Analysis • Conclusions
Local Summary Statistics • 85 days, 1/14/2007 to 4/8/2007 • Includes mid-semester break • 23,250,438 valid HTTP transactions • Only 3% for video transfer (625,593), but account for 99% of bytes transferred • 50% for previous requests (have potential to be cached)
HTTP Request Methods • Most GET (pages and videos) • POST to rate, comment and upload videos • Note, 28,655 POSTs / 625,593 video retrievals is about 5% • Only 133 uploads (0.01%) • YouTube says about 0.07% … maybe residential users POST more?
Content Types • Analyze HTTP 200 responses (full size content) • Images and text = 86%, Applications = 10%, Videos = 3% • Videos 98.6% of bytes • Note middle – videos large, mean and median similar (not skewed) and CoV less than one • Note transfer sizes similar, except for applications
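The "mean and median similar, CoV less than one" observation rests on three summary statistics that are easy to compute per content type. A minimal sketch follows; the function name `summarize` and the sample sizes are made up for illustration and are not data from the study.

```python
import statistics


def summarize(sizes):
    """Return (mean, median, coefficient of variation) for a list of sizes.

    CoV is the standard deviation divided by the mean; a value below one
    together with mean close to median suggests a distribution that is not
    strongly skewed.
    """
    mean = statistics.fmean(sizes)
    median = statistics.median(sizes)
    cov = statistics.pstdev(sizes) / mean
    return mean, median, cov


if __name__ == "__main__":
    # Hypothetical object sizes in MB.
    mean, median, cov = summarize([2, 4, 4, 4, 5, 5, 7, 9])
    print(f"mean={mean:.2f} median={median:.2f} CoV={cov:.2f}")
```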
Local YouTube Utilization • Note cycles (what are they from?) • Number increasing gradually • Popularity plus press • Video not so many transactions (log scale), but most of the bytes • Focus on video for the rest of analysis
Use of CDN • Not so much … maybe there is a cost to Google?
Time of Day • Peaks in the day, but still a lot 12-4am (dorms?) • Weekdays more than weekend • A residential campus maybe more even than workplace or homeplace
Global Characteristics • 85 days * 100 videos/category/day = 8500 videos / category • Daily varies, but others much more slowly • Typical ratings 4+ (out of 5), mean and median similar • Videos with long-term popularity tend to be shorter (2.5-3.5 minutes) • But converse (short duration more popular) may not be true
Video Characteristics – File Size • Applications small (not logscale on x-axis) • Images, too (probably thumbnails) • Videos orders of magnitude larger than typical Web • Note, YouTube policy of 100 MB limit (but a few larger) • Only 10% video requests larger than 22 MB
Video Characteristics – Duration • Using YouTube API • Cap at 10 minutes (but “director” accounts can be longer) • Some much longer (one 275 days!), so error maybe when converting to .flv • Limit analysis to those 2 hours or less • Mean 4.15, median 3.33, CoV 1 • Slightly longer than [Li et al.] (was about 2 minutes)
Video Characteristics – Bit Rate (Local) • Derived from file size / duration • Small number extremely low (10 Kbps) – dialup (Li had 30%) • Median 328 Kbps – broadband (Li had 200 Kbps) • Most between 300-400 Kbps, somewhat higher since networks getting high bitrates?
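The derivation "file size / duration" amounts to one unit conversion. A minimal sketch, assuming sizes in bytes, durations in seconds, and 1 Kbps = 1000 bits per second; the function name and the example numbers are hypothetical.

```python
def bitrate_kbps(file_size_bytes: float, duration_seconds: float) -> float:
    """Average bit rate in Kbps, derived from file size and duration."""
    if duration_seconds <= 0:
        # Guards against bad metadata such as zero-length durations.
        raise ValueError("duration must be positive")
    return file_size_bytes * 8 / duration_seconds / 1000


if __name__ == "__main__":
    # Hypothetical video: 8.2 MB lasting 200 seconds.
    print(bitrate_kbps(8_200_000, 200))
```

Because the result is an average over the whole file, it says nothing about variation within a video, and a wrong duration (as in the duration slide) yields a wrong bit rate.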
Video Characteristics – Age of Videos • Difference between upload time (from API) and time observed on campus or time of appearance on a most-popular list • Daily almost all less than 3 days • All time older • Retrieval on campus is closer to all time • But have they been updated?
Video Characteristics – Age of Videos • Time since update (comments, rating …) • Campus viewers may enjoy older content, still provide feedback
Video Characteristics – Age of Videos • Applications and text 14 days or so • Video and image longer, 50% 90 days or so • Potential for video to be cached
Video Characteristics – Ratings • Important part of Web 2.0 is user interaction • Users can rate videos with 0 to 5 “stars” • Mean 3+ over 80% of time (users like what they watch) • Popular lists 4+, Campus 4.18
Video Characteristics – Categories • Daily is news and sports, then decrease recent • Comedy, entertainment, music long term • DIY, etc. not popular … suggests users watch YouTube videos for entertainment, not information
File Popularity • Important for building systems for planning and caching. • Zipf analysis • Rank objects most to least popular • Frequency (F) related to rank of object (R) by: F ~ R^(-β), where β is close to 1 • To determine, plot list on log-log scale • If line, then Zipf
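The log-log test described above can be sketched as a least-squares line fit on log(rank) versus log(frequency). This is an illustrative version using only the standard library, not the authors' tooling; the function name `zipf_fit` and the synthetic input are assumptions.

```python
import math


def zipf_fit(counts):
    """Fit F ~ R^(-beta) by linear regression on a log-log scale.

    counts: request counts per object, in any order.
    Returns (beta, r_squared); r_squared near 1 means the ranked
    frequencies fall close to a straight line, i.e. Zipf-like.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two objects with non-zero counts")
    xs = [math.log10(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log10(c) for c in ranked]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if syy == 0:
        raise ValueError("all counts are equal; slope is undefined")
    slope = sxy / sxx
    return -slope, (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Synthetic counts that follow F = 1000 / R exactly.
    beta, r2 = zipf_fit([1000 / r for r in range(1, 51)])
    print(f"beta={beta:.2f} R^2={r2:.2f}")
```

A plain regression over all ranks weights the long tail heavily, so the estimated β can shift depending on how many rarely requested objects are included.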
File Popularity - Zipf Analysis • β = 0.56 • Regression fit of 0.97
File Popularity – Concentration Analysis • Not 80-20 (Pareto rule) • 68% of videos one-timer (about 13.6% of bytes) • Maybe more content? (Fig 16 broken online)
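Both numbers on this slide, the share of requests going to the most popular videos and the fraction of one-timers, can be computed from a plain list of requested video IDs. A minimal sketch, counting requests rather than bytes; the function name `concentration` and the sample log are hypothetical.

```python
from collections import Counter


def concentration(requests, top_fraction=0.2):
    """Return (share of requests to the top videos, fraction of one-timers).

    requests: iterable of video IDs, one entry per request.
    top_fraction: portion of distinct videos treated as "top" (0.2 for 80-20).
    """
    requests = list(requests)
    if not requests:
        raise ValueError("no requests")
    ranked = sorted(Counter(requests).values(), reverse=True)
    n_top = max(1, round(len(ranked) * top_fraction))
    top_share = sum(ranked[:n_top]) / len(requests)
    one_timers = sum(1 for c in ranked if c == 1) / len(ranked)
    return top_share, one_timers


if __name__ == "__main__":
    # Hypothetical log: 10 requests over 5 distinct videos.
    log = ["a"] * 5 + ["b"] * 2 + ["c", "d", "e"]
    print(concentration(log))
```

Under an 80-20 workload the first value would be near 0.8; a lower value with a high one-timer fraction indicates the flatter, long-tailed popularity reported here.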
Working Set Analysis • 10% of videos same as previous day • By end, about 600k total versus 320k unique • Potential savings of factor of 2 in bw (about 3.19 Tbytes) • About ½ of global popular videos viewed on campus • But only make up about 1% of videos on campus, maybe because referred to others by friends
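The total-versus-unique comparison is effectively an upper bound on what a cache could save. The sketch below replays a trace through an idealized cache with unlimited capacity that never evicts and assumes every transfer is complete; those simplifications, the function name, and the sample trace are assumptions, not the paper's method.

```python
def infinite_cache_savings(transfers):
    """Replay (video_id, size_bytes) pairs through an unbounded cache.

    Returns (total_bytes, miss_bytes, byte_hit_ratio). Only the first
    request for each video counts as a miss; every repeat is a hit.
    """
    seen = set()
    total_bytes = 0
    miss_bytes = 0
    for video_id, size_bytes in transfers:
        total_bytes += size_bytes
        if video_id not in seen:
            seen.add(video_id)
            miss_bytes += size_bytes
    hit_ratio = (total_bytes - miss_bytes) / total_bytes if total_bytes else 0.0
    return total_bytes, miss_bytes, hit_ratio


if __name__ == "__main__":
    # Hypothetical trace: video "a" requested three times, "b" once.
    trace = [("a", 10), ("b", 20), ("a", 10), ("a", 10)]
    print(infinite_cache_savings(trace))
```

A real cache has finite space and sees interrupted transfers, so achievable savings would be lower than this bound.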
Discussion – Web 2.0 • Large multimedia content demands storage and bandwidth • YouTube grows 19.5 TB per month! • Caching and CDNs can still help • 4.6% of campus traffic YouTube • But long-tail (many viewed once) • Servers need multi-core/threads (long transactions) and lots of memory
Conclusions • Web 2.0 demands better understanding to plan, build and deploy better systems • Examine YouTube, locally and globally • Find: • Many similarities with Web 1.0 • Access patterns correlate with time of day, day of week, month • Video files larger, some more popular • Caching can help • Videos longer (order of magnitude) • Access not 80-20
Future Work • More powerful monitor (no “gapped” transactions) • Decompose all traffic: MySpace, Flickr,… • Audio? • Other video sites?