This study examines characteristics of access in enterprise media servers for better service provisioning. It explores access patterns, QoS metrics extraction, locality properties, evolution of site content, and more using MediaMetrics. Data from HP Corporate Media Solutions and HPLabs servers is analyzed to understand sessions, encoding rates, interactivity, and bandwidth trends.
Characterizing Locality, Evolution, and Life Span of Accesses in Enterprise Media Server Workloads Ludmila Cherkasova and Minaxi Gupta Hewlett-Packard Labs
Introduction • Streaming media – a new wave of rich Internet content • Video is popular for: • News • Sports • Entertainment • Education • Training • Enterprise media servers: • Online advertisement • Web marketing • Customer interaction centers • Collaboration • Training
Challenges • Streaming media delivery challenges: • Real time • High bandwidth • Large amounts of storage • Sensitivity to network congestion • Understanding the nature of media server workloads is crucial for properly provisioning current and future services
Related Work • Studies of educational workloads • Non-streaming multimedia stored on web servers (Acharya et al., 1998) • mMod (multicast Media on demand) with a mix of educational and entertainment content (Acharya et al., 2000) • eTeach and BIBS (Almeida et al., 2001) • Media proxy analysis (University of Washington, Chesire et al., 2001) • Results showed very little locality: 78% of files are accessed once
Goals of Our Study • Characterize access patterns for enterprise media servers • Extract some QoS related metrics for media sites (from the logs) • Characterize locality properties and compare them with traditional web workloads characterization • Characterize evolution of site content and rate of changes on the site • Two new metrics: new files impact and life span • Characterize dynamics of the sites and growth trends • Design a tool (MediaMetrics) for service providers
Data Collection Sites • HP Corporate Media Solutions server (HPC), for over 2.5 years: November, 1998 to April, 2001 (Windows Media Server) • Video coverage of major events • Keynote speeches, addresses, and presentations • Meetings with industry analysts • Promotional events and product introductions • Demos of product usage • HPLabs Media Server (HPLabs), for 1 year 9 months: July, 1999 to April, 2001 (RealServer G2), internal server • Coffee talks, prominent presentations, seminars, meetings • Cooltown videos • HP-wide business events, etc.
Media Server Log Formats • Media access logs record information about all requests and responses processed by the media server • Windows Media Server and RealServer G2 have different log formats • Typical (common) fields: • Client IP address • Timestamp of the request • File name of the requested video • The advertised duration of the video (in sec) • The size of the requested file (in bytes) • The elapsed time of the requested media file when the play ended • The average bandwidth available to a client in Kb/sec (during the session) • Number of bytes sent by the server • Number of bytes received by the client, etc.
Media Sessions • Clients can pause, rewind, fast forward, skip using a slide bar • A session is a sequence of client requests corresponding to the same file access • Windows Media Server Logs contain a separate entry for each client request (a session = multiple requests) • RealServer log did not have this information
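The grouping of per-request log entries into sessions can be sketched as follows. This is only an illustration: the slides do not say how session boundaries are detected, so the inactivity cutoff `max_gap` and the `(client_ip, file_name, timestamp_sec)` record layout are assumptions made for this example.

```python
from collections import defaultdict

def group_sessions(requests, max_gap=1800):
    """Group (client_ip, file_name, timestamp_sec) records into sessions.

    Assumption (not from the slides): requests by the same client to the
    same file belong to one session unless separated by more than
    max_gap seconds of inactivity.
    """
    by_key = defaultdict(list)
    for client, fname, ts in requests:
        by_key[(client, fname)].append(ts)

    sessions = []
    for (client, fname), stamps in by_key.items():
        stamps.sort()
        current = [stamps[0]]
        for ts in stamps[1:]:
            if ts - current[-1] > max_gap:
                # Gap too long: close the current session and start a new one.
                sessions.append({"client": client, "file": fname, "requests": current})
                current = [ts]
            else:
                current.append(ts)
        sessions.append({"client": client, "file": fname, "requests": current})
    return sessions
```

A session with more than one request is then a candidate "interactive" session (pause, rewind, fast forward, or slide-bar skip).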
Summary Statistics In HPC, 471 files corresponded to live streams: we excluded them from further analysis
Files and Session Characteristics Distribution of stored videos and percentage of corresponding client accesses to those files 42% - short videos (less than 10 min) 23% - medium video group (10-30 min) 34% - long video (longer than 30 min)
11% - short videos (less than 10 min) 10% - medium video group (10-30 min) 79% - long video (longer than 30 min) Interesting observation: the client accesses are almost uniformly distributed across the 6 analyzed classes for both workloads This is a very useful property for synthetic workload generation.
Session Duration Characterization 77-79% of sessions were less than 10 min 7-12% of sessions were 10-30 min long 6-13% of sessions longer than 30 min. In spite of a significant difference in the type of content for both workloads (in terms of file duration distribution) the client viewing behaviors were almost identical for both workloads: browsing nature of client behavior
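The three duration groups used on these slides (less than 10 min, 10-30 min, longer than 30 min) can be computed from per-session durations roughly as below. The function names and the input format (a list of durations in seconds) are hypothetical and chosen only for this sketch.

```python
from collections import Counter

def duration_class(seconds):
    """Map a duration in seconds to the short/medium/long groups."""
    minutes = seconds / 60
    if minutes < 10:
        return "short"
    if minutes <= 30:
        return "medium"
    return "long"

def class_shares(durations):
    """Return the percentage of durations falling into each group."""
    counts = Counter(duration_class(d) for d in durations)
    total = sum(counts.values())
    return {c: 100.0 * counts[c] / total for c in ("short", "medium", "long")}
```

The same helper applies both to advertised file durations and to observed session durations, which makes the two distributions directly comparable.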
Client Interactivity Percentage of sessions with interactive requests for different file size classes. 99.9% of sessions with interactive requests were high-bandwidth sessions with available bandwidth greater than 56 Kb/s 15.3% of interactivity for short sessions, 22.6% - for medium sessions, 62.2% of sessions - for long sessions.
Encoding Rates and Available Bandwidth 59% of files are encoded at 56Kb/s and lower. In 1999: 1.7% of the files were encoded at a rate between 128-256Kb/s. In 2001: 27.8% of the files were encoded at a rate between 128-256Kb/s. For most of the files, the encoding rate and the corresponding average bandwidth available to the user show a good alignment.
67% of the files are encoded at 256Kb/s and higher. The gap between the demand and the available bandwidth per session is very high. The information provided by MediaMetrics could be used by service providers for choosing the right encoding rates.
Completed and Aborted Sessions • Completed sessions: • 29% for HPC • 12.6% for HPLabs • However, the available bandwidth did not differ much between completed and aborted sessions. • Most of the aborted sessions accessed initial segments of media files. • Incomplete sessions accessing any other segment (other than the beginning): • 1.5% in the short video group • 2.4% in the medium video group • 4-7% in the long video group
QoS Related Observations • Media access logs report • Number of bytes sent by the server • Number of bytes received by the client • MediaMetrics estimates the percentage of bytes lost during the file transfer to implicitly judge the QoS observed by the client • The lost-bytes estimate produces useful results when data is transmitted over UDP (the HPC server uses UDP, the HPLabs server uses TCP) • It might be less accurate for data transmitted over TCP: • in the presence of congestion, the media server will retransmit part of the data to compensate for lost packets • a difference between server-sent bytes and client-received bytes does not always result in worse QoS (due to buffering on the client side) • Two groups of media sessions • low-bandwidth sessions (with available bandwidth less than 56 Kb/s) • high-bandwidth sessions (with available bandwidth greater than 56 Kb/s)
QoS Related Observations • HPC had 61% of high-bandwidth sessions • HPLabs had 23% of high-bandwidth sessions • High-bandwidth sessions transferred 4-6 times more bytes • HPC workload : QoS observed by low- and high-bandwidth sessions was practically the same: • 96.5% of low-bandwidth sessions had 0-5% of bytes loss per session • 97.1% of high-bandwidth sessions had 0-5% of bytes loss per session • HPLabs workload QoS : • 64.6% of low-bandwidth sessions had 0-5% of bytes loss per session • 88.8% of high-bandwidth sessions had 0-5% of bytes loss per session • It stresses the essential role of available bandwidth for media sessions over TCP
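A minimal sketch of the byte-loss estimate, assuming it is computed per session as the difference between bytes sent by the server and bytes received by the client, relative to bytes sent. The slides name the two log fields but not the exact formula, so this formula and the function names are assumptions.

```python
def byte_loss_percent(bytes_sent, bytes_received):
    """Estimated percentage of bytes lost in one session (assumed formula)."""
    if bytes_sent <= 0:
        return 0.0
    lost = max(bytes_sent - bytes_received, 0)
    return 100.0 * lost / bytes_sent

def share_with_low_loss(sessions, threshold=5.0):
    """Percentage of sessions whose estimated loss is within threshold percent.

    sessions: list of (bytes_sent, bytes_received) pairs.
    """
    low = sum(1 for sent, recv in sessions
              if byte_loss_percent(sent, recv) <= threshold)
    return 100.0 * low / len(sessions)
```

Running `share_with_low_loss` separately on low-bandwidth and high-bandwidth sessions gives figures of the same kind as the "0-5% of bytes loss per session" percentages reported above.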
Locality Characterization Locality invariant for web server workloads: 10% of most popular files account for 90% of all requests and 90% of all bytes transferred HPC: 90% of media sessions target 14% of the files HPLabs: 90% of media sessions target 30% of the files HPC: sessions to 14% of most popular files transfer 94% of bytes HPLabs: sessions to 30% of most popular files transfer 92% of bytes Conclusion: locality invariant is applicable for media workloads too! HPC: 14% of the most popular files are accessed by 96% of clients HPLabs: 30% of the most popular files are accessed by 97% of clients
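The locality figures above ("90% of media sessions target 14% of the files") can be reproduced from a list of per-session file names along these lines. The function name `core_files` and the input format are hypothetical; this is a sketch, not the MediaMetrics implementation.

```python
from collections import Counter

def core_files(session_files, coverage_pct=90):
    """Smallest set of most popular files covering coverage_pct of sessions.

    session_files: one file name per session.
    Returns (list of core files, fraction of all accessed files they represent).
    """
    counts = Counter(session_files)
    total = sum(counts.values())
    covered = 0
    core = []
    for fname, n in counts.most_common():
        # Integer comparison avoids floating-point surprises at the boundary.
        if covered * 100 >= coverage_pct * total:
            break
        core.append(fname)
        covered += n
    return core, len(core) / len(counts)
```

Summing bytes transferred or counting unique clients over the returned core set gives the byte and client shares quoted on the same slide.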
Locality from System Resource Usage Angle Let us define the active storage set as the combined size of all the media files accessed in the logs. 80% to 88% of sessions are to files that constitute only 20% of the active storage set. 82% to 92% of all transferred bytes are due to "most popular" files that constitute only 20% of the active storage set. These normalized metrics are useful for estimating storage requirements and potential bandwidth savings when designing or applying an optimization technique.
Zipf or Not a Zipf? Zipf-like distributions were observed for web server and web proxy workloads, and were also reported in a recent study of a media proxy workload: the popularity of the i-th most popular file is proportional to 1/i^alpha. Distribution of the file access frequencies (file popularities) for the entire duration of the log: not a Zipf! Question: does it depend on log duration?
Web servers: typical value of alpha varies between 1.4 and 1.6. Web proxies: typical value of alpha is less than 1; it varies between 0.64 and 0.83. Media proxies: alpha = 0.47. HPLabs media server: six-month periods can be approximated with a Zipf-like distribution and alpha=1.6.
HPC media server: file popularity on a monthly basis can be approximated with a Zipf-like distribution and alpha=1.5. For different months, alpha varies between 1.4 and 1.6. These observations are very useful for synthetic workload generation.
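One simple way to estimate alpha for a given period, assuming the usual Zipf-like form in which the popularity of the i-th most popular file is proportional to 1/i^alpha, is a least-squares fit on the log-log rank/frequency plot. The slides do not state which fitting method was used, so the method and the function name below are assumptions.

```python
import math

def fit_zipf_alpha(access_counts):
    """Estimate alpha by least squares on log(rank) vs. log(count).

    access_counts: per-file access counts for one period (any order).
    Needs at least two files with a non-zero count.
    """
    ranked = sorted((c for c in access_counts if c > 0), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(c) for c in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The slope of the fitted line is -alpha.
    return -sxy / sxx
```

Applying the fit to monthly or six-month slices of the log, rather than to the whole log, mirrors the per-period comparison made on these slides.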
File Sharing Statistics Both workloads exhibit high degree of clients’ file sharing access pattern! HPC: 70 most popular files are accessed by more than 1000 clients, with some most popular files accessed by 10,000-12,000 clients HPLabs: 17 most popular files are accessed by 113-341 unique clients
Rarely Accessed Files Statistics • These numbers are lower than the similar statistics for web server workloads • For web server workloads, “onetimers” may account for 20% to 40% of the files and 20% to 40% of the active storage
Dynamics and Evolutions of Media Sites Burstiness Some days exhibit two orders of magnitude higher number of sessions for both workloads For enterprise web server workloads, daily traffic amount is much more predictable Studies of educational media server workloads showed less degree of burstiness, more correlated with the day of the week
New Files Impact (HPC) We define a file as new if it was never accessed before (based on the information in the access logs). Our intent: to observe the site's dynamics and evolution due to new files. The HPC site has an explicit growth trend with respect to the total number of files accessed per month, and a consistently steady number of new files added to the site monthly.
New Files Impact (HPLabs) The growth of the total number of files accessed each month for HPLabs is negative!? We asked the support team whether there was a specific reason. We suspected either that there was a significant number of files that “nobody watches”, or that the actual amount of new media content on the site had decreased over time. The team confirmed that only a limited number of new files had been added lately because of a transition plan to upgrade the entire site design and equipment. So, the negative trend was observed correctly.
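The new-files metric can be sketched directly from its definition: a file counts as new in the first month in which it appears in the access log. The record format `(month, file_name)` and the function name are assumptions made for this example.

```python
def new_files_impact(sessions):
    """Per-month counts of accessed files, new files, and sessions to new files.

    sessions: iterable of (month, file_name), with month sortable, e.g. "1999-03".
    Returns {month: (files_accessed, new_files, percent_of_sessions_to_new_files)}.
    """
    by_month = {}
    for month, fname in sessions:
        by_month.setdefault(month, []).append(fname)

    seen = set()  # files accessed in any earlier month
    report = {}
    for month in sorted(by_month):
        names = by_month[month]
        accessed = set(names)
        new = accessed - seen
        to_new = sum(1 for f in names if f in new)
        report[month] = (len(accessed), len(new), 100.0 * to_new / len(names))
        seen |= accessed
    return report
```

Note that every file in the first month of the log is counted as new, so the first month's figures overstate the effect.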
New Files Impact (Unique Clients) These graphs are again correlated with the trends of the sessions to new files! Conclusion: the number of new files added per month plays a crucial role in defining the site dynamics, evolution, and growth rates!
New Trends Over Time • Analysis of HPC workload over time revealed interesting overall trends in site media content and session characteristics • Total number of unique clients accessing media content in each 6 month duration doubled over the duration of our logs. • Total number of sessions in each 6 month duration also doubled over the duration of our logs. • Average file size in each 6 month duration increased from less than 7MB to more than 20MB in our logs. • Bytes transferred per session increased from just over 1MB to over 6MB in our logs.
New Files Impact (conclusion) • The access pattern of enterprise media servers resembles the access patterns of news web sites: most of the monthly client accesses (50-80%) target newly added information. • The dynamics of enterprise web sites exhibit much more stability: only 2% of monthly requests are to new files.
Life Span of File Accesses • Question: how much do the popularity of a file and the frequency of accesses to it change over time? • Enterprise media server workloads exhibit high locality of references: 90% of media sessions target only 14%-30% of the files • We define the core-90% as the set of most frequently accessed files that accounts for 90% of all the media sessions (it is the performance-critical set of files) • Life duration of a file: time between the first and the last accesses to this file in the considered workload.
Life Duration of the Files High percentage of short-lived files: HPC: 37% of all files live less than a month. HPLabs: 50% of all files live less than a month. 73% of the files live less than 6 months for both workloads. Only 8-10% of the files live longer than a year. Question: what is the density of accesses over time? The plotted histograms for the most frequently accessed files had a lognormal-like curve, with most accesses occurring during the first 1-3 weeks after the file's introduction.
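Life duration, defined as the time between the first and the last access to a file in the workload, can be computed as in the following sketch. Timestamps are represented as day numbers and the function names are hypothetical.

```python
def life_durations(accesses):
    """Days between first and last access for each file.

    accesses: iterable of (file_name, day_number).
    """
    first, last = {}, {}
    for fname, day in accesses:
        first[fname] = min(day, first.get(fname, day))
        last[fname] = max(day, last.get(fname, day))
    return {f: last[f] - first[f] for f in first}

def share_living_less_than(durations, days):
    """Percentage of files whose life duration is shorter than `days`."""
    short = sum(1 for d in durations.values() if d < days)
    return 100.0 * short / len(durations)
```

For example, `share_living_less_than(life_durations(log), 30)` approximates the "live less than a month" figure, treating a month as 30 days.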
Life Span Metric • Life span metric: cumulative distribution of accesses to the files since their introduction at a site. • Share of accesses by week since introduction: • First week: 52% (HPC), 51% (HPLabs) • Second week: 16% (HPC), 10% (HPLabs) • Third week: 6% (HPC), 5% (HPLabs) • 4th and 5th weeks: 3% (HPC), 1% (HPLabs) • Enterprise media servers exhibit access patterns similar to news web sites: • most accesses are to new documents, and • after a certain time period these documents are accessed very rarely
Rate of Change • Life span is a normalized metric: the files could have been individually introduced at different times. • The metric reflects the rate of change of the files during their existence at the site. • Life span metric reflects timeliness of the introduced files: • Longer life span means that information at the site is less timely and has a more consistent percentage of accesses over time. • Life span metric allows one to estimate the intensity of the client accesses to the new and existing files over a future period of time.
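The weekly shares behind the life span metric can be sketched as below. Each access is aligned to the introduction time of its own file, which is why files introduced at different dates can be combined. Here a file's introduction is approximated by its first access in the log, which is an assumption of this example; the function name is hypothetical.

```python
from collections import Counter

def life_span(accesses, weeks=5):
    """Percentage of all accesses falling in each week since file introduction.

    accesses: iterable of (file_name, day_number).
    Returns a list with one percentage per week (week 1 first).
    """
    accesses = list(accesses)
    first = {}
    for fname, day in accesses:
        first[fname] = min(day, first.get(fname, day))
    # Week index 0 covers days 0-6 after the file's first access, and so on.
    per_week = Counter((day - first[fname]) // 7 for fname, day in accesses)
    total = len(accesses)
    return [100.0 * per_week[w] / total for w in range(weeks)]
```

A running sum over the returned list gives the cumulative form of the metric.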
Conclusion • Media server access logs are an invaluable source of information about traffic access patterns and system resource requirements • MediaMetrics was specially designed for service providers and system administrators to understand the nature of traffic to their media sites • Our analysis established a set of invariants specific to enterprise media server workloads and compared them with well-known related invariants and observations for web server workloads
Acknowledgments • Both tool and study would not have been possible without media access logs and help provided by • Nic Lyons, • Wray Smallwood, • Brett Bausk, • Magnus Karlsson, • Wenting Tang, • Yun Fu, • John Apostolopoulos, and • Susie Wee. • Their help is highly appreciated.