The Nature of Datacenter Traffic: Measurements & Analysis

Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, Ronnie Chaiken. Microsoft Research. IMC, November 2009. Abhishek Ray, raya@cs.ucr.edu.

Presentation Transcript


  1. The Nature of Datacenter Traffic: Measurements & Analysis • Srikanth Kandula, Sudipta Sengupta, Albert Greenberg, Parveen Patel, Ronnie Chaiken • Microsoft Research • IMC, November 2009 • Abhishek Ray, raya@cs.ucr.edu

  2. Outline • Introduction • Data & Methodology • Application • Traffic Characteristics • Tomography • Conclusion

  3. Introduction • Datacenters are used for analysis and mining of large data sets • Processing on the order of petabytes of data • The paper describes the characteristics of datacenter traffic • Detailed view of traffic • Congestion conditions and patterns

  4. Contribution • Measurement Instrumentation • Measures traffic at the servers rather than at the switches • Traffic characteristics • Flows, congestion and rate of change of the traffic mix • Tomography Inference Accuracy • Measurements cover a cluster of 1,500 servers • 20 servers per rack • Collected over 2 months

  5. Data & Methodology • ISPs • SNMP counters • Sampled flows • Deep packet inspection • Data Center • Measurements at the servers • Servers, storage and network • Linkage of network traffic with application-level logs

  6. Socket-level events at each server • ETW – Event Tracing for Windows • One event per application read or write • Each event aggregates over several packets • http://msdn.microsoft.com/en-us/magazine/cc163437.aspx#S1

  7. ETW – Event tracing for Windows

  8. Application Workload • Jobs are written in Scope, a SQL-like programming language • Jobs run as phases of different types • Extract • Partition • Aggregate • Combine • Jobs range from short interactive programs to long-running programs

  9. Traffic Characteristics

  10. Patterns

  11. Work-seeks-bandwidth and scatter-gather patterns in datacenter traffic exchanged between server pairs

  12. Work-seeks-bandwidth: jobs are placed so that most traffic stays • Within the same server • Between servers in the same rack • Between servers in the same VLAN • Scatter-gather patterns • Data is divided into small parts and each server works on a particular part • The partial results are then aggregated

  13. How much traffic is exchanged between server pairs?

  14. Server pairs within the same rack are more likely to exchange more bytes • Probability that a server pair exchanges no traffic • 89% for servers within the same rack • 99.5% for servers in different racks
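The probabilities on this slide can be reproduced from a per-pair byte count. The following is a minimal sketch, assuming a hypothetical input format (a dict of bytes per ordered server pair and a dict mapping each server to its rack); neither the names nor the format come from the paper.

```python
from itertools import permutations


def prob_no_traffic(bytes_between, rack_of):
    """Return (P[no traffic | same rack], P[no traffic | different racks]).

    bytes_between: dict (src, dst) -> bytes exchanged (hypothetical format)
    rack_of:       dict server -> rack id (hypothetical format)
    """
    zero = {True: 0, False: 0}   # pairs with no traffic, keyed by same_rack
    total = {True: 0, False: 0}  # all ordered pairs, keyed by same_rack
    for src, dst in permutations(rack_of, 2):
        same_rack = rack_of[src] == rack_of[dst]
        total[same_rack] += 1
        if bytes_between.get((src, dst), 0) == 0:
            zero[same_rack] += 1

    def frac(key):
        return zero[key] / total[key] if total[key] else float("nan")

    return frac(True), frac(False)


# Toy example: two racks with two servers each.
rack_of = {"a": 0, "b": 0, "c": 1, "d": 1}
bytes_between = {("a", "b"): 500, ("c", "d"): 200, ("a", "c"): 10}
same, diff = prob_no_traffic(bytes_between, rack_of)
```

In the toy data, half of the same-rack pairs and seven of the eight cross-rack pairs exchange nothing, which mirrors the direction of the 89% versus 99.5% figures.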

  15. How many other servers does a server correspond with?

  16. A server either talks to almost all the other servers within its rack or to only a small fraction of them • A server either doesn't talk to servers outside its rack or talks to about 1-10% of the outside servers

  17. Congestion within the Datacenter

  18. Goal: run the network at as high a utilization as possible without adversely affecting throughput • Low network utilization indicates one of two things • The application by nature demands more of other resources, such as CPU and disk, than of the network • The application can be rewritten to make better use of the available network bandwidth

  19. Where and when the congestion happens in data center

  20. Congestion Rate • 86% of links see congestion lasting at least 10 seconds • 15% of links see congestion lasting at least 100 seconds • Short congestion periods are highly correlated across many tens of links and are due to brief spurts of high demand from the application • Long-lasting congestion periods tend to be localized to a small set of links
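One way to obtain congestion durations like these is to scan each link's utilization samples for maximal runs above a threshold. This is a minimal sketch, not the authors' tooling; the 70% threshold, the one-second sampling interval and the function name are assumptions made for illustration.

```python
def congestion_episodes(utilization, threshold=0.7, interval_s=1):
    """Return durations (in seconds) of maximal runs with utilization >= threshold.

    utilization: per-interval link utilization in [0, 1]
    threshold:   utilization treated as "congested" (assumed value)
    interval_s:  seconds covered by each sample (assumed value)
    """
    episodes, run = [], 0
    for u in utilization:
        if u >= threshold:
            run += 1                           # still inside a congested run
        elif run:
            episodes.append(run * interval_s)  # run just ended
            run = 0
    if run:                                    # trace ended while congested
        episodes.append(run * interval_s)
    return episodes


samples = [0.2, 0.8, 0.9, 0.1, 0.75, 0.75, 0.75, 0.3, 0.95]
episodes = congestion_episodes(samples)
```

A link then counts toward the "at least 10 seconds" figure if any of its episodes is 10 seconds or longer.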

  21. Length of Congestion Events

  22. Compares the rates of flows that overlap high utilization periods with the rates of all flows

  23. Impact of high utilization

  24. Read failure - Job is killed • Congestion • To attribute network traffic to the applications that generate it, the authors merge the network event logs with application-level logs that describe which job and phase were active at that time

  25. Reduce phase - Data in each partition that is present at multiple servers in the cluster has to be pulled to the server that handles the reduce for the partition • e.g. count the number of records that begin with ‘A’ • Extract phase – Extracting the data • Largest amount of data • Evaluation phase – Problem • Conclusion – High utilization epochs are caused by application demand and have a moderate negative impact on job performance
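The log merge described on this slide amounts to a join on server and time. The sketch below assumes hypothetical record layouts, (timestamp, server, bytes) for network events and (start, end, server, job, phase) for application log entries; the real log schemas are not given in the slides.

```python
def attribute_traffic(net_events, app_log):
    """Sum bytes per (job, phase) by matching each network event to the
    application log entry active on the same server at that time.

    net_events: list of (timestamp, server, nbytes)       (hypothetical layout)
    app_log:    list of (start, end, server, job, phase)   (hypothetical layout)
    """
    totals = {}
    for ts, server, nbytes in net_events:
        key = ("unattributed", None)  # default when no phase was active
        for start, end, log_server, job, phase in app_log:
            if log_server == server and start <= ts < end:
                key = (job, phase)
                break
        totals[key] = totals.get(key, 0) + nbytes
    return totals


app_log = [
    (0, 10, "s1", "job42", "extract"),
    (10, 20, "s1", "job42", "aggregate"),
]
net_events = [(3, "s1", 100), (12, "s1", 250), (15, "s1", 50), (25, "s1", 7)]
totals = attribute_traffic(net_events, app_log)
```

The linear scan keeps the example short; for real traces an interval index per server would be the obvious replacement.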

  26. Flow Characteristics

  27. Traffic mix changes frequently

  28. How traffic changes over time within the data center

  29. Change in traffic • The 10th and 90th percentiles are 37% and 149% • The median change in traffic is roughly 82% • Even when the total traffic in the matrix remains the same, the server pairs that are involved in these traffic exchanges change appreciably
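A change above 100% is possible when the change is measured entry by entry rather than on the total. The sketch below uses one such measure, the sum of absolute per-pair differences divided by the earlier total; treating this as the slide's metric is an assumption, and the function name is hypothetical.

```python
def traffic_change(tm_before, tm_after):
    """Normalized change between two traffic matrices.

    Each matrix is a dict (src, dst) -> bytes. The result is
    sum(|after - before|) over all pairs, divided by the total of `before`.
    """
    pairs = set(tm_before) | set(tm_after)
    diff = sum(abs(tm_after.get(p, 0) - tm_before.get(p, 0)) for p in pairs)
    total = sum(abs(v) for v in tm_before.values())
    return diff / total


# Total traffic is 200 bytes in both matrices, but no server pair repeats.
before = {("a", "b"): 100, ("c", "d"): 100}
after = {("a", "c"): 100, ("b", "d"): 100}
change = traffic_change(before, after)
```

Here the total is unchanged yet the measure reports 200%, because every byte moved to a different server pair.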

  30. Short bursts cause spikes at the shorter time-scale (in dashed line) that smooth out at the longer time scale (in solid line) whereas gradual changes appear conversely, smoothed out at shorter time-scales yet pronounced on the longer time-scale • Variability - key aspect for data center

  31. Inter-arrival times in the entire cluster, at Top-of-Rack switches and at servers

  32. Inter-arrivals at both servers and top-of-rack switches have periodic modes spaced apart by roughly 15 ms • This is likely due to the stop-and-go behavior of the application, which rate-limits the creation of new flows • The median arrival rate of all flows in the cluster is 10^5 flows per second, i.e. 100 flows every millisecond

  33. Tomography • Network tomography methods infer traffic matrices (TMs) from link counts • If the methods used in ISP networks are applicable to datacenters, they would help to unravel the nature of traffic • Why is it hard? • The number of origin-destination flow volumes is quadratic, n(n - 1), while the number of link measurements is much smaller • Assumptions - Gravity model - The amount of traffic a node (origin) sends to another node (destination) is proportional to the traffic volume received by the destination • Scalability
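The gravity assumption can be written down in a few lines. This is a minimal sketch of a simple gravity estimate, not the tomogravity method evaluated in the paper; excluding self-traffic and normalizing over the remaining destinations are choices made here for illustration.

```python
def gravity_estimate(sent, received):
    """Estimate a traffic matrix from per-node totals.

    sent:     dict node -> total bytes sent
    received: dict node -> total bytes received
    Node i's traffic is split across the other nodes in proportion to
    how much each of them receives (self-traffic excluded by assumption).
    """
    tm = {}
    for i, out_i in sent.items():
        others = sum(v for j, v in received.items() if j != i)
        for j, in_j in received.items():
            if j != i:
                tm[(i, j)] = out_i * in_j / others if others else 0.0
    return tm


sent = {"a": 60, "b": 30, "c": 10}
received = {"a": 10, "b": 30, "c": 60}
tm = gravity_estimate(sent, received)
```

Every off-diagonal entry comes out non-zero whenever the totals are non-zero, so an estimate of this kind is dense by construction.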

  34. Methodology • Compute the ground-truth TM from the server measurements and measure how well the TM estimated by tomography from the link counts approximates it

  35. Tomogravity and Sparsity Maximization

  36. Tomogravity - Communication is likely to be between nodes running the same job rather than between all nodes, whereas the gravity model, not being aware of these job clusters, introduces traffic across clusters, resulting in many non-zero TM entries • Sparsity maximization – Error rate starts from several hundreds

  37. Comparison of the TMs estimated by various tomography methods with the ground truth

  38. Ground-truth TMs are sparser than the TMs estimated by tomogravity, and denser than the TMs estimated by sparsity maximization
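Sparsity can be compared with a simple count of non-zero entries. The sketch below is illustrative only: the three toy matrices are made up to show the ordering described on the slide, and the fraction-of-non-zero-entries measure is an assumption rather than the paper's exact metric.

```python
def nonzero_fraction(tm, n_servers):
    """Fraction of the n(n-1) ordered server pairs that carry traffic.

    tm: dict (src, dst) -> bytes; missing pairs count as zero.
    """
    possible = n_servers * (n_servers - 1)
    nonzero = sum(1 for (src, dst), v in tm.items() if src != dst and v > 0)
    return nonzero / possible


servers = ["a", "b", "c", "d"]
# Made-up matrices, chosen only to illustrate the ordering.
ground_truth = {("a", "b"): 90, ("b", "a"): 5, ("c", "d"): 40}
dense_estimate = {(s, d): 1.0 for s in servers for d in servers if s != d}
sparse_estimate = {("a", "b"): 135}

f_truth = nonzero_fraction(ground_truth, len(servers))
f_dense = nonzero_fraction(dense_estimate, len(servers))
f_sparse = nonzero_fraction(sparse_estimate, len(servers))
```

With these inputs the ground truth sits between the two estimates, which is the relationship the slide reports.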

  39. Conclusion • Capture both • Macroscopic patterns – which servers talk to which others, when and for what reasons • Microscopic characteristics – flow durations, inter-arrival times • Tighter coupling between network, computing, and storage in datacenter applications • Congestion and negative application impact do occur, demanding improvement - better understanding of traffic and mechanisms that steer demand

  40. My Take • More data should be examined over a period of 1 year instead of 2 months • I would certainly like to see some mining of data and application running at datacenters of companies like Google, Yahoo etc

  41. Related Work • T. Benson, A. Anand, A. Akella, and M. Zhang: Understanding Datacenter Traffic Characteristics, In SIGCOMM WREN workshop, 2009. • A. Greenberg, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta: VL2: A Scalable and Flexible Data Center Network, In ACM SIGCOMM, 2009.

  42. Thank You