Measuring a (MapReduce) Data Center
Typical Data Center Network
[Diagram: IP routers at the top, aggregation switches (Agg) below them, top-of-rack (ToR) switches below those, servers at the bottom]
Top-of-rack switch: 24 or 48 ports, 1G to servers, 10Gbps up, ~$7K
Modular switch: chassis + up to 10 blades, >140 10G ports, $150K-$200K
• Less bandwidth up the hierarchy
• Clunky routing
• Proposals, e.g., VL2, BCube, FatTree, Portland, DCell
What does traffic in a datacenter look like?
Goal
• A realistic model of data center traffic
• Compare proposals
How to measure a datacenter?
(Macro-) Who talks to whom? Congestion, its impact
(Micro-) Flow details: sizes, durations, inter-arrivals, flux
How to measure?
[Diagram: router, aggregation switches, ToR switches, servers; labels: MapReduce scripts + distributed FS]
SNMP reports
• per port: in/out octets
• sample every few minutes
• miss server- or flow-level info
Packet traces
• Not native on most switches
• Hard to set up (port-spans)
Sampled NetFlow
• Tradeoff: CPU overhead on switch for detailed traces
Use the end-hosts to share load
• Auto managed already
Measured 1500 servers for several months
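To make the SNMP option concrete, here is a minimal sketch of turning two per-port octet-counter samples into an average link utilization. It assumes 64-bit counters (in the style of IF-MIB `ifHCInOctets`) and at most one wrap between samples; the function name and parameters are illustrative, not taken from the talk.

```python
COUNTER64_MODULUS = 2 ** 64  # assumption: 64-bit octet counters


def link_utilization(prev_octets, curr_octets, interval_s, link_bps,
                     modulus=COUNTER64_MODULUS):
    """Average fraction of link capacity used between two counter samples."""
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    # Counters only increase, so a smaller current value means the counter
    # wrapped once; the modulo recovers the true delta in that case.
    delta_octets = (curr_octets - prev_octets) % modulus
    return delta_octets * 8 / (interval_s * link_bps)


# 15 GB seen on a 1 Gbps port over a 5-minute polling interval.
print(link_utilization(0, 15_000_000_000, 300, 1_000_000_000))
```

Because the result is an average over the whole polling interval, a burst lasting a few seconds is smoothed away, which is one reason per-port counters miss server- and flow-level detail.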
Who Talks To Whom?
[Figure: server-to-server traffic matrix; axes: Server From, Server To; color scale: 0, .2 Kbps, 20 Kbps, 3 Mbps, .4 Gbps, 1Gbps]
• Two patterns dominate
• Most of the communication happens within racks
• Scatter, Gather
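A server-to-server matrix like this one can be assembled from end-host flow records. The sketch below is one way to do it, assuming each record is a hypothetical `(src, dst, bytes)` tuple and that a `rack_of` mapping from server to rack is available; neither format comes from the talk.

```python
from collections import defaultdict


def traffic_matrix(flows):
    """Sum bytes per (source server, destination server) pair."""
    matrix = defaultdict(int)
    for src, dst, nbytes in flows:
        matrix[(src, dst)] += nbytes
    return dict(matrix)


def within_rack_fraction(matrix, rack_of):
    """Fraction of all bytes exchanged between servers in the same rack."""
    total = sum(matrix.values())
    if total == 0:
        return 0.0
    local = sum(nbytes for (src, dst), nbytes in matrix.items()
                if rack_of[src] == rack_of[dst])
    return local / total


# Toy data: servers a and b share rack 0, server c sits in rack 1.
flows = [("a", "b", 600), ("a", "c", 300), ("b", "a", 100)]
rack_of = {"a": 0, "b": 0, "c": 1}
print(within_rack_fraction(traffic_matrix(flows), rack_of))
```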
Flows…
• are small: 80% of bytes in flows < 200MB
• are short-lived: 50% of bytes in flows < 25s
• turn over quickly: median inter-arrival at ToR = 10^-2 s
which lead to…
• Traffic Engineering schemes should react faster, few elephants
• Localized traffic → additional bandwidth alleviates hotspots
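The statistics on this slide are byte-weighted: they ask what share of bytes travels in flows below a size or duration threshold, not what share of flows. A small sketch of both kinds of computation follows, using hypothetical helper names and toy inputs rather than the measured data.

```python
from statistics import median


def byte_fraction_below(flows, threshold):
    """Share of total bytes carried by flows whose attribute is < threshold.

    `flows` is a list of (attribute, bytes) pairs, where the attribute is
    the flow's size or its duration, depending on the question asked.
    """
    total = sum(nbytes for _, nbytes in flows)
    if total == 0:
        return 0.0
    below = sum(nbytes for attr, nbytes in flows if attr < threshold)
    return below / total


def median_inter_arrival(start_times):
    """Median gap between consecutive flow start times."""
    ordered = sorted(start_times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:
        raise ValueError("need at least two flow start times")
    return median(gaps)


# Toy example: (duration in seconds, bytes) per flow.
by_duration = [(5, 100), (20, 300), (60, 600)]
print(byte_fraction_below(by_duration, 25))
# Flow start times in milliseconds.
print(median_inter_arrival([0, 10, 30, 40]))
```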
Congestion, its Impact
• are links busy? Often!
• who are the culprits? Apps (Extract, Reduce)
• are apps impacted? Marginally
[Figure: contiguous duration of >70% link utilization (seconds); vertical axis 0 to 1]
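The "contiguous duration of >70% link utilization" metric can be computed from a utilization time series by measuring runs of consecutive samples above the threshold. The following is a minimal sketch, assuming evenly spaced samples; the function name and sampling period are illustrative.

```python
def congestion_episodes(utilization, threshold=0.7, sample_period_s=1.0):
    """Durations (seconds) of each contiguous run with utilization > threshold."""
    episodes = []
    run = 0  # number of consecutive samples above the threshold so far
    for u in utilization:
        if u > threshold:
            run += 1
        elif run:
            episodes.append(run * sample_period_s)
            run = 0
    if run:  # the series ended while the link was still busy
        episodes.append(run * sample_period_s)
    return episodes


samples = [0.2, 0.8, 0.9, 0.5, 0.75, 0.71, 0.95, 0.1, 0.8]
print(congestion_episodes(samples))  # [2.0, 3.0, 1.0]
```

Collecting these durations across all links gives the data behind a plot like the one on this slide.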
Measurement Alternatives
Link utilizations (e.g., from SNMP) → Tomography → Server-to-server traffic matrix
+ make do with easier-to-measure data
- under-constrained problem → heuristics
• gravity
• max sparse
• tomography + job information
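The slide names the gravity heuristic without spelling it out. The sketch below uses the textbook form, in which traffic from i to j is estimated as proportional to the total sent by i times the total received by j. It assumes per-endpoint send and receive totals are already known from link counters; it is not necessarily the exact variant evaluated in the talk.

```python
def gravity_estimate(out_bytes, in_bytes):
    """Estimate a traffic matrix from per-endpoint totals.

    T[i, j] = out_bytes[i] * in_bytes[j] / total, the standard gravity form.
    Note that this simple version also assigns traffic to (i, i) pairs.
    """
    total = sum(in_bytes.values())
    if total == 0:
        return {(src, dst): 0.0 for src in out_bytes for dst in in_bytes}
    return {(src, dst): out_bytes[src] * in_bytes[dst] / total
            for src in out_bytes for dst in in_bytes}


out_bytes = {"a": 60, "b": 40}
in_bytes = {"a": 50, "b": 50}
estimate = gravity_estimate(out_bytes, in_bytes)
print(estimate[("a", "b")])  # 30.0
```

By construction the estimate gives every pair a nonzero share whenever both totals are nonzero, so it spreads traffic across all pairs instead of concentrating it on a few.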
A first look at traffic in a (map-reduce) data center
• some insights
• traffic stays mostly within high-bandwidth regions
• flows are small, short-lived, and turn over quickly
• network is often highly utilized, with moderate impact on apps
• measuring @ end-hosts is feasible, necessary (?)
• → a model for data center traffic