Measuring and monitoring Microsoft’s enterprise network Richard Mortier (mort), Rebecca Isaacs, Laurent Massoulié, Peter Key
We monitored our network… …and this is how… …and this is what we saw… • How did we monitor it? • What did we see?
Microsoft CorpNet @ MSR Cambridge • [Figure: network diagram; labels: CORPNET, EMEA, NorthAmerica, LatinAmerica, AsiaPacific, MSRC, Area 0, area1, area2, area3, eBGP]
Capture setup • MSRC site organized using IP subnets • Roughly one per wing plus one for datacenter • Datacenter is by far the most active • Captured using VLAN spanning • 1:1 mapping between (Ethernet) VLAN and IP subnet • Mapped all VLANs to one port (NS trace)… • …except datacenter, mapped to second port (DC trace) • Also took a capture at one VLAN’s Ethernet switch • Allowed us to estimate amount of traffic not captured • >99% traffic is routed (i.e. goes ‘off-VLAN’) • Missed printer, some subnet broadcast, some SMB
Packet processing • Assigned packets to applications • Used port numbers, RPC GUID, signature byte strings, server name • Assigned applications to categories • ~40 applications, ~10 categories • Generated packet and flow records • Reduce disk IO, increase performance • Still took ~10 days per complete run • Python scripts processed records
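A minimal sketch of the two-stage mapping described above (packet to application, application to category), using port numbers only. The tables `PORT_TO_APP` and `APP_TO_CATEGORY` and the category names are illustrative assumptions, not the mapping used in the study, which also drew on RPC GUIDs, signature byte strings, and server names.

```python
# Hypothetical lookup tables for illustration only; the study used
# ~40 applications and ~10 categories that are not listed in the slides.
PORT_TO_APP = {25: "smtp", 53: "dns", 80: "http", 443: "https", 445: "smb"}
APP_TO_CATEGORY = {
    "smtp": "email",
    "dns": "name-service",
    "http": "web",
    "https": "web",
    "smb": "file",
}


def classify_app(src_port: int, dst_port: int) -> str:
    """Return an application label from well-known ports, or 'unknown'."""
    # Check the destination port first, then the source port, so that
    # both request and response packets map to the same application.
    for port in (dst_port, src_port):
        app = PORT_TO_APP.get(port)
        if app is not None:
            return app
    return "unknown"


def classify_category(app: str) -> str:
    """Map an application label to its (assumed) category."""
    return APP_TO_CATEGORY.get(app, "other")
```

Port-only matching fails for dynamically assigned ports such as RPC endpoints, which is presumably why the additional signals listed on the slide were needed.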
Problems with this setup • Duplication • No DC switch: some hosts directly connected to router • See their packets twice (on the way in and out) • Deduplicate both traces; careful selection from NS trace • IPSec transport mode deployment • Packet encapsulated in shim header plus trailer • IP protocol moved into trailer and header rewritten • Wrote custom capture tools to unpick encapsulation • Flow detection • Network flow ≠ transport flow ≠ application flow • Used IP 5-tuple and timeout = 90 seconds
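A minimal sketch of flow detection with an IP 5-tuple key and a 90-second timeout, as named on the slide. It assumes the timeout is an inactivity timeout, that packets arrive in time order, and that flows are unidirectional; the `Packet` record and the `assign_flows` function are hypothetical names, not the authors' tools.

```python
from collections import namedtuple

# Hypothetical packet record: timestamp (seconds), addresses, ports,
# IP protocol number, and size in bytes.
Packet = namedtuple("Packet", "ts src dst sport dport proto size")

FLOW_TIMEOUT = 90.0  # seconds, as stated on the slide


def assign_flows(packets, timeout=FLOW_TIMEOUT):
    """Group time-ordered packets into flows keyed by IP 5-tuple.

    A packet starts a new flow if its 5-tuple has not been seen, or if
    the previous packet with that 5-tuple is more than `timeout` old.
    """
    active = {}  # 5-tuple -> index of the most recent flow for that key
    flows = []
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        idx = active.get(key)
        if idx is None or p.ts - flows[idx]["last"] > timeout:
            flows.append(
                {"key": key, "first": p.ts, "last": p.ts, "packets": 0, "bytes": 0}
            )
            idx = len(flows) - 1
            active[key] = idx
        flow = flows[idx]
        flow["last"] = p.ts
        flow["packets"] += 1
        flow["bytes"] += p.size
    return flows
```

With this keying, the two directions of a TCP connection count as two flows, which is one reason a network flow differs from a transport flow or an application flow.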
• # flows ≈ # src ports suggests client behaviour • Flows using few src ports suggests server behaviour • Neither client nor server suggests peer-to-peer
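The heuristic above can be sketched as a ratio test on per-host counts. The thresholds `client_ratio` and `server_ratio` are illustrative assumptions; the slides give no numeric cut-offs.

```python
def classify_host(num_flows, num_src_ports, client_ratio=0.8, server_ratio=0.2):
    """Label a host by how many distinct source ports its flows use.

    Thresholds are hypothetical: a ratio near 1 (about one source port
    per flow) suggests a client, a ratio near 0 (few source ports shared
    by many flows) suggests a server, anything else peer-to-peer.
    """
    if num_flows <= 0:
        raise ValueError("num_flows must be positive")
    ratio = num_src_ports / num_flows
    if ratio >= client_ratio:
        return "client"
    if ratio <= server_ratio:
        return "server"
    return "peer-to-peer"
```

For example, a host with 100 flows over 98 distinct source ports would be labelled a client, while 100 flows over 3 source ports would be labelled a server.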
Traffic dynamics • Headlines: seasonal, highly volatile • Examine through • Autocorrelations • Variation per-application per-hour • Variation per-application per-host • Variation in heavy-hitter set
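As a rough illustration of the autocorrelation analysis, the sketch below computes the standard sample autocorrelation of a traffic timeseries at a given lag. The function name and the toy series are assumptions; seasonal traffic would show up as peaks at lags matching the period (for example, 24 for hourly data with a daily cycle).

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at `lag` (biased estimator)."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    if var == 0:
        raise ValueError("series is constant; autocorrelation undefined")
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


# Toy series with period 2: correlation is positive at lag 2, negative at lag 1.
toy = [1, 0, 1, 0, 1, 0, 1, 0]
acf_lag1 = autocorrelation(toy, 1)
acf_lag2 = autocorrelation(toy, 2)
```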
Variation per-application per-hour • Onsite (left) • Offsite (down) • Exponential decay • Light-tailed
Variation per-application per-host • Onsite (left) • Offsite (down) • Linear decay • Heavy-tailed • Heavy hitters
Implications for modelling • Timeseries modelling is hard • Tried ARMA, ARIMA models but per-application only • Exponentiation leads to large errors in forecasting • Client/server distinction unclear • Tried PCA, “projection pursuit method” • Neither found anything • PCA discovered singleton clusters in rank order...
Implications for endsystem measurement • Heavy hitter tracking a useful approach for network monitoring • Must be dynamic since heavy hitter set varies • between applications and • over time per-application • …but is it possible to define a baseline against which to detect (volume) anomalies?
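One way to make the "must be dynamic" point concrete is to recompute the heavy-hitter set per interval and measure how much it changes. The sketch below is an assumption about how this might be done, not the authors' method: it defines heavy hitters as the smallest set of top hosts covering a given fraction of bytes (`fraction` is a hypothetical parameter) and compares successive sets with the Jaccard index.

```python
def heavy_hitters(bytes_by_host, fraction=0.8):
    """Smallest set of top hosts whose traffic covers `fraction` of the total."""
    total = sum(bytes_by_host.values())
    hitters, covered = set(), 0
    ranked = sorted(bytes_by_host.items(), key=lambda kv: kv[1], reverse=True)
    for host, volume in ranked:
        if covered >= fraction * total:
            break
        hitters.add(host)
        covered += volume
    return hitters


def jaccard(a, b):
    """Jaccard similarity of two sets; 1.0 means identical sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical per-host byte counts for two consecutive intervals.
hour1 = {"h1": 700, "h2": 200, "h3": 50, "h4": 50}
hour2 = {"h1": 150, "h2": 50, "h3": 900, "h4": 100}
similarity = jaccard(heavy_hitters(hour1), heavy_hitters(hour2))
```

A low similarity between consecutive intervals indicates a volatile heavy-hitter set, which is what makes a fixed baseline hard to define.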