Internet Measurement. CS 7260 Nick Feamster February 13, 2006. Administrivia. Project meetings: today and tomorrow (next Monday as needed) New problem set out Wednesday 3-4 (shorter) problems, 9 days. Today’s Lecture. Why measure? Troubleshooting and debugging
Internet Measurement • CS 7260 • Nick Feamster • February 13, 2006
Administrivia • Project meetings: today and tomorrow (next Monday as needed) • New problem set out Wednesday • 3-4 (shorter) problems, 9 days
Today’s Lecture • Why measure? • Troubleshooting and debugging • Incorporation into existing protocols • Performance monitoring • Science • What to measure? • Routing, traffic volumes, application-level statistics, paths, etc. • Pitfalls • Measurement wishlist
Traffic: Why Measure? • Billing • Measure usage on links to/from customers • Traffic engineering and capacity planning • Measure the traffic matrix (i.e., offered load) • Tune routing protocol or add new capacity • Troubleshooting • Anomaly detection (Lecture 12, next week)
Traffic Engineering/Capacity Planning • Need to estimate offered traffic loads • (Figure labels: BGP policy configuration, topology, eBGP routes, offered traffic, BGP routing model, flow of traffic through the network)
Billing for Internet Usage • 95th Percentile billing • Customer network pays for “committed information rate” (CIR) • Throughput measured every 5 minutes (typically with SNMP; flow statistics also can be used for billing) • Customer billed based on 95th percentile
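The billing rule above can be sketched in a few lines. This is a minimal illustration, not any provider's actual procedure: it assumes the nearest-rank percentile method and a hypothetical list of 5-minute throughput samples in Mbps; the function name `percentile_95` is made up for this example.

```python
def percentile_95(samples_mbps):
    """Return the 95th-percentile sample using the nearest-rank method."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Nearest rank: smallest value with at least 95% of samples at or below it.
    # Integer arithmetic computes ceil(0.95 * n) without floating-point error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical billing period: 100 samples, mostly 10 Mbps with five bursts.
samples = [10.0] * 95 + [80.0] * 5
billable_rate = percentile_95(samples)
print(billable_rate)
```

With these made-up samples the five bursts fall in the top 5% and do not affect the billed rate; a sixth burst sample would.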
Passive vs. Active Measurement • Passive Measurement: Collection of packets and flow statistics for traffic that is already flowing on the network • Packet traces • Flow statistics • Application-level logs • Active Measurement: Inject “probing” traffic to measure various characteristics • Traceroute • Ping • Application-level probes (e.g., Web downloads)
Passive Traffic Data Measurement • SNMP byte/packet counts: everywhere • Packet monitoring: selected locations • Flow monitoring: typically at edges (if possible) • Direct computation of the traffic matrix • Input to denial-of-service attack detection • Deep Packet Inspection: also at edge, where possible
Simple Network Management Protocol • Management Information Base (MIB) • Information store • Unique variables named by OIDs • Accessed with SNMP • Specific MIBs for byte/packet counts (per link) • (Figure labels: SNMP manager, agent, managed objects, DB)
SNMP (Passive) • Advantage: ubiquitous • Supported on all networking equipment • Multiple products for polling and analyzing data • Disadvantages: see Lecture 6 • Coarse granularity • Cannot express complex queries on the data • Unreliable delivery of the data using UDP • Utility • Link utilization (billing) • Traffic matrix inference
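As a rough sketch of how per-link byte counts turn into link utilization: a poller reads an octet counter twice and divides the difference by the polling interval. The function below assumes a 32-bit counter that wraps at most once between polls; the poll values, interval, and link speed are hypothetical, and the SNMP polling itself is not shown.

```python
def link_utilization(octets_t0, octets_t1, interval_s, link_bps, counter_bits=32):
    """Estimate average utilization between two polls of an octet counter."""
    modulus = 1 << counter_bits
    # Modular subtraction gives the right delta across a single counter wrap.
    delta_octets = (octets_t1 - octets_t0) % modulus
    return (delta_octets * 8) / (interval_s * link_bps)

# Hypothetical polls 300 s apart on a 1 Gbps link; the counter wrapped once.
util = link_utilization(4_000_000_000, 3_455_032_704, 300, 1_000_000_000)
print(f"{util:.1%}")
```

The result is only an average over the whole polling interval, which is the coarse-granularity limitation noted above.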
Packet-level Monitoring • Passive monitoring to collect full packet contents (or at least headers) • Advantages: lots of detailed information • Precise timing information • Information in packet headers • Disadvantages: overhead • Hard to keep up with high-speed links • Often requires a separate monitoring device
Full Packet Capture (Passive) • Example: Georgia Tech OC3Mon • Rack-mounted PC • Optical splitter • Data Acquisition and Generation (DAG) card • Source: endace.com
Packet-Level Monitoring http://optimusprime.noc.gatech.edu/current/ • Fine-grained timing and application-level information (port numbers, TCP flags, URLs, etc.)
Port-level information Example: Traffic to and from CoC • Note: lots of port 22 traffic (different from other GT subnets)
Detailed Traffic Accounting • Traffic monitoring by host • Daily summaries
Host              In (MB)     Out (MB)    Total (MB)
199.77.129.53     26319.49    27064.46    53383.95
199.77.128.194     2278.77    18858.23    21137.00
199.77.129.73      5176.44    15797.16    20973.60
199.77.128.193     1321.03    11799.69    13120.71
(Figure annotations: Planetlab, Spamhaus)
Traffic Flow Statistics • Flow monitoring (e.g., Cisco Netflow) • Statistics about groups of related packets (e.g., same IP/TCP headers and close in time) • Recording header information, counts, and time • More detail than SNMP, less overhead than packet capture • Typically implemented directly on line card
What is a flow? • Source IP address • Destination IP address • Source port • Destination port • Layer 3 protocol type • TOS byte (DSCP) • Input logical interface (ifIndex)
Cisco Netflow • Basic output: “Flow record” • Most common version is v5 • Current version (9) is being standardized in the IETF (template-based) • More flexible record format • Much easier to add new flow record types • Export packets go from routers in the core network to a collector (PC) for collection and aggregation • Approximately 1500 bytes, 20-50 flow records • Sent more frequently if traffic increases
Flow Record Contents • Basic information about the flow • Source and destination IP address and port • Packet and byte counts • Start and end times • ToS, TCP flags • Plus, information related to routing • Next-hop IP address • Source and destination AS • Source and destination prefix
Aggregating Packets into Flows • Criterion 1: Set of packets that “belong together” • Source/destination IP addresses and port numbers • Same protocol, ToS bits, … • Same input/output interfaces at a router (if known) • Criterion 2: Packets that are “close” together in time • Maximum inter-packet spacing (e.g., 15 sec, 30 sec) • Example: flows 2 and 4 are different flows due to time • (Figure: timeline of packets grouped into flows 1 through 4)
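The two criteria can be sketched as a small aggregation routine. This is an illustrative simplification, not how a router implements it: the `Packet` type and field names are hypothetical, the key is just the 5-tuple (ToS and interfaces are omitted), and packets are assumed to arrive in timestamp order.

```python
from collections import namedtuple

# Hypothetical packet summary; a real trace would carry more header fields.
Packet = namedtuple("Packet", "ts src dst sport dport proto size")

def aggregate(packets, inactive_timeout=15.0):
    """Group time-ordered packets into flows by 5-tuple and inter-packet gap."""
    active = {}    # flow key -> record currently being updated
    finished = []  # records closed by a gap longer than the timeout
    for p in packets:
        key = (p.src, p.dst, p.sport, p.dport, p.proto)
        rec = active.get(key)
        # Criterion 2: a long silence ends the old flow and starts a new one.
        if rec is not None and p.ts - rec["end"] > inactive_timeout:
            finished.append(rec)
            rec = None
        if rec is None:
            rec = {"key": key, "start": p.ts, "end": p.ts, "packets": 0, "bytes": 0}
            active[key] = rec
        rec["end"] = p.ts
        rec["packets"] += 1
        rec["bytes"] += p.size
    return finished + list(active.values())

pkts = [
    Packet(0.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 100),
    Packet(3.0, "10.0.0.3", "10.0.0.2", 5555, 22, 6, 60),
    Packet(5.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 1500),
    Packet(40.0, "10.0.0.1", "10.0.0.2", 1234, 80, 6, 40),
]
flows = aggregate(pkts)
print(len(flows))
```

In this made-up trace the first and last packets share a 5-tuple but end up in different flows because of the 35-second gap, mirroring the "different flows due to time" example.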
Netflow Processing • Create and update flows in NetFlow cache • Expiration • Inactive timer expired (15 sec is default) • Active timer expired (30 min (1800 sec) is default) • NetFlow cache is full (oldest flows are expired) • RST or FIN TCP flag • Aggregation? • No: non-aggregated flows, export version 5 or 9 • Yes (e.g., protocol-port aggregation scheme): aggregated flows, export version 8 or 9 • Export version • Transport protocol • Export packet: header plus payload (flows)
Reducing Measurement Overhead • Filtering: on interface • destination prefix for a customer • port number for an application (e.g., 80 for Web) • Sampling: before insertion into flow cache • Random, deterministic, or hash-based sampling • 1-out-of-n or stratified based on packet/flow size • Two types: packet-level and flow-level • Aggregation: after cache eviction • packets/flows with same next-hop AS • packets/flows destined to a particular service
Packet Sampling • Packet sampling before flow creation (Sampled Netflow) • 1-out-of-m sampling of individual packets (e.g., m=100) • Creation of flow records over the sampled packets • Reducing overhead • Avoid per-packet overhead on (m-1)/m packets • Avoid creating records for a large number of small flows • Increasing overhead (in some cases) • May split some long transfers into multiple flow records • … due to larger time gaps between successive packets • (Figure: timeline in which packets that are not sampled leave a gap longer than the timeout, producing two flows)
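A toy simulation may help show why sampling can split one transfer into several flow records. It assumes deterministic 1-out-of-m sampling, a single flow key, a 15-second inactivity timeout, and a hypothetical transfer sending one packet per second; the helper names are made up for this sketch.

```python
def sample_one_in_m(timestamps, m):
    """Deterministic 1-out-of-m sampling: keep every m-th packet."""
    return timestamps[::m]

def count_flow_records(timestamps, inactive_timeout=15.0):
    """Count the records one flow key produces under an inactivity timeout."""
    records = 0
    last = None
    for ts in timestamps:
        # A gap longer than the timeout closes the record and opens a new one.
        if last is None or ts - last > inactive_timeout:
            records += 1
        last = ts
    return records

# Hypothetical transfer: one packet per second for 100 seconds.
packet_times = [float(t) for t in range(100)]
unsampled_records = count_flow_records(packet_times)
sampled_records = count_flow_records(sample_one_in_m(packet_times, 20))
print(unsampled_records, sampled_records)
```

With m=20 the sampled packets are 20 seconds apart, which exceeds the timeout, so each sampled packet starts its own record; with m=10 the transfer would still be a single record.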
Sampling: Flow-Level Sampling • Sampling of flow records evicted from flow cache • When evicting flows from table or when analyzing flows • Stratified sampling to put weight on “heavy” flows • Select all long flows and sample the short flows • Reduces the number of flow records • Still measures the vast majority of the traffic • Example flows: Flow 1, 40 bytes; Flow 2, 15580 bytes; Flow 3, 8196 bytes; Flow 4, 5350789 bytes; Flow 5, 532 bytes; Flow 6, 7432 bytes • (Figure: flows are sampled with 0.1%, 10%, or 100% probability)
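One way to sketch stratified flow sampling uses the six example flow sizes and the three sampling probabilities from the slide. The byte thresholds separating the strata (`small`, `large`) are assumptions chosen for illustration, since the slide does not give them; the 1/p weight is a standard way to rescale sampled counts into estimates.

```python
import random

def sampling_probability(size_bytes, small=1_000, large=1_000_000):
    """Map a flow size to a sampling probability (thresholds are hypothetical)."""
    if size_bytes >= large:
        return 1.0
    if size_bytes >= small:
        return 0.1
    return 0.001

def sample_flows(flows, rng):
    """Keep each flow with its stratum probability; weight 1/p rescales estimates."""
    kept = []
    for name, size in flows:
        p = sampling_probability(size)
        if rng.random() < p:
            kept.append((name, size, 1.0 / p))
    return kept

flows = [("Flow 1", 40), ("Flow 2", 15580), ("Flow 3", 8196),
         ("Flow 4", 5350789), ("Flow 5", 532), ("Flow 6", 7432)]
kept = sample_flows(flows, random.Random(0))

# Share of all bytes carried by flows that are always kept (p = 1.0).
total_bytes = sum(size for _, size in flows)
always_kept = sum(size for _, size in flows if sampling_probability(size) == 1.0)
heavy_share = always_kept / total_bytes
print(f"{heavy_share:.2%}")
```

Under these assumed thresholds only Flow 4 is guaranteed to be kept, yet it alone carries over 99% of the bytes, which is the sense in which the sample still measures the vast majority of the traffic.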
Traffic Engineering/Capacity Planning • Need to estimate offered traffic loads • Offered traffic: how to get this? • (Figure labels: BGP policy configuration, topology, eBGP routes, offered traffic, BGP routing model, flow of traffic through the network)
Traffic Matrix Estimation • From link counts to the traffic matrix • (Figure: sources with link counts of 5 Mbps and 3 Mbps; destinations with link counts of 4 Mbps and 4 Mbps)
Formalization • Source-destination pairs • p is a source-destination pair of nodes • x_p is the (unknown) traffic volume for this pair • Links in the network • l is a unidirectional edge • y_l is the observed traffic volume on this link • Routing • R_lp = 1 if link l is on the path for src-dest pair p • Or, R_lp is the proportion of p’s traffic that traverses l • y = Rx (now work back to get x)
Problem: Underconstrained • Linear system is underdetermined • Number of nodes n • Number of links e is around O(n) • Number of src-dest pairs c is O(n^2) • Dimension of solution sub-space at least c - e • Multiple observations can help • k independent observations (over time) • Stochastic model with src-dest counts Poisson and i.i.d. • Maximum likelihood estimation to infer traffic matrix • Using NetFlow to augment byte counts can help constrain the problem • Lots of recent work on traffic matrix estimation
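A tiny worked case shows the underdetermination concretely. It assumes the link counts from the traffic matrix estimation slide describe two ingress links (5 and 3 Mbps) and two egress links (4 and 4 Mbps), with every source able to reach every destination; that topology reading and the two candidate matrices are illustrative assumptions.

```python
def link_loads(R, x):
    """Compute y = R x for routing matrix R (links by pairs) and demands x."""
    return [sum(r * v for r, v in zip(row, x)) for row in R]

# Source-destination pairs ordered as (s1,d1), (s1,d2), (s2,d1), (s2,d2).
R = [
    [1, 1, 0, 0],  # ingress link from source 1
    [0, 0, 1, 1],  # ingress link from source 2
    [1, 0, 1, 0],  # egress link to destination 1
    [0, 1, 0, 1],  # egress link to destination 2
]
observed = [5, 3, 4, 4]  # Mbps on each link

# Two different traffic matrices that explain the same link counts.
candidate_a = [1, 4, 3, 0]
candidate_b = [4, 1, 0, 3]
print(link_loads(R, candidate_a), link_loads(R, candidate_b))
```

Both candidates reproduce the observed loads, so the link counts alone cannot tell them apart; extra information such as repeated observations or NetFlow data is what narrows the solution space.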
Trend: “Deep Packet Inspection” • Intrusion detection capabilities in the router • Analysis of packet payloads • Recognition of application type • Analyzes and identifies application traffic in real time • Implemented with programmable hardware • Different types of analysis techniques • Signature-based • Rule-based • Statistical
Passive Traffic Data Measurement • SNMP byte/packet counts: everywhere • Packet monitoring: selected locations • Flow monitoring: Tracking the application mix • Direct computation of the traffic matrix • Input to denial-of-service attack detection
Active Measurement Tools • Send probes and measure a response • A few common tools: • Ping: RTT and loss • Traceroute: path and RTT • iPerf: generation of CBR flows for throughput estimation • Pathchar: per-hop bandwidth, latency, loss measurement • Pchar, clink: open-source reimplementation of pathchar • Nettimer (Lai): bottleneck bandwidth using packet pair method
Using Traceroute to Measure Paths • Send packets with increasing TTL values • Each router where the TTL expires returns an ICMP “time exceeded” message • (Figure: probes sent with TTL=1, TTL=2, TTL=3)
Problems with Traceroute • Can’t unambiguously identify one-way outages • Failure to reach host: failure of reverse path? • ICMP messages may be filtered or rate-limited • IP address of “time exceeded” packet may be the outgoing interface of the return packet • (Figure: probes sent with TTL=1, TTL=2, TTL=3)
Routing: Why Measure? • Detect anomalies (e.g., route hijacks, routing loops, oscillations, etc.) • More likely: to help explain a problem observed with traffic.
Common Mode: iBGP “Monitor” • iBGP session to software box • Sees only the best route • Two types of logs: • Table dumps • BGP updates
“CPR”: Network Measurement at Georgia Tech • Detection of non-catastrophic failures • Discovered only when users call to complain • “The Internet is slow this morning” • This probably doesn’t coincide with a warning in OpenView. • Little quantitative data to verify problems or determine the possible location of the issue.
Today’s Paper • Paxson, Strategies for Sound Internet Measurement • Lessons • Meta-data • Self-consistency checks • Examining the semantics of the data (e.g., TCP acknowledgements without data) • Analysis of small subsets of data (especially for large datasets) • Automated analysis of long-running measurements
End-to-End Routing Behavior • Routing loops • 3 hours vs. more than 12 hours (two modes) • Erroneous routing • Very long paths • Connectivity altered mid-stream • Fluttering • Infrastructure failures • Two important stability notions • Prevalence • Persistence