Performance Evaluation Techniques: Experiment/Measurement, Analysis, Simulation/Emulation • Experiment/measurement technique: use measurement equipment or measurement programs (software) to directly measure the various performance metrics of a computer system, or quantities related to them, and then compute the corresponding performance metrics from those measurements. • Modeling technique: build an appropriate model of the computer system under evaluation and derive the model's performance metrics in order to evaluate the system; modeling is further divided into analytic and simulation techniques. • Analysis technique: apply mathematical analysis, simplifying the system and building an analytic model from which the system's performance is derived. • Simulation technique: apply software simulation, building a simulation model that describes the computer system in detail and with high fidelity; as the model runs the way the real system does, statistics are collected on its dynamic behavior to obtain the performance metrics of interest. • Measurement, analysis, and simulation are interrelated and cross-validate one another; each has its own strengths and weaknesses. • Emulation = simulation + experiment
Why Measure? • The Internet is a man-made system, so why do we need to measure it? • Because we still don’t really understand it • Because sometimes things go wrong • Analyze/characterize network phenomena • Measurement for network operations • Detecting and diagnosing problems • What-if analysis of future changes • Measurement for scientific discovery • Characterizing a complex system as an organism • Creating accurate models that represent reality • Identifying new features and phenomena • Test new tools, protocols, systems, etc.
Why Build Models of Measurements? • Compact summary of measurements • Efficient way to represent a large data set • E.g., exponential distribution with mean 100 sec • Expose important properties of measurements • Reveals underlying cause or engineering question • E.g., mean RTT to help explain TCP throughput • Generate random but realistic data as input • Generate new data that agree in key properties • E.g., topology models to feed into simulators “All models are wrong, but some models are useful.” – George Box
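As a toy illustration of the last point, here is a minimal sketch (Python, using the exponential distribution with mean 100 sec from the example above) that generates random but "realistic" synthetic data agreeing with the model in its key property; the variable names are ours, not from the slides.

```python
# Minimal sketch: draw synthetic samples from an exponential model
# with mean 100 sec (the example model above), e.g. as simulator input.
import random

MEAN_SEC = 100.0

synthetic = [random.expovariate(1.0 / MEAN_SEC) for _ in range(10_000)]

# The synthetic data agrees with the measurements in the key property (the mean).
print(f"sample mean: {sum(synthetic) / len(synthetic):.1f} sec (model mean: {MEAN_SEC})")
```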
What Can be Measured? • Traffic • Packet or flow traces • Load statistics • Performance of paths • Application performance, e.g., Web download time • Transport performance, e.g., TCP bulk throughput • Network performance, e.g., packet delay and loss • Network structure • Topology, and paths on the topology • Dynamics of the routing protocol • Performance Metrics • Throughput, Latency, Response time, Loss, Utilization, Arrival rate, Bandwidth, Routing (hop), Reliability
Sample Question: Topology? • What is the topology of the network? • At the IP router layer • Without “inside” knowledge or official network maps • Without SNMP or other privileged access • Why do we care? • Often need topologies for simulation and evaluation • Intrinsic interest in how the Internet behaves • “But we built it! We should understand it” • Emergent behavior; organic growth
Where to Measure? • Short answer • Anywhere you can! • End hosts • Sending active probes to measure performance • Application logs, e.g., Web server logs • Individual links/routers • Load statistics, packet traces, flow traces • Configuration state • Routing-protocol messages or table dumps • Alarms
Internet Challenges Make Measurement an Art • Stateless routers • Routers do not routinely store packet/flow state • Measurement is an afterthought, adds overhead • IP narrow waist • IP measurements cannot see below network layer • E.g., link-layer retransmission, tunnels, etc. • Violations of end-to-end argument • E.g., firewalls, address translators, and proxies • Not directly visible, and may block measurements • Decentralized control • Autonomous Systems may block measurements • No global notion of time
Active Measurement: Ping • Adding traffic for purposes of measurement • Send probe packet(s) into the network and measure a response • Trade-offs between accuracy and overhead • Need careful methods to avoid introducing bias • Ping: RTT and connectivity • Host sends an ICMP ECHO packet to a target • … and captures the ICMP ECHO REPLY • Useful for checking connectivity, and RTT • Only requires control of one of the two end-points • Problems with ping • Round-trip rather than one-way delays • Some hosts might not respond
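A minimal sketch of the idea, assuming a Unix-like host with the standard ping utility on PATH; rather than crafting raw ICMP ECHO packets (which needs elevated privileges), it shells out to ping and parses the reported round-trip times. The helper name and target host are illustrative.

```python
# Minimal ping sketch: check connectivity and RTT by parsing `ping` output.
import re
import subprocess

def ping_rtt(host, count=5, timeout_s=15):
    """Return the round-trip times (ms) reported by `ping`, or [] on failure."""
    try:
        out = subprocess.run(["ping", "-c", str(count), host],
                             capture_output=True, text=True,
                             timeout=timeout_s).stdout
    except subprocess.TimeoutExpired:
        return []
    # Reply lines look like: "64 bytes from ...: icmp_seq=1 ttl=54 time=11.9 ms"
    return [float(m) for m in re.findall(r"time=([\d.]+) ms", out)]

rtts = ping_rtt("example.com")
if rtts:
    print(f"connectivity OK, min/avg RTT = {min(rtts):.1f}/{sum(rtts)/len(rtts):.1f} ms")
else:
    print("no replies (host down, unreachable, or filtering ICMP)")
```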
Active Measurement: Traceroute • Traceroute: path and RTT • TTL (Time-To-Live) field in IP packet header • Source sends a packet with a TTL of n • Each router along the path decrements the TTL • “TTL exceeded” sent when TTL reaches 0 • Traceroute tool exploits this TTL behavior • Send packets with increasing TTL values: TTL=1, 2, 3, … and record the source of each “time exceeded” message • [Figure: source sends probes with TTL=1, then TTL=2, … toward the destination; the router where each TTL expires returns a “time exceeded” message]
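A bare-bones sketch of classic UDP-based traceroute, assuming a Unix-like host and root (or CAP_NET_RAW) privileges for the raw ICMP receive socket; it is not paris-traceroute, so it inherits the load-balancing pitfall discussed below, and it naively assumes that the first ICMP packet received is the reply to the probe just sent.

```python
# Minimal traceroute sketch: UDP probes with increasing TTL, raw ICMP replies.
import socket
import time

def traceroute(dest, max_hops=30, timeout=2.0, port=33434):
    dest_ip = socket.gethostbyname(dest)
    for ttl in range(1, max_hops + 1):
        # Raw ICMP socket catches "time exceeded" / "port unreachable" replies;
        # the UDP socket sends the probe with a limited TTL.
        recv = socket.socket(socket.AF_INET, socket.SOCK_RAW, socket.IPPROTO_ICMP)
        send = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
        send.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, ttl)
        recv.settimeout(timeout)
        start = time.time()
        send.sendto(b"", (dest_ip, port))
        hop = None
        try:
            # Simplification: assume the first ICMP packet is our reply.
            _, addr = recv.recvfrom(512)
            hop = addr[0]
            print(f"{ttl:2d}  {hop:15s}  {(time.time() - start) * 1000:.1f} ms")
        except socket.timeout:
            print(f"{ttl:2d}  *")
        finally:
            send.close()
            recv.close()
        if hop == dest_ip:  # destination answered (ICMP port unreachable)
            break

traceroute("example.com")
```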
Problems with Traceroute • Round-trip vs. one-way measurements • Paths may have asymmetric properties • Can’t unambiguously identify one-way outages • Failure to reach a host may actually be a failure of the reverse path • Returns IP address of interfaces, not routers • Routers have multiple interfaces • IP address in the “time exceeded” packet may be that of the outgoing interface of the return packet • Non-participating network elements • Some routers and firewalls don’t reply • “TTL exceeded” ICMP messages may be filtered or rate-limited • Inaccurate delay • RTT includes processing delay on the router
Famous Traceroute Pitfall • Question: What ASes does traffic traverse? • Strawman approach • Run traceroute to destination • Collect IP addresses • Use “whois” to map IP addresses to AS numbers • Thought Questions • What IP address is used to send “time exceeded” messages from routers? • How are interfaces numbered? • How accurate is whois data?
Less Famous Traceroute Pitfall • Measuring multiple paths • Host sends out a sequence of probe packets • Successive probes may traverse different paths • Each probe has a different destination port, so load balancers may send probes along different paths • Question: Why won’t simply using the same port number for every probe work?
Applications of Traceroute • Network troubleshooting • Identify forwarding loops and black holes • Identify long and convoluted paths • See how far the probe packets get • Network topology inference • Launch traceroute probes from many places • … toward many destinations • Join together to fill in parts of the topology • … though traceroute undersamples the edges
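A small sketch of the "join together" step, assuming each traceroute run has already been reduced to an ordered list of responding interface addresses (None for "*" hops); the example addresses are hypothetical documentation prefixes, and, as the slide warns, the resulting graph undersamples the real topology.

```python
# Minimal sketch: merge traceroute paths into an interface-level adjacency graph.
from collections import defaultdict

def build_topology(traceroutes):
    graph = defaultdict(set)
    for path in traceroutes:
        for a, b in zip(path, path[1:]):
            if a is not None and b is not None:  # don't infer edges across '*' hops
                graph[a].add(b)
    return graph

# Hypothetical probes from one vantage point toward two destinations
paths = [
    ["10.0.0.1", "192.0.2.1", "198.51.100.7", "203.0.113.9"],
    ["10.0.0.1", "192.0.2.1", None, "203.0.113.42"],
]
topology = build_topology(paths)
print(sorted(topology["192.0.2.1"]))   # neighbors discovered so far
```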
Active Measurement: Pathchar for Links ---- per-hop capacity, latency, loss • Send probes of varying size L with increasing TTL and take, for each size, the minimum of rtt(i+1) - rtt(i) • Three delay components per hop: latency d, transmission time L/c, and queueing (filtered out by taking the minimum RTT) • [Figure: minimum of rtt(i+1) - rtt(i) plotted against packet size L; the fitted line has slope 1/c and intercept related to d] • How to infer d and c?
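A sketch of the inference step, assuming we already have, for one hop, the minimum of rtt(i+1) - rtt(i) for several probe sizes; the numbers below are made-up illustrative values, and the linear model min_rtt_diff ≈ d + L/c is the simple form suggested by the figure (slope 1/c, intercept d).

```python
# Minimal pathchar-style sketch: fit a line to min RTT differences vs. packet size.
import numpy as np

# Hypothetical measurements: probe sizes L (bytes) and the per-size minimum
# of rtt(i+1) - rtt(i) in seconds, after many repetitions to filter out queueing.
sizes = np.array([64, 256, 512, 1024, 1500])
min_rtt_diff = np.array([0.00212, 0.00227, 0.00248, 0.00289, 0.00327])

# min_rtt_diff ~= d + L/c  =>  slope = 1/c (sec per byte), intercept = d (sec)
slope, intercept = np.polyfit(sizes, min_rtt_diff, 1)

capacity_bps = 8.0 / slope          # bytes/sec -> bits/sec
latency_ms = intercept * 1000.0
print(f"estimated link capacity: {capacity_bps / 1e6:.1f} Mbit/s")
print(f"estimated per-hop latency: {latency_ms:.2f} ms")
```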
Passive Measurement • Passive Measurement • Capture data as it passes by • Two Main Approaches • Packet-level Monitoring • Keep packet-level statistics • Examine (and potentially log) a variety of packet-level fields: essentially, anything in the packet • Timing • Flow-level Monitoring • Monitor packet-by-packet (though sometimes sampled) • Keep aggregate statistics on a flow
Passive Measurement: Logs at Hosts • Web server logs • Host, time, URL, response code, content length, … • E.g., 122.345.131.2 - - [15/Oct/1998:00:00:25 -0400] "GET /images/wwwtlogo.gif HTTP/1.0" 304 - "http://www.aflcio.org/home.htm" "Mozilla/2.0 (compatible; MSIE 3.02; Update a; AK; AOL 4.0; Windows 95)" "-" • DNS logs • Request, response, time • Useful for workload characterization, troubleshooting, etc.
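A small parsing sketch for log lines like the one above, assuming the NCSA/Apache-style layout; the regex group names and the shortened sample line are our own illustrative choices.

```python
# Minimal sketch: pull host, time, URL, response code, and content length
# out of a Web server log line for workload characterization.
import re

LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)')

line = ('122.345.131.2 - - [15/Oct/1998:00:00:25 -0400] '
        '"GET /images/wwwtlogo.gif HTTP/1.0" 304 -')

m = LOG_RE.match(line)
if m:
    size = 0 if m["size"] == "-" else int(m["size"])   # "-" means no body sent
    print(m["host"], m["time"], m["url"], m["status"], size)
```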
Passive Measurement: SNMP • Simple Network Management Protocol (SNMP) • Get the number of packets crossing an interface every 5 minutes, or similarly very coarse state • Coarse-grained counters on the router • E.g., byte and packet counts • Polling • Management system can poll the counters • E.g., once every five minutes • Advantages: ubiquitous • Limitations • Extremely coarse-grained statistics • Delivered over UDP!
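A sketch of what such coarse counters are typically used for: turning two successive polls of an interface byte counter (e.g., ifInOctets) into an average utilization. The function name and the poll values are illustrative; the SNMP polling itself is assumed to have happened elsewhere.

```python
# Minimal sketch: average link utilization from two polls of a byte counter.
def link_utilization(octets_prev, octets_now, poll_interval_s, link_capacity_bps,
                     counter_bits=64):
    wrap = 1 << counter_bits                 # counters wrap; handle one wrap
    delta = (octets_now - octets_prev) % wrap
    bits_per_sec = delta * 8 / poll_interval_s
    return bits_per_sec / link_capacity_bps

# Hypothetical 5-minute (300 s) polls on a 1 Gbit/s link
print(f"{link_utilization(1_200_000_000, 8_700_000_000, 300, 1e9):.1%}")
```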
Passive Measurement: Packet Monitoring • Tapping a link • [Figure: ways to tap a link: a monitor on shared media (Ethernet, wireless) alongside Hosts A, B, and C; a switch that multicasts traffic between Host A and Host B to a monitor; splitting a point-to-point link between Router A and Router B with a monitor on the split; a router line card that does packet sampling]
Packet Monitoring: Selecting the Traffic • Filter to focus on a subset of the packets • IP addresses/prefixes (e.g., to/from specific Web sites, client machines, DNS servers, mail servers) • Protocol (e.g., TCP, UDP, or ICMP) • Port numbers (e.g., HTTP, DNS, BGP, Napster) • Collect first n bytes of packet (snap length) • Medium access control header (if present) • IP header (typically 20 bytes) • IP+UDP header (typically 28 bytes) • IP+TCP header (typically 40 bytes) • Application-layer message (entire packet)
Packet Capture: tcpdump/bpf • Put the interface in promiscuous mode • Use bpf (Berkeley packet filter) to extract packets of interest • Accuracy Issues • Packets may be dropped by the filter • Failure of tcpdump to keep up with the filter • Failure of the filter to keep up with the link speed • Question: How to recover lost information from packet drops?
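A minimal capture sketch using scapy, assuming it is installed and the process has capture privileges; the BPF filter string uses the same syntax as tcpdump, and the interface name eth0 is hypothetical.

```python
# Minimal sketch: capture a few packets of interest with a BPF filter.
from scapy.all import sniff

def show(pkt):
    # One-line summary per captured packet (addresses, ports, flags).
    print(pkt.summary())

# Capture 10 packets to/from TCP port 80 on a hypothetical interface.
sniff(iface="eth0", filter="tcp port 80", count=10, prn=show)
```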
Tcpdump Output (three-way TCP handshake and HTTP request message) • Annotations on the slide point out the timestamp, the client address and port number, the Web server (port 80), the SYN flag, the sequence numbers, and the TCP options.
23:40:21.008043 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: S 617756405:617756405(0) win 32120 <mss 1460,sackOK,timestamp 46339 0,nop,wscale 0> (DF)
23:40:21.036758 eth0 < lovelace.acm.org.www > 135.207.38.125.1043: S 2598794605:2598794605(0) ack 617756406 win 16384 <mss 512>
23:40:21.036789 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: . 1:1(0) ack 1 win 32120 (DF)
23:40:21.037372 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: P 1:513(512) ack 1 win 32256 (DF)
23:40:21.085106 eth0 < lovelace.acm.org.www > 135.207.38.125.1043: . 1:1(0) ack 513 win 16384
23:40:21.085140 eth0 > 135.207.38.125.1043 > lovelace.acm.org.www: P 513:676(163) ack 1 win 32256 (DF)
23:40:21.124835 eth0 < lovelace.acm.org.www > 135.207.38.125.1043: P 1:179(178) ack 676 win 16384
Analysis of Packet Traces • IP header • Traffic volume by IP addresses or protocol • Burstiness of the stream of packets • Packet properties (e.g., sizes, out-of-order) • TCP header • Traffic breakdown by application (e.g., Web) • TCP congestion and flow control • Number of bytes and packets per session • Application header • URLs, HTTP headers (e.g., cacheable response?) • DNS queries and responses, user key strokes, …
Aggregating Packets into IP Flows • Set of packets that “belong together” • Source/destination IP addresses and port # • Same protocol, ToS bits, … • Same input/output interfaces at a router • Packets that are “close” together in time • Maximum spacing between packets (e.g., 15 sec, 30 sec) • Example: flows 2 and 4 are different flows due to time [Figure: packet timeline grouped into flow 1, flow 2, flow 3, flow 4]
Traffic Flow Statistics • Flow monitoring (e.g., Cisco Netflow) • Statistics about groups of related packets (e.g., same IP/TCP headers and close in time) • Records header information, counts, and time • May be sampled • Flow Record Contents • Basic information about the flow… • Source and destination IP address and port • Packet and byte counts • Start and end times • ToS, TCP flags • …plus information related to routing • Next-hop IP address • Source and destination AS • Source and destination prefix
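A sketch of timeout-based flow aggregation as described above, assuming packets have already been reduced to (timestamp, 5-tuple, length) records; it keeps the basic flow-record fields (key, start/end times, packet and byte counts) but none of the routing-related ones.

```python
# Minimal sketch: group packets into flows by 5-tuple with an inactivity timeout.
from collections import namedtuple

Packet = namedtuple("Packet", "ts src dst proto sport dport length")

def aggregate_flows(packets, timeout=30.0):
    flows = []    # flow records, each a dict of basic Netflow-like fields
    active = {}   # 5-tuple -> index of its most recent flow record
    for p in sorted(packets, key=lambda p: p.ts):
        key = (p.src, p.dst, p.proto, p.sport, p.dport)
        idx = active.get(key)
        if idx is not None and p.ts - flows[idx]["end"] <= timeout:
            f = flows[idx]                      # packet continues an active flow
            f["end"] = p.ts
            f["packets"] += 1
            f["bytes"] += p.length
        else:                                   # gap too large: start a new flow
            active[key] = len(flows)
            flows.append({"key": key, "start": p.ts, "end": p.ts,
                          "packets": 1, "bytes": p.length})
    return flows
```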
Why Trust Your Data? • Measurement requires a degree of suspicion • Why should I trust your data? Why should you? • Resolving that... • Use current best practices • e.g., paris-traceroute, CAIDA topologies, etc. • Don't trust the data until forced to • Sanity checks and cross-validation • Spot checks (when applicable)
Strategy: Examine the Zeroth-Order • Paxson calls this “looking at spikes and outliers” • More general: Look at the data, not just aggregate statistics • Tempting/dangerous to blindly compute aggregates • Time-series plots are telling (gaps, spikes, etc.) • Basics • Are the raw trace files empty? • Need not be 0-byte files (e.g., BGP update logs have state messages but no updates) • Metadata/context: Did weird things happen during collection (machine crash, disk full, etc.)?
Strategy: Sanity Checks & Cross-Validation • Paxson breaks cross-validation into two aspects • Self-consistency checks (and sanity checks) • Independent observations (looking at the same phenomenon in multiple ways) • Example Sanity Checks • Exploit additional properties of the measured phenomenon • E.g., TCP’s reliable delivery and cumulative ACKs help when measuring packet drops • Is time moving backwards? • Typical cause: clock synchronization issues • Has the speed of light increased? • E.g., 10 ms cross-country latencies • Do values make sense? • IP addresses like 0.0.1.2 indicate a bug
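A sketch of a few of these checks, assuming measurement records are simple dicts with a timestamp and, where available, an RTT and a source address; the field names and the 20 ms cross-country threshold are illustrative choices.

```python
# Minimal sketch: self-consistency and sanity checks over a list of records.
def sanity_check(records, min_cross_country_rtt_ms=20.0):
    problems = []
    last_ts = None
    for i, r in enumerate(records):
        # Is time moving backwards? (typical cause: clock synchronization)
        if last_ts is not None and r["ts"] < last_ts:
            problems.append((i, "timestamp moved backwards"))
        last_ts = r["ts"]
        # Has the speed of light increased? Tiny cross-country RTTs are implausible.
        rtt = r.get("rtt_ms")
        if rtt is not None and rtt < min_cross_country_rtt_ms:
            problems.append((i, f"implausibly small RTT: {rtt} ms"))
        # Do values make sense? Addresses like 0.0.1.2 usually indicate a bug.
        if r.get("src_ip", "").startswith("0."):
            problems.append((i, f"suspicious IP address: {r['src_ip']}"))
    return problems
```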
Cross-Validation Example • Traceroutes captured in parallel with BGP routing updates • Puzzle • Route monitor sees route withdrawal for prefix • Routing table has no route to the prefix • IP addresses within prefix still reachable from within the IP address space (i.e., traceroute goes through) • Why? • Collection bugs … or • Broken mental model of routing setup: A default route!
Measurement Challenges for Operators • Network-wide view • Crucial for evaluating control actions • Multiple kinds of data from multiple locations • Large scale • Large number of high-speed links and routers • Large volume of measurement data • Poor state-of-the-art • Working within existing protocols and products • Technology not designed with measurement in mind • The “do no harm” principle • Don’t degrade router performance • Don’t require disabling key router features • Don’t overload the network with measurement data
Network Operations Tasks • Reporting of network-wide statistics • Generating basic information about usage and reliability • Performance/reliability troubleshooting • Detecting and diagnosing anomalous events • Security • Detecting, diagnosing, and blocking security problems • Traffic engineering • Adjusting network configuration to the prevailing traffic • Capacity planning • Deciding where and when to install new equipment
Basic Reporting • Producing basic statistics about the network • For business purposes, network planning, ad hoc studies • Examples • Proportion of transit vs. customer-customer traffic • Total volume of traffic sent to/from each private peer • Mixture of traffic by application (Web, Napster, etc.) • Mixture of traffic to/from individual customers • Usage, loss, and reliability trends for each link • Requirements • Network-wide view of basic traffic and reliability statistics • Ability to “slice and dice” measurements in different ways (e.g., by application, by customer, by peer, by link type)
Troubleshooting • Detecting and diagnosing problems • Recognizing and explaining anomalous events • Examples • Why a backbone link is suddenly overloaded • Why the route to a destination prefix is flapping • Why DNS queries are failing with high probability • Why a route processor has high CPU utilization • Why a customer cannot reach certain Web sites • Requirements • Network-wide view of many protocols and systems • Diverse measurements at different protocol levels • Thresholds for isolating significant phenomena
Security • Detecting and diagnosing problems • Recognizing suspicious traffic or disruptions • Examples • Denial-of-service attack on a customer or service • Spread of a worm or virus through the network • Route hijack of an address block by adversary • Requirements • Detailed measurements from multiple places • Including deep-packet inspection, in some cases • Online analysis of the data • Installing filters to block the offending traffic
Traffic Engineering • Adjusting resource allocation policies • Path selection, buffer management, and link scheduling • Examples • OSPF weights to divert traffic from congested links • BGP policies to balance load on peering links • Link-scheduling weights to reduce delay for “gold” traffic • Requirements • Network-wide view of the traffic carried in the backbone • Timely view of the network topology and config • Accurate models to predict impact of control operations (e.g., the impact of RED parameters on TCP throughput)
Capacity Planning • Deciding whether to buy/install new equipment • What? Where? When? • Examples • Where to put the next backbone router • When to upgrade a link to higher capacity • Whether to add/remove a particular peer • Whether the network can accommodate a new customer • Whether to install a caching proxy for cable modems • Requirements • Projections of future traffic patterns from measurements • Cost estimates for buying/deploying new equipment • Model of the potential impact of the change (e.g., latency reduction and bandwidth savings from a caching proxy)
Examples of Public Data Sets • Network-wide data • Abilene and GEANT backbones • Netflow, IGP, and BGP traces • CAIDA DatCat • Data catalogue maintained by CAIDA • http://imdc.datcat.org/ • Interdomain routing • RouteViews and RIPE-NCC • BGP routing tables and update messages • Traceroute and looking glass servers • http://www.traceroute.org/ • http://www.nanog.org/lookingglass.html
PlanetLab for Network Measurement • Nodes are largely at academic sites • Other alternatives: RON testbed (disadvantage: smaller, less software support) • Repeatability of network experiments is tricky • Proportional sharing • Minimum guarantees provided by limiting the number of outstanding shares • Work-conserving CPU scheduler means experiment could get more resources if there is less contention