Automatic Computing and Inference. Thadpong (Ted) Pongthawornkamol CS598IG, Fall 2004. Motivation. System evolves over time Is it heading to the right way? What does “right” mean? Measurement & adjustment. What ‘s desired property? How to achieve?. actuator. system. sensor.
Automatic Computing and Inference Thadpong (Ted) Pongthawornkamol CS598IG, Fall 2004
Motivation • System evolves over time • Is it heading the right way? • What does “right” mean? • Measurement & adjustment • [Diagram: feedback loop of sensor, system, and actuator, annotated with the questions below] • What is the desired property? How can it be achieved? • What do changes imply? What can be inferred? • What characteristics are observed? How are changes measured?
Papers • Different goals for different problems at different scales • Correctness in a protocol (Musuvathi & Engler) • Capacity and efficiency in the Internet (CAIDA) • Data availability & redundancy in distributed storage (Bhagwan et al.)
Model Checking Large Network Protocol Implementations M. Musuvathi, D. R. Engler Computer Systems Laboratory Stanford University
Goal • Network protocol implementations • have intractably large state spaces • make bugs difficult to find and correct • Mathematical model? • Not fine-grained enough • Only suitable for protocol specifications or small implementations
Testing methods • User can help! • Easiest and simplest way • But… • Conventional approaches (simulator, real end-to-end test, trace analysis) • Good for small systems, not for large-scale network protocols • Reveal only common-case bugs
Model checking • A well-known formal verification method • Explicit-state model checker • Views the system as a finite state machine • From the initial state, performs a nondeterministic search over all possible input events • Checks validity in each state visited • Repeats the process until either • all states are explored, or • the model checker runs out of resources
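The search loop described on this slide can be sketched in a few lines. This is a minimal illustration, not CMC itself: the function `explore`, the `max_states` budget, and the two toy mutual-exclusion protocols are all assumptions made up for the example.

```python
from collections import deque

def explore(initial, successors, invariant, max_states=10_000):
    """Breadth-first explicit-state search over a finite state machine.

    Returns (violating_state, states_seen); violating_state is None when no
    visited state breaks the invariant.
    """
    seen = {initial}              # states already generated
    frontier = deque([initial])   # states whose successors are still pending
    while frontier:
        state = frontier.popleft()
        if not invariant(state):          # check validity in each state visited
            return state, len(seen)
        for nxt in successors(state):     # one successor per possible event
            if nxt in seen or len(seen) >= max_states:
                continue                  # already explored, or out of budget
            seen.add(nxt)
            frontier.append(nxt)
    return None, len(seen)

# Hypothetical two-process protocol: each process cycles through three phases.
STEP = {"idle": "trying", "trying": "critical", "critical": "idle"}

def successors_unguarded(state):
    """Buggy protocol: either process may advance at any time."""
    for i in (0, 1):
        nxt = list(state)
        nxt[i] = STEP[state[i]]
        yield tuple(nxt)

def successors_guarded(state):
    """Fixed protocol: a process may not enter while the other is critical."""
    for i in (0, 1):
        if state[i] == "trying" and state[1 - i] == "critical":
            continue
        nxt = list(state)
        nxt[i] = STEP[state[i]]
        yield tuple(nxt)

def mutual_exclusion(state):
    """Correctness property: both processes are never critical together."""
    return state != ("critical", "critical")

bad, _ = explore(("idle", "idle"), successors_unguarded, mutual_exclusion)
ok, count = explore(("idle", "idle"), successors_guarded, mutual_exclusion)
print("unguarded violation:", bad)
print("guarded violation:", ok, "states seen:", count)
```

The unguarded protocol reaches the state where both processes are critical, while the guarded one explores its whole (tiny) state space without a violation. Real protocol code has far more states, which is why the budget check matters.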
Example Model of mutual exclusion (borrowed from http://www.eecs.umich.edu/~vale/publications/Date04Tut.pdf)
CMC • A C Model Checker • An explicit-state model checker for real implementation code • Runs real protocol implementations within a simulated environment (user model, network model, etc.) • Periodically checkpoints state • Enables backtracking to a previous state • Avoids revisiting explored states • Previously used successfully to check AODV protocol implementations • Detected an error every few hundred lines of code • Discovered an error in the AODV protocol specification itself • In this paper • Examines the TCP implementation in the Linux kernel
TCP: the big guy • Basic but complex protocol • Internet standard protocol (the neck of the hourglass model) • Heavily tested and used in almost all distributed applications • Target TCP implementation • TCP code in Linux kernel version 2.4.19 • 50K lines of code
Testing steps • User extracts the core protocol from the real implementation • Makes it runnable as a user process in the closed CMC environment • User specifies the state transitions in the system • Defines all possible events from the environment that can interact with the protocol • User specifies correctness properties • Boolean conditions • Allowed ranges of values
1st attempt: “Keep it small” • Extract only the TCP core module from the kernel code • Implement stub code to provide kernel-supported routines to the TCP module • Small model (just the TCP core) • Few states • [Diagram: a CMC client process and a CMC server process, each containing an environment, a stub implementation, and the TCP code, connected by a network (shared memory)]
1st attempt: Failed • Too many interfaces required • 150 function calls between the TCP module and the stub implementation • The stub implementation grows bigger and contains bugs of its own • Bugs in the debugger! • False positives • [Diagram: a CMC client process and a CMC server process, each containing an environment, a stub implementation, and the TCP code, connected by a network (shared memory)]
2nd attempt: “Keep it clear” • Run the entire kernel code (including the TCP module) inside CMC • CMC supports system calls and hardware abstraction routines • Bigger system, more states • Using standard interfaces makes the code reusable for future versions of TCP • [Diagram: a CMC client process and a CMC server process, each containing an environment, the real kernel, and the TCP code, connected by a network (shared memory)]
Testing environment • Application thread makes socket-related system calls (listen, connect, read, write) • Network thread simulates packet interrupts • Timer thread generates timer interrupts (i.e. timeouts) • [Diagram: a CMC client process and a CMC server process, each with an application thread, the real kernel with its TCP code, a timer thread, a network thread, and a CMC checker, connected by a network (shared memory)]
Keeping states • Space • A hash table • Keeps the states already seen • Hash compaction reduces each stored state to 8 bytes • A queue • Keeps states that have been seen but whose successor states have not yet been generated • [Table: the state size for the TCP model]
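A rough sketch of the two structures on this slide follows. It assumes states arrive as serialized bytes, and it uses BLAKE2b truncated to 8 bytes as the compaction hash; the slide does not say which hash CMC uses, so that choice and the names `compact` and `StateStore` are hypothetical.

```python
import hashlib
from collections import deque

def compact(state_bytes: bytes) -> bytes:
    """8-byte signature of a serialized state (the hash choice is an assumption)."""
    return hashlib.blake2b(state_bytes, digest_size=8).digest()

class StateStore:
    def __init__(self):
        self.seen = set()        # 8-byte signatures of every state seen so far
        self.pending = deque()   # full states whose successors are not yet generated

    def add(self, state_bytes: bytes) -> bool:
        """Record a state; return False if its signature was already present."""
        sig = compact(state_bytes)
        if sig in self.seen:
            return False
        self.seen.add(sig)
        self.pending.append(state_bytes)
        return True

    def next_state(self):
        """Pop the next state to expand, or None when the queue is empty."""
        return self.pending.popleft() if self.pending else None

store = StateStore()
print(store.add(b"state-A"), store.add(b"state-A"), len(compact(b"x" * 1000)))
```

Storing only signatures keeps the table small regardless of state size. The price is a small chance that two distinct states collide, in which case the second one would be skipped.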
Running it
1. The checker restores a system state from the queue.
2. The checker transfers control to one of the enabled threads in the system.
3. The current thread generates events or invokes system calls into the TCP code.
4. The thread yields control back to the checker, which verifies correctness and stores the new state in the hash table or the queue.
5. Go back to step 1.
[Diagram: CMC client process with application thread, timer thread, network thread, real kernel with TCP code, and CMC checker; arrows labeled 1 to 3 match the steps above]
Optimization techniques • Reduce computation time & space • Incremental State Processing • Incremental Heap Canonicalization • Exploring Interesting Protocol Behaviors
Reduce the time • Incremental states • Most portions of a state remain unchanged (only the TCP portion changes) • Break the entire state into smaller objects • Process only the modified objects • [Figure: time taken for a single CMC transition]
Heap Canonicalization • A mechanism that avoids exploring redundant heap states, i.e. states that differ only at the bit level • The test uses an improved version of heap canonicalization called “incremental heap canonicalization” • [Diagram: two heaps, each with a buffer and the strings “Happy”, “Thanksgiving”, and “Break”]
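To make the idea concrete, here is a small sketch of plain (non-incremental) heap canonicalization. It assumes a heap modeled as a dictionary from address to payload and child pointers; the function name and the addresses are invented for illustration, and the incremental variant used in the test is not shown.

```python
def canonicalize(heap, roots):
    """Rename heap objects by the order a deterministic traversal reaches them.

    heap:  {address: (payload, [child addresses])}
    roots: list of root addresses, in a fixed order
    Returns a tuple that does not depend on the concrete addresses.
    """
    new_id = {}   # original address -> canonical index
    order = []    # addresses in visit order
    stack = list(reversed(roots))
    while stack:
        addr = stack.pop()
        if addr in new_id:
            continue
        new_id[addr] = len(order)
        order.append(addr)
        # Push children reversed so they are visited in their listed order.
        stack.extend(reversed(heap[addr][1]))
    return tuple(
        (heap[a][0], tuple(new_id[c] for c in heap[a][1])) for a in order
    )

# Same logical heap, objects allocated at different addresses.
heap_a = {
    0x10: ("buffer", [0x20, 0x30, 0x40]),
    0x20: ("Happy", []),
    0x30: ("Thanksgiving", []),
    0x40: ("Break", []),
}
heap_b = {
    0x90: ("buffer", [0x50, 0x80, 0x60]),
    0x50: ("Happy", []),
    0x80: ("Thanksgiving", []),
    0x60: ("Break", []),
}
print(canonicalize(heap_a, [0x10]) == canonicalize(heap_b, [0x90]))
```

Because both heaps reduce to the same canonical tuple, a checker that hashes the canonical form treats them as one state instead of two.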
Exploring Interesting Protocol Behaviors • Pruning techniques • Some state transitions are important, others are not (e.g. a change to a counter variable) • Focus only on interesting transitions • The new state differs substantially from the old state • Compare the concrete state (stack, heap, variables) with the abstract state (LISTEN, CLOSED, etc.) • “Symmetry reduction” on abstract states merges superficially different abstract states into one group • Heuristic • A change to a rarely changed variable is more interesting than a change to a frequently changed variable • Eliminates changes from most counter variables
Checking correctness • Two types of correctness • Specification • Likely to be correct for a protocol as long-used as TCP • Implementation • Likely to contain bugs • The focus of the test • Checking considerations • Protocol conformance • Run & check the implementation alongside a small reference model • False positives are possible (due to protocol ambiguities) • Resource leaks • Check available resources after closing a TCP connection • Some resources might legitimately not be freed (such as caches, stats) • Implementation robustness • Considers a malicious network, packet mutation, modified checksums
Bugs found • Four bugs found • Failure to acknowledge an RST packet in the SYN_RCVD state when the ACK bit is not set • Leads to resource lockup after closing the connection • Incorrect handling of a duplicate SYN_ACK packet • A duplicate packet should be ignored • Dropping the connection without notifying the peer after a premature close by the application in the SYN_RCVD state • The peer has to wait until time-out • Peculiar behavior when the application aborts the connection just before data transfer is complete • Replies with data followed by an RST packet • One protocol ambiguity found • FIN packet transmission on a zero window • Queues a FIN packet even when no data is queued • Expected behavior according to the specification
Coverage • The portion of the implementation & model exercised by the test • The test covers most of the implementation and almost all of the model • Detects & eliminates almost all bugs (?)
Conclusion • CMC model checker • Model checker for real protocol implementations • Runs the real implementation in a simulated environment • Nondeterministic traversal & checking of the protocol state space • Several optimizations such as incremental states and heap canonicalization • Shown to find bugs in the real Linux TCP implementation • Questions?
Discussion • Low set-up cost? • Another reference model is needed • It might itself contain bugs when testing a larger system • Cross-checking between implementations is suggested • A specific environment has to be simulated • Good coverage? • Pruning may cut off some possible bugs • Generic model? • Many explicit differences between the two tests (AODV & TCP) • Trade-off between accuracy & generality • P2P-related? • Can it be used to debug group communication (i.e. multicast)? • Can it be run on a distributed testbed (grid, P2P)?
CAIDA: The Cooperative Association for Internet Data Analysis http://www.caida.org (parts of these slides are taken from CAIDA’s presentation & website)
CAIDA? What is it? “a collaborative undertaking among organizations with a strong interest in keeping primary Internet capacity and usage efficiency in line with ever-increasing demand”
Current works • Analysis • Active: macroscopic topology project • Passive: (real-time) traffic workload characterization • DNS analysis • Routing analysis and modeling • Performance/bandwidth estimation methods and tools • Internet Measurement Data Catalogue (IMDC) • Security issues • Tools development • Workload: CoralReef • Topology: Skitter, Gtrace • Data management: NetGeo • Utilities: RRDTool • Visualization tools: Walrus
Skitter • Active probing tool for topology and performance analysis • Same procedure as traceroute • Incremental ICMP echo requests • Increases the TTL value with every request • Collects topology • Determines RTT from source to destination • Only a single run from source to destination • Aggressive probing throughout the network • Not available for download http://www.caida.org/tools/measurement/skitter/
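The TTL-stepping procedure can be illustrated without touching the network. The sketch below simulates probes over a hypothetical list of hops; `send_probe`, `discover`, and the hop names are assumptions for the example, and RTT measurement is left out.

```python
def send_probe(path, ttl):
    """Simulate one probe along `path` (routers first, destination last).

    Each router decrements the TTL; the router where it reaches zero answers
    with a time-exceeded message, and the destination answers with an echo reply.
    """
    for i, hop in enumerate(path):
        if i == len(path) - 1:
            return hop, "echo-reply"
        ttl -= 1
        if ttl == 0:
            return hop, "time-exceeded"

def discover(path, max_ttl=30):
    """Send probes with TTL = 1, 2, ... and collect the hop that answers each."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        hop, reply = send_probe(path, ttl)
        hops.append(hop)
        if reply == "echo-reply":   # destination reached, topology complete
            break
    return hops

print(discover(["router-1", "router-2", "destination"]))
```

Each probe reveals exactly one more hop, so a path of n hops costs n probes; a real tool would also time each reply to estimate RTT.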
CoralReef • Passive network measurement tool • drivers • libraries • utilities • analysis software • Analyze data from passive Internet traffic monitors • Ethernet or FDDI • In real time or from trace file http://www.caida.org/tools/measurement/coralreef/
Visit http://www.caida.org/outreach/resources/animations/ for more insight into active probing (Skitter) and passive monitoring (CoralReef)
GTrace • Traceroute with geographic information • The result map can be zoomed to different levels • Supports the addition of new maps and non-geographical network diagrams • Various methods to determine the physical location of each node in the path • Needs the NetGeo service for mapping IP addresses to locations http://www.caida.org/tools/visualization/gtrace/
RRD Tool • Round robin database • Store & display time-series data • Network bandwidth • CPU workload • Memory usage • Display data in graph format http://www.caida.org/tools/utilities/rrdtool/
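The “round robin” idea is a fixed-size store in which new samples overwrite the oldest ones. The class below is a minimal sketch of that idea only; it is not RRDtool’s file format or API, and the name `RoundRobinArchive` is made up for the example.

```python
class RoundRobinArchive:
    """Fixed-size store for time-series samples; old samples are overwritten."""

    def __init__(self, slots):
        self.values = [None] * slots
        self.next = 0    # index the next sample will be written to
        self.count = 0   # number of valid samples, at most `slots`

    def update(self, value):
        self.values[self.next] = value
        self.next = (self.next + 1) % len(self.values)
        self.count = min(self.count + 1, len(self.values))

    def series(self):
        """Return the stored samples, oldest first."""
        n = len(self.values)
        start = (self.next - self.count) % n
        return [self.values[(start + i) % n] for i in range(self.count)]

bandwidth = RoundRobinArchive(slots=3)
for sample in [1, 2, 3, 4, 5]:
    bandwidth.update(sample)
print(bandwidth.series())
```

Storage never grows, which suits long-running monitors of bandwidth, CPU load, or memory usage where only a recent window is graphed.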
Walrus • 3-D interactive visualization tool for large directed graphs • Best suited to visualizing moderately sized graphs that are nearly trees • Up to a few thousand nodes • Displays a graph whose layout is based on a user-supplied spanning tree • Nodes near the center are magnified, while those near the boundary are shrunk • Provides a global view of the system http://www.caida.org/tools/visualization/walrus/
Conclusion & Discussion • CAIDA • An organization • Measurement & analysis of the Internet’s behavior • Focus on capacity & efficiency of the Internet • Measurement tools • Topology: Skitter, Gtrace • Workload: CoralReef • Visualization: Walrus • Utilities: RRD tool
Total Recall: System Support for Automated Availability Management R. Bhagwan, K. Tati, Y. Cheng, S. Savage, G. M. Voelker Department of Computer Science and Engineering University of California, San Diego
Problems • Consider a distributed storage system • Physical node & link failures may make some data unavailable • Periodic node joins and leaves • Short-term • Nodes leave & come back • Long-term • New members join & old members leave • Replication can help, but… • How, and how much, data is replicated depends on • Expected availability • File size • Current network size & status • Trade-off between performance & efficiency • Frequent changes in the system may change the appropriate policy over time
CFS (Cooperative File System) • File System (CFS): user interface, file & directory operations • Block Store (DHash): block management, simple replication / repair, quota enforcement • DHT (Chord): block lookup, overlay routing
Redundancy module • Plug-in for distributed storage • Retains data availability at a user-specified level • Handles system dynamism through periodic policy prediction & dynamic repair • Automated process without human interaction • [Diagram: layer stack, top to bottom: Total Recall File System (TRFS), Storage Manager (Total Recall), Block Store (DHash), DHT (Chord)]
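Architecture • Storage Manager (Total Recall) components: Availability Monitor, Policy Module, Redundancy Engine • Questions annotated on the diagram: How to create/repair a file? When to repair a file? How many copies should be created? • [Diagram: layer stack, top to bottom: File System (TRFS), Storage Manager (Total Recall), Block Store (DHash), DHT (Chord)]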
Availability Monitor • Predicts & maintains two values • Short-term host availability • The minimum average of all tracked hosts available in the past 24 hours • Used when a file is created • Long-term decay rate • Host departure rate over days and weeks • Distinguishes permanent departures from transient departures • Used to trigger repair
Redundancy Engine • Triggered by • Availability Monitor when repair is needed • Policy Module when a file is created or updated • Invokes redundancy mechanism • Replication • Erasure coding
Replication • [Diagram: data copied to several nodes in distributed storage] • Simple replication: given a target level of availability $A$, a mean host availability $\mu_H$, and a number of required replicas $c$: $A = 1 - (1 - \mu_H)^c$ • Striping + replication (for load balancing): data is divided into $b$ pieces and each piece is replicated into $c$ copies: $A = \left(1 - (1 - \mu_H)^c\right)^b$
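The two formulas on this slide can be turned around to ask how many replicas a target availability needs. The sketch below does that by simple search; it assumes hosts fail independently (which the formulas already imply), and the function names and the example numbers are illustrative only.

```python
def availability(mu_h, c, b=1):
    """A = (1 - (1 - mu_h)^c)^b; with b = 1 this is simple replication."""
    return (1 - (1 - mu_h) ** c) ** b

def replicas_needed(target, mu_h, b=1, max_c=1000):
    """Smallest c whose availability meets the target (linear search)."""
    for c in range(1, max_c + 1):
        if availability(mu_h, c, b) >= target:
            return c
    raise ValueError("target not reachable within max_c replicas")

# Example numbers are assumptions: hosts are up half the time, target is 99%.
print(replicas_needed(0.99, 0.5))        # whole-file replication
print(replicas_needed(0.99, 0.5, b=4))   # file striped into 4 pieces
```

Striping raises the replica count because every one of the b pieces must be reachable at once, which is the cost paid for load balancing.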
Erasure Coding • Data is striped into b blocks, then erasure-coded into cb blocks • At least b of the cb blocks are required to reconstruct the data • Extremely efficient • c (storage overhead) is smaller than with simple replication • Quadratic coding time • Reads and writes take longer
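The “at least b out of cb blocks” condition translates into a binomial tail if each block sits on a different host and hosts are available independently with probability μH. That independence assumption, the function names, and the example numbers below are not from the slide; the sketch only compares the two schemes at equal storage overhead.

```python
from math import comb

def erasure_availability(mu_h, b, n):
    """Probability that at least b of n coded blocks are reachable,
    assuming each block is on a distinct, independently available host."""
    return sum(
        comb(n, j) * mu_h ** j * (1 - mu_h) ** (n - j)
        for j in range(b, n + 1)
    )

def striped_replication_availability(mu_h, b, c):
    """Striping + replication: every one of b pieces needs a live copy."""
    return (1 - (1 - mu_h) ** c) ** b

# Same overhead c = 2 (4 stored blocks for b = 2 data blocks), hosts up 50%.
print(erasure_availability(0.5, b=2, n=4))
print(striped_replication_availability(0.5, b=2, c=2))
```

With the same number of stored blocks, the erasure-coded file is available more often, because any b blocks will do rather than one copy of each specific piece.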
Policy Module • Invoked by clients when a file is created or updated • Determines the best strategy for maintaining stored data • Redundancy mechanism • Repair policy • Number of blocks (b) used to store data • Typical policy • Replication + eager repair for small files & inodes • Erasure coding + lazy repair for large files