Static Path-Aware Analysis of Program Invariants Murali Krishna Ramanathan Department of Computer Science Purdue University (joint work with Suresh Jagannathan and Ananth Grama)
Motivation • [Figure labels: Undocumented Program; Expert Programmer; New Programmer; Tester; BUGS; caption "How do I use this?"]
Context • What is a program invariant? • Property that must hold across all program executions • What is a failure? • Program run does not satisfy an expected invariant • System crashes • Logical bugs • Performance bugs • What is a specification? • Documentation of intended program invariants • e.g., lock must be followed by unlock • Unavailable or imprecise
Issues • Deriving specifications • Where do we start? • Absence of formal documentation • Legacy code • Identifying the source of failures • How do we search? • Exponential number of execution paths to explore • Representing common information among paths
Specification Inference • Challenges • What to look for? • Both relevant and irrelevant information present in the program source • How to be robust in the presence of bugs? • Assumptions • Programs are mostly well tested but can have bugs • Transparent – no programmer annotations
Kinds of specifications • Control-flow preconditions • A call to fopen must always precede a call to fgets • Data-flow preconditions • The result of a call to socket must always be checked for error before a call to bind • Control-flow postconditions • A call to fopen is either followed by a call to fclose or error • Control-flow divergence preconditions • A call to read can be preceded either by a call to open or socket • …
Preconditions fp = fopen(…); if(fp != NULL) fgets(buf, SIZE, fp); • Predicate • Captures properties associated with variables and procedure calls • Preconditions for procedure • Composed of predicates that must hold before every call to a procedure • Predicates for fgets in this example: fp := fopen(…), fp != NULL, fopen <- fgets
Types of predicates fp = fopen(…); if(fp != NULL) fgets(buf, SIZE, fp); • Data-flow • Captures data-flow properties associated with variables • fp is assigned the return value of fopen, fp is not null • Control-flow • Defines precedence properties among procedures • fgets is preceded by fopen
Control-flow preconditions (ICSE 07) 181 RI_FKey_check(PG_FUNCTION_ARGS) 182 { 199 ri_CheckTrigger(...); 210 pk_rel = heap_open(...); 296 match_type = ri_DetermineMatchType(...); 303 ri_BuildQueryKeyFull(...); 437 } “Check that RI trigger function was called in expected context” “Get the relation descriptors of the FK and PK tables…” “Convert the MATCH TYPE string into a switchable int” “Build up a new hashtable key for a prepared SPI Plan of a constraint trigger of MATCH FULL …”
Control-flow preconditions 181 RI_FKey_check(PG_FUNCTION_ARGS) 182 { 199 ri_CheckTrigger(...); 210 pk_rel = heap_open(...); 212 if(TRIGGER_FIRED_BY_UPDATE(...)) ... 218 else ... 231 if(!HeapTupleSatisfies(...)) ... 296 match_type = ri_DetermineMatchType(...); 298 if(match_type==RI_MATCH_TYPE_PARTIAL) 299 ereport(...); 303 ri_BuildQueryKeyFull(...); 437 }
Control-flow preconditions 181 RI_FKey_check(PG_FUNCTION_ARGS) 182 { 199 ri_CheckTrigger(...); 210 pk_rel = heap_open(...); 248 if (tgnargs == 4) 249 { 250 ri_BuildQueryKeyFull(...); 294 } 437 } • ri_BuildQueryKeyFull is not preceded by ri_DetermineMatchType • Leads to a potential crash
Static Specification Mining • To generate preconditions for a procedure • Generate predicates at each call-site of the procedure • Ideally common predicates across all the call-sites form the preconditions for the procedure • How to find common predicates? • Use mining techniques • Construct patterns built from alignments or permutations of predicate sets • Approximation: Patterns appearing in programs denote preconditions
Approach • Analyze control-flow graph • Build precedence relation (a <- b): • A binary relation between procedures a and b • A call to b is always preceded by call to a • Necessitates an inter-procedural analysis • Relations can cross procedure boundaries • Convergence requires fixpoint calculation • Procedure signatures • Frequent subsequence mining • Mine the chains formed by precedence relations
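The precedence relation (a <- b) on this slide can be illustrated with a small must-precede data-flow computation. The following is a minimal intra-procedural sketch, not the authors' path-aware, inter-procedural algorithm. The names CfgNode, must_precede, and always_preceded_by are hypothetical, and procedures are assumed to be numbered 0..31 so that sets fit in one 32-bit mask.

```c
#include <stdint.h>

#define MAX_NODES 32

/* Hypothetical CFG node: at most one call per node, up to 4 predecessors. */
typedef struct {
    int call;      /* id (0..31) of the procedure called here, or -1 for none */
    int npreds;    /* number of predecessor nodes */
    int preds[4];  /* indices of predecessor nodes */
} CfgNode;

/* For every node i, compute in[i]: the set of calls made on ALL paths from
 * the entry node (node 0) to i, excluding the call at i itself.
 * Iterates to a fixpoint; sets only shrink, so the loop terminates. */
void must_precede(const CfgNode *g, int n, uint32_t *in)
{
    uint32_t out[MAX_NODES];
    if (n <= 0 || n > MAX_NODES)
        return;
    for (int i = 0; i < n; i++) {
        in[i] = UINT32_MAX;   /* optimistic start: "everything precedes" */
        out[i] = UINT32_MAX;
    }
    in[0] = 0;                /* nothing precedes the entry node */
    out[0] = g[0].call >= 0 ? (1u << g[0].call) : 0;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 1; i < n; i++) {
            uint32_t meet = UINT32_MAX;
            for (int k = 0; k < g[i].npreds; k++)
                meet &= out[g[i].preds[k]];   /* intersect over predecessors */
            uint32_t gen = g[i].call >= 0 ? (1u << g[i].call) : 0;
            if (meet != in[i] || (meet | gen) != out[i]) {
                in[i] = meet;
                out[i] = meet | gen;
                changed = 1;
            }
        }
    }
}

/* Returns 1 if every node calling b has a in its must-precede set (a <- b).
 * Vacuously returns 1 when b is never called. */
int always_preceded_by(const CfgNode *g, int n, const uint32_t *in, int a, int b)
{
    for (int i = 0; i < n; i++)
        if (g[i].call == b && !(in[i] & (1u << a)))
            return 0;
    return 1;
}
```

Modeling RI_FKey_check with one node per call and a branch that reaches ri_BuildQueryKeyFull both with and without a prior ri_DetermineMatchType, this sketch reports that heap_open always precedes ri_BuildQueryKeyFull while ri_DetermineMatchType does not.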
Path Exploration • Path-Sensitive Exploration: q <- p, q <- r <- p • Path-Insensitive Exploration: q, r <- p • Path-Aware Exploration: q <- p • [Figure: control-flow graph over nodes q, r, and p]
Precedes relation • [Figure: two control-flow graphs over nodes q, r, t, p, and exit; in each, q <- p]
Inter-procedural Analysis h() { if(cond) lwrap(); else lwrap(); … uwrap(); } lwrap () { init(); } uwrap () { access(); }
Procedure Signatures • [Figure: control-flow graph of procedure s from entry to ret, with nodes u, q, t, r, and p] • Precedence chains: s <- t and s <- q <- p <- t • Procedure signature for s: q <- p
Mining sequences • Sequence mining: • Input: set of sequences (I) • Output: sequences that occur ‘frequently’ as subsequences in I • Use the Apriori-all algorithm [Agrawal and Srikant, Mining Sequential Patterns, ICDE ’95]
Motivation for sequence mining • Control paths: • a, b, c, e • g, a, d, c, e • a, c, e • a, c, d, e, f • e, f, d, a (Faulty path, no call to a and c before e) • Invariant: a <- c <- e • Intersection of these paths • e is preceded by nothing • Use mining to overcome brittleness of path intersection
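The brittleness of path intersection can be checked directly on the five control paths above. This is a minimal sketch under stated assumptions: each call is a single lowercase letter, a path is a C string, and the target call occurs on every path. The function names preceding_set and intersect_preceding are hypothetical.

```c
#include <stdint.h>

/* Set (as a bitmask, bit 0 = 'a') of calls seen before the first
 * occurrence of target on the path. Assumes lowercase-letter calls. */
uint32_t preceding_set(const char *path, char target)
{
    uint32_t seen = 0;
    for (; *path && *path != target; path++)
        seen |= 1u << (*path - 'a');
    return seen;
}

/* Intersection of the preceding sets over the first n paths. */
uint32_t intersect_preceding(const char *const *paths, int n, char target)
{
    uint32_t common = UINT32_MAX;
    for (int i = 0; i < n; i++)
        common &= preceding_set(paths[i], target);
    return common;
}
```

On the four correct paths the intersection for e is {a, c}. Adding the single faulty path e, f, d, a empties it, which is why a frequency threshold is used in place of strict intersection.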
Sequence Mining - Example • Input sequences (Min Frequency: 4/5): • a, b, c, e • g, a, d, c, e • a, c, e • a, c, d, e, f • e, f, d, a • Maximal frequent subsequence: a, c, e (occurs in 4 of 5 input sequences)
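The support count behind this example can be sketched as follows. This shows only the counting primitive that a sequential-pattern miner relies on, not the Apriori-all candidate generation itself. It assumes each call is a single character and each input sequence is a C string; is_subsequence and support are hypothetical names.

```c
/* Returns 1 if pat occurs in seq as a subsequence
 * (same order, not necessarily contiguous), else 0. */
int is_subsequence(const char *pat, const char *seq)
{
    while (*pat && *seq) {
        if (*pat == *seq)
            pat++;        /* matched one element of the pattern */
        seq++;
    }
    return *pat == '\0';
}

/* Number of input sequences that contain pat as a subsequence. */
int support(const char *pat, const char *const *seqs, int nseqs)
{
    int count = 0;
    for (int i = 0; i < nseqs; i++)
        count += is_subsequence(pat, seqs[i]);
    return count;
}
```

With the five input sequences on this slide, support("ace", …) is 4, which meets the 4/5 threshold, while a candidate such as "ad" is supported by only 2 sequences and is discarded.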
Data-flow preconditions (PLDI 07) • Challenges • Data-flow predicates may be aliased • No anchors for data-flow predicates if (x > 0) f(x); if (y > 0) f(y); x = g(…); h(x); if(x > 0) f(x);
Motivating Example main(…) { for(ai = options.listen_addrs;…) { listen_sock = socket(ai->ai_family,…); if(listen_sock < 0) error(); if(num_listen_socks >= 16) error(); if((ret = getnameinfo(…))) … if(setsockopt(listen_sock,…) == -1) error(); if(bind(listen_sock, ai->ai_addr,…) < 0) … } } • In a call to bind, the first parameter is always assigned the return value of a call to socket and is checked for error
Generate Predicates main(…) { for(ai = options.listen_addrs;…) { listen_sock = socket(ai->ai_family,…); if(listen_sock < 0) error(); if(num_listen_socks >= 16) error(); if((ret = getnameinfo(…))) … if(setsockopt(listen_sock,…) == -1) error(); if(bind(listen_sock, ai->ai_addr,…) < 0) … } } • listen_sock: return(socket), (>=, 0), (param_1, setsockopt), (param_1, bind) • num_listen_socks: (<, 16) • ret: return(getnameinfo)
Another call-site ssh_control_listener(void) { if((control_fd = socket(PF_UNIX,…)) < 0) error(); old_umask = umask(0177); if(bind(control_fd,(struct sockaddr *)&addr,…)) … } • control_fd: return(socket), (>=, 0), (param_1, bind) • old_umask: return(umask)
Structural Similarity Problem • Call-site 1: listen_sock: return(socket), (>=, 0), (param_1, setsockopt), (param_1, bind); num_listen_socks: (<, 16); ret: return(getnameinfo) • Call-site 2: control_fd: return(socket), (>=, 0), (param_1, bind); old_umask: return(umask) • How to group the attribute sets that need to be mined together? • Find maximal matching of attribute sets • NP-hard • Use approximations based on program structures
Approximations • Type • attribute sets divided based on type of variable • Parameter • Supplied as arguments to the same parameter for any given procedure • Result • Variables that are assigned the return values of the same function • …
Example revisited • Call-site 1: listen_sock: return(socket), (>=, 0), (param_1, setsockopt), (param_1, bind); num_listen_socks: (<, 16); ret: return(getnameinfo) • Call-site 2: control_fd: return(socket), (>=, 0), (param_1, bind); old_umask: return(umask) • Variable names are not comparable • Use positional information • Different number of attributes • Interspersed with irrelevant operations
Is intersection robust? • Same limitations as with control-flow preconditions • Adopt frequent itemset mining • Order of events is less critical • Aggregate collection of data-flow facts at call-sites • sockfd: return(socket), (param_1, bind); (>=, 0) is missing! • listen_sock: return(socket), (param_1, bind), (param_1, setsockopt), (>=, 0) • control_fd: return(socket), (param_1, bind), (>=, 0) • Precondition: return(socket), (param_1, bind), (>=, 0)
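The difference between intersecting attribute sets and counting itemset support can be sketched with bitmasks. This is an illustration under assumptions, not the authors' implementation: the attribute constants and the functions itemset_support, intersect_all, and first_deviant are hypothetical, and the sockfd call-site is assumed to be the one that lacks the (>=, 0) check.

```c
/* Hypothetical encoding of the data-flow attributes as bit flags. */
enum {
    RET_SOCKET        = 1,  /* return(socket)        */
    PARAM1_BIND       = 2,  /* (param_1, bind)       */
    PARAM1_SETSOCKOPT = 4,  /* (param_1, setsockopt) */
    CHECK_GE_ZERO     = 8   /* (>=, 0)               */
};

/* Number of call-sites whose attribute set contains every item in items. */
int itemset_support(unsigned items, const unsigned *sites, int nsites)
{
    int count = 0;
    for (int i = 0; i < nsites; i++)
        if ((sites[i] & items) == items)
            count++;
    return count;
}

/* Strict intersection of all call-site attribute sets. */
unsigned intersect_all(const unsigned *sites, int nsites)
{
    unsigned common = ~0u;
    for (int i = 0; i < nsites; i++)
        common &= sites[i];
    return common;
}

/* Index of the first call-site violating the precondition, or -1 if none. */
int first_deviant(unsigned precondition, const unsigned *sites, int nsites)
{
    for (int i = 0; i < nsites; i++)
        if ((sites[i] & precondition) != precondition)
            return i;
    return -1;
}
```

With the three call-sites above, strict intersection drops (>=, 0) because one site omits the check, whereas the full precondition is still supported by 2 of 3 sites and the omitting site can be reported as a deviation.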
Locality main() { fp = init_file(…); fgets(buf, SIZE, fp); } init_file(…) { fp = fopen(…); if(fp != NULL) return fp; exit(-1); } main() { fp = fopen(…); if(fp != NULL) read_file(fp); } read_file(FILE *fp) { … fgets(buf, SIZE, fp); … } • Interprocedural analysis to capture precondition crossing procedure boundaries
Example • [Figure: graph over procedures q, r, s, and t, with nodes and edges annotated by predicate sets drawn from p1 and p2; legend: intraprocedural edge, interprocedural edge]
Experiments • Applied on open source C programs • Input to the implementation: control flow graphs • Control flow nodes varied from 16K to 958K • Roughly 2M LoC • Procedure count varied from 298 to 8568 • Precondition predicates varied from 189 to 5963 • Analysis time varied from 26s to 20m
Experimental Goals • Path awareness improves precision • Useful for bug detection • Generates salient documentation
Effectiveness of path awareness • Fewer protocols generated using our approach • Reduction not at the expense of increase in false negatives • Reduces false positives
Bug Detection: OpenSSH • Procedure prime_test in openssh-4.4p1 • Testing difficult as it performs Miller-Rabin primality testing • Program crashes due to the absence of an error check • e.g., BN_mod_word(p, …), if p is null, program crashes • Fixed in openssh-4.5p1 • Error check not always necessary • e.g., BN_is_prime(…, ctx,…), ctx can either be null or pre-allocated
Bug detection • Case Study: Linux • Hardware Bug • Difficult to detect using traditional testing techniques • Platform dependent error • Transparently identified using our approach • Performance Bug • Cache lookup operation was absent • Not easily specified as a bug for testing • Deviation delays data write flushes • Difficult to identify using traditional testing techniques
Change in Confidence • Increase in confidence reduces the number of predicates
Related Work • Static techniques • Inferring Specifications from Within, Kremenek et al, OSDI 06 • Bugs as deviant behavior, Engler et al, SOSP 01 • … • Dynamic techniques • Strauss, Ammons et al, POPL 02 • Daikon, Ernst et al, TSE 01 • … • Our approach • Path-aware analysis • Generates preconditions • Predicates of arbitrary size • Annotation free
Future Work • Richer specifications • Post-conditions, divergence structures, … • More sophisticated mining techniques • Graph mining, … • Validating generated specifications • Integration with theorem prover • Specifications and concurrency • Atomicity violations
Other work • Dynamic analysis • Detecting cause of assertion failures (under review) • Static path profiles (under review) • Impact analysis – ASE 06 • Memory aliasing – FASE 06 • Test case prioritization – SAC 08 • Distributed Systems • Randomized leader election (Distributed Computing 07) • Eliminating duplicates in P2P systems (TPDS 07) • Search in P2P systems (P2P 05) • Efficient tag detection in RFID systems (SECON 05)
Why not mine post-conditions? fp = fopen(…); if(fp == NULL) exit(-1); fclose(…); • Precedence protocol: • A call to fclose is always preceded by a call to fopen • Successor protocol: • A call to fopen is always succeeded by a call to fclose
Why is parameter tracing insufficient? uldap_connection_find (…) { /* code fragment from httpd */ if (APR_SUCCESS == apr_thread_mutex_trylock(l->lock)) { … compare_client_certs(st->client_certs, l->client_certs) … } • In a call to compare_client_certs, the return value of a call to apr_thread_mutex_trylock must be APR_SUCCESS. • Predicate for compare_client_certs includes • "return value of apr_thread_mutex_trylock(…) is APR_SUCCESS"
Predicate size distribution • Majority of predicates less than 3