An Evaluation of Automata Algorithms for String Analysis. Pieter Hooimeijer University of Virginia Margus Veanes Microsoft Research. VMCAI 2011. TL;DR. We evaluate several existing approaches for explicit-state string constraint solving, fixing external factors as much as possible.
Outline • Motivation • string constraint solvers • string-related programming idioms • This Paper • benchmark and study design • results
Motivation Reasoning about strings is difficult: • for programmers • for automated tools
Constraint Solvers Hampi Kaluza Rex ✔
    String a; // ...
    R = Regex("^ab$");
    R.IsMatch(a) = true;
Constraint Solvers Hampi Kaluza Rex
    String a; // ...
    R = Regex("^ab$");
    if (R.IsMatch(a)) {
        // ...
    }

    String a; // ...
    R = Regex("^ab$");
    R.IsMatch(a) = true;
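To make the constraint `R.IsMatch(a) = true` concrete: a string constraint solver is asked to produce a value for `a` that satisfies it. The sketch below is a naive brute-force search in Python, not how Hampi, Kaluza, or Rex work internally; the function name `find_witness`, the two-letter alphabet, and the length bound are all assumptions made for illustration.

```python
import re
from itertools import product


def find_witness(pattern, alphabet="ab", max_len=4):
    """Return the shortest string over `alphabet` that matches `pattern`.

    Hypothetical helper: enumerates candidates by increasing length and
    returns None if nothing up to `max_len` characters matches.
    """
    regex = re.compile(pattern)
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            # search() honors the ^ and $ anchors written in the pattern.
            if regex.search(candidate):
                return candidate
    return None


witness = find_witness("^ab$")
```

Automata-based solvers avoid this enumeration by converting the regex to an automaton and searching its state graph instead.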
what (not) to model
Example 1
    char *sp = (char *) strchr(cmd, ' ');
    char *slash;
    while (sp && (slash = (char *) strchr(cmd, '/')) && (slash < sp)) {
        cmd = slash + 1;
    }
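The pointer loop in Example 1 is easier to follow when restated over indices. The following Python translation is an illustrative reading of the C snippet, with the hypothetical name `strip_command_path`; it assumes `cmd` is an ordinary string and returns the suffix that the C code would leave `cmd` pointing at.

```python
def strip_command_path(cmd):
    """Drop leading path components that occur before the first space."""
    sp = cmd.find(" ")          # mirrors strchr(cmd, ' '), computed once
    if sp == -1:
        return cmd              # no space: the C loop body never runs
    start = 0                   # plays the role of the moving `cmd` pointer
    while True:
        slash = cmd.find("/", start)
        if slash == -1 or slash >= sp:
            break               # no slash left before the first space
        start = slash + 1       # mirrors cmd = slash + 1
    return cmd[start:]


result = strip_command_path("/usr/bin/ls -l /tmp")
```

Even this short idiom mixes search, pointer comparison, and mutation, which is the kind of code a string analysis has to decide whether to model.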
Example 2 How hard is regex matching in Perl?
Example 2 perl -wle 'print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/' http://montreal.pm.org/tech/neil_kandalgaonkar.shtml
Example 2 • Anchors • Non-eager matching • Backreferences /^1?$|^(11+?)\1+$/
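The three features listed above can be seen at work by running the same pattern in Python, whose `re` module also supports anchors, non-greedy quantifiers, and backreferences. This is a sketch for illustration; the name `is_prime` is an assumption, and the approach is only practical for small `n`.

```python
import re

# Same pattern as the Perl one-liner: the first alternative matches "" and "1";
# the second matches any unary string that splits into two or more equal
# blocks of length >= 2, i.e. a composite number written in unary.
NOT_PRIME = re.compile(r"^1?$|^(11+?)\1+$")


def is_prime(n):
    """Return True if n is prime, by matching n written as n copies of '1'."""
    return NOT_PRIME.search("1" * n) is None


small_primes = [n for n in range(20) if is_prime(n)]
```

Because of the backreference `\1`, the language matched here is not regular, so no finite automaton can represent it exactly.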
Motivation • Existing work provides tool-to-tool performance comparisons • Confounds: Performance gains may be due to external factors
The Framework • Based on Rex • Fixes external factors: • front-end parser • regex-to-automaton conversion • implementation language • search tree
Character Sets • BDD: binary decision diagrams • Pred: symbolic bitvector ranges in DNF • Range: concrete set of character ranges • Hash: concrete set of individual characters
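As a rough illustration of the two concrete representations, the sketch below intersects character sets stored as sorted inclusive ranges (in the spirit of Range) and compares the result with plain sets of individual characters (in the spirit of Hash). The function names and data layout are assumptions for this example, not the framework's actual data structures.

```python
def intersect_ranges(a, b):
    """Intersect two sorted lists of disjoint, inclusive (lo, hi) code-point ranges."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo <= hi:
            result.append((lo, hi))   # the two current ranges overlap
        # Advance whichever range ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def ranges_to_set(ranges):
    """Expand ranges into a set of individual characters."""
    return {chr(c) for lo, hi in ranges for c in range(lo, hi + 1)}


lower = [(ord("a"), ord("z"))]
hex_digits = [(ord("0"), ord("9")), (ord("A"), ord("F")), (ord("a"), ord("f"))]

range_result = intersect_ranges(lower, hex_digits)
hash_result = ranges_to_set(lower) & ranges_to_set(hex_digits)
```

The range form stays small for wide alphabets such as Unicode, while the set form grows with the number of characters it contains.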
Study Design [figure: experiment grid. Columns: Lazy, Eager. Rows: Task 1 (55x), Task 2 (100x). Each cell: Unicode, ASCII.]
Regular Difference [figure: results grid. Columns: Lazy, Eager. Rows: Task 1 (55x), Task 2 (100x). Each cell: Unicode, ASCII.]
Regular Intersection [figure: Eager and Lazy panels for ASCII and Unicode. Series: BDD, Pred, Range, Hash.]
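One common way to read "lazy" intersection is as an on-the-fly product construction: explore pairs of states only as they become reachable, and stop at the first pair that is accepting in both automata. The Python sketch below shows that idea for two small DFAs. The dictionary layout and the name `lazy_intersection_witness` are assumptions for illustration, not the algorithms measured in the study.

```python
from collections import deque


def lazy_intersection_witness(dfa1, dfa2, alphabet):
    """Return a shortest string accepted by both DFAs, or None if there is none.

    Each DFA is a dict with keys "start", "accepting" (a set), and "delta"
    (a dict mapping (state, char) to the next state; missing entries reject).
    """
    start = (dfa1["start"], dfa2["start"])
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        (q1, q2), word = queue.popleft()
        if q1 in dfa1["accepting"] and q2 in dfa2["accepting"]:
            return word                      # stop early: witness found
        for ch in alphabet:
            n1 = dfa1["delta"].get((q1, ch))
            n2 = dfa2["delta"].get((q2, ch))
            if n1 is None or n2 is None:
                continue                     # one automaton has no move
            pair = (n1, n2)
            if pair not in seen:
                seen.add(pair)
                queue.append((pair, word + ch))
    return None


# Strings over {a, b} that start with 'a'.
starts_with_a = {"start": 0, "accepting": {1},
                 "delta": {(0, "a"): 1, (1, "a"): 1, (1, "b"): 1}}
# Strings over {a, b} that end with 'b'.
ends_with_b = {"start": 0, "accepting": {1},
               "delta": {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}}
# Non-empty strings consisting only of 'a'.
only_a = {"start": 0, "accepting": {1},
          "delta": {(0, "a"): 1, (1, "a"): 1}}

witness = lazy_intersection_witness(starts_with_a, ends_with_b, "ab")
empty = lazy_intersection_witness(only_a, ends_with_b, "ab")
```

An eager variant would instead build the full product automaton before checking emptiness, which can cost more when a witness is found early.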
In Aggregate [figure: aggregate results grid. Columns: Lazy, Eager. Rows: Task 1 (55x), Task 2 (100x). Each cell: Unicode, ASCII.]
Conclusion • For Unicode: BDD-based approach and lazy search are fastest • SMT-based Pred approach outperforms concrete Range version