510 likes | 523 Views
This paper explores the problem of incorporating domain-specific information into the compilation process for software development, including error checking and optimization. It introduces the Broadway Compiler as a solution, which supports software libraries and improves productivity through annotations.
E N D
Incorporating Domain-Specific Information into the Compilation Process Samuel Z. Guyer Supervisor: Calvin Lin April 14, 2003
Motivation Two different views of software: • Compiler’s view • Abstractions: numbers, pointers, loops • Operators: +, -, *, ->, [] • Programmer’s view • Abstractions: files, matrices, locks, graphics • Operators: read, factor, lock, draw This discrepancy is a problem...
Find the error – part 1 • Example: • Error: case outside of switch statement • Part of the language definition • Error reported at compile time • Compiler indicates the location and nature of error switch (var_83) { case 0: func_24(); break; case 1: func_29(); break; } case 2: func_78(); !
Find the error – part 2 • Example: • Improper call to libfunc_38 • Syntax is correct – no compiler message • Fails at run-time • Problem: what does libfunc_38 do? This is how compilers view reusables struct __sue_23 * var_72; char var_81[100]; var_72 = libfunc_84(__str_14, __str_65); libfunc_44(var_72); libfunc_38(var_81, 100, 1, var_72); !
Find the error – part 3 • Example: • Improper call to fread() after fclose() • The names reveal the mistake No traditional compiler reports this error • Run-time system: how does the code fail? • Code review: rarely this easy to spot FILE * my_file; char buffer[100]; my_file = fopen(“my_data”, “r”); fclose(my_file); fread(buffer, 100, 1, my_file); !
Problem • Compilers are unaware of library semantics • Library calls have no special meaning • The compiler cannot provide any assistance • Burden is on the programmer: • Use library routines correctly • Use library routines efficiently and effectively These are difficult manual tasks • Tedious and error-prone • Can require considerable expertise
Solution • A library-level compiler • Compiler support for software libraries • Treat library routines more like built-in operators • Compile at the library interface level • Check programs for library-level errors • Improve performance with library-level optimizations Key: Libraries represent domains • Capture domain-specific semantics and expertise • Encode in a form that the compiler can use
Application Broadway Source code Error reports Analyzer Library-specific messages Library Annotations Header files Source code Optimizer Application+Library Integrated source code The Broadway Compiler • Broadway – source-to-source C compiler Domain-independent compiler mechanisms • Annotations – lightweight specification language Domain-specific analyses and transformations Many libraries, one compiler
Benefits • Improves capabilities of the compiler • Adds many new error checks and optimizations • Qualitatively different • Works with existing systems • Domain-specific compilation without recoding • For us: more thorough and convincing validation • Improve productivity • Less time spent on manual tasks • All users benefit from one set of annotations
Outline • Motivation • The Broadway Compiler • Recent work on scalable program analysis • Problem: Error checking demands powerful analysis • Solution: Client-driven analysis algorithm • Example: Detecting security vulnerabilities • Contributions • Related work • Conclusions and future work
Security vulnerabilities • How does remote hacking work? • Most are not direct attacks (e.g., cracking passwords) • Idea: trick a program into unintended behavior • Automated vulnerability detection: • How do we define “intended”? • Difficult to formalize and check application logic Libraries control all critical system services • Communication, file access, process control • Analyze routines to approximate vulnerability
Remote access vulnerability • Example: • Vulnerability: executes any remote command • What if this program runs as root? • Clearly domain-specific: sockets, processes, etc. • Requirement: • Why is detecting this vulnerability hard? int sock; char buffer[100]; sock = socket(AF_INET, SOCK_STREAM, 0); read(sock, buffer, 100); execl(buffer); ! Data from an Internet socket should not specify a program to execute
Challenge 1: Pointers • Example: • Still contains a vulnerability • Only one buffer • Variables buffer and ref are aliases We need an accurate model of memory int sock; char buffer[100]; char * ref = buffer; sock = socket(AF_INET, SOCK_STREAM, 0); read(sock, buffer, 100); execl(ref); !
main Challenge 2: Scope • Call graph: • Objects flow throughout program • No scoping constraints • Objects referenced through pointers We need whole-program analysis ! sock = (AF_INET, SOCK_STREAM, 0); (sock, buffer, 100); (ref); socket read execl
Challenge 3: Precision • Static analysis is always an approximation • Precision: level of detail or sensitivity • Multiple calls to a procedure • Context-sensitive: analyze each call separately • Context-insensitive: merge information from all calls • Multiple assignments to a variable • Flow-sensitive: record each value separately • Flow-insensitive: merge values from all assignments Lower precision reduces the cost of analysis Exponential polynomial ~linear
main stdin ? ? ! execl socket ^ execl read ^ Insufficient precision • Example: Context-insensitivity • Information merged at call • Analyzer reports 2 possible errors • Only 1 real error Imprecision leads to false positives
Cost versus precision • Problem: A tradeoff • Precise analysis prohibitively expensive • Cheap analysis too many false positives • Idea: Mixed precision analysis • Focus effort on the parts of the program that matter • Don’t waste time over-analyzing the rest Key: Let error detection problem drive precision Client-driven program analysis
Precision Policy Error Reports Information Loss Dependence Graph Monitor Adaptor Client-Driven Algorithm • Client: Error detection analysis problem • Algorithm: • Start with fast cheap analysis – monitor imprecision • Determine extra precision – reanalyze Pointer Analyzer Client Analysis Memory Model
? Algorithm components • Monitor • Runs alongside main analysis • Records imprecision • Adaptor • Start at the locations of reported errors • Trace back to the cause and diagnose
Multiple procedure calls Multiple assignments Conditions if(cond) x = foo( ) foo( ) x = x = x = foo( ) = f( , ) x Pointer dereference Polluted target Polluted pointer ptr or ptr (*ptr) Sources of imprecision Polluting assignments
main ? ? stdin execl socket execl read read read In action... • Monitor analysis • Polluting assignments • Diagnose and apply “fix” • In this case: one procedure context-sensitive • Reanalyze !
Methodology • Compare with commonly-used fixed precision • Metrics • Accuracy – number of errors reported Includes false positives – fewer is better • Performance – only when accuracy is the same
Programs • 18 real C programs • Unmodified source – all the issues of production code • Many are system tools – run in privileged mode • Representative examples:
Error detection problems • File access: • Remote access vulnerabillity: • Format string vulnerability (FSV): • Remote FSV: • FTP behavior: Files must be open when accessed Data from an Internet socket should not specify a program to execute Format string may not contain untrusted data Check if FSV is remotely exploitable Can this program be tricked into reading and transmitting arbitrary files
Full (CS-FS) Medium (CI-FS) Slow (CS-FI) Fast (CI-FI) Client-Driven ? ? ? ? ? ? ? ? ? ? ? ? 0 100X 29 26 15 7 85 0 31 5 0 0 41 1 0 28 89 0 0 7 0 5 Increasing number of CFG nodes 1 41 6 4 7 15 26 0 0 1 18 88 0 0 6 0 26 0 0 7 15 4 18 nn fcron make named SQLite muh (I) pfinger pureftp apache stunnel privoxy muh (II) cfengine wu-ftp (I) wu-ftp (II) blackhole ssh client ssh server 0 7 0 0 29 0 6 0 85 28 2 0 31 4 5 93 0 41 Results Remote access vulnerability 1000X Normalized performance 10X
Overall results • 90 test cases: 18 programs, 5 problems • Test case: 1 program and 1 error detection problem • Compare algorithms: client-driven vs. fixed precision As accurate as any other algorithm: 87 out of 90 Runs faster than best fixed algorithm: 64 out of 87 Performance not an issue: 19 of 23 Both most accurate and fastest: 29 out of 64
Why does it work? • Validates our hypothesis • Different errors have different precision requirements • Amount of extra precision is small
Outline • Motivation • The Broadway Compiler • Recent work on scalable program analysis • Problem: Error checking demands powerful analysis • Solution: Client-driven analysis algorithm • Example: Detecting security vulnerabilities • Contributions • Related work • Conclusions and future work
Central contribution • Library-level compilation • Opportunity: library interfaces make domains explicit in existing programming practice • Key: a separate language for codifying domain-specific knowledge • Result: our compiler can automate previously manual error checks and performance improvements Knowledge representation Applying knowledge Results Old way: Informal Manual Difficult, unpredictable Broadway: Codified Compiler Easy, automatic, reliable
Specific contributions • Broadway compiler implementation • Working system (43K C-Breeze, 23K pointers, 30K Broadway) • Client-driven pointer analysis algorithm [SAS’03] • Precise and scalable whole-program analysis • Library-level error checking experiments [CSTR’01] • No false positives for format string vulnerability • Library-level optimization experiments [LCPC’00] • Solid improvements for PLAPACK programs • Annotation language [DSL’99] • Balance expressive power and ease of use
Related work • Configurable compilers • Power versus usability – who is the user? • Active libraries • Previous work focusing on specific domains • Few complete, working systems • Error detection • Partial program verification – paucity of results • Scalable pointer analysis • Many studies of cost/precision tradeoff • Few mixed-precision approaches
Future work • Language • More analysis capabilities • Optimization • We have only scratched the surface • Error checking • Resource leaks • Path sensitivity – conditional transfer functions • Scalable analysis • Start with cheaper analysis – unification-based • Refine to more expensive analysis – shape analysis
Annotations (I) • Dependence and pointer information • Describe pointer structures • Indicate which objects are accessed and modified procedure fopen(pathname, mode) { on_entry { pathname --> path_string mode --> mode_string } access { path_string, mode_string } on_exit { return --> new file_stream } }
Annotations (II) • Library-specific properties Dataflow lattices property State : { Open, Closed} initially Open property Kind : { File, Socket { Local, Remote } } ^ ^ Remote Local Closed Open File Socket ^ ^
Annotations (III) • Library routine effects Dataflow transfer functions procedure socket(domain, type, protocol) { analyze Kind { if (domain == AF_UNIX) IOHandle <- Local if (domain == AF_INET) IOHandle <- Remote } analyze State { IOHandle <- Open } on_exit { return --> new IOHandle } }
Annotations (IV) • Reports and transformations procedure execl(path, args) { on_entry { path --> path_string } reportif (Kind : path_string could-be Remote) “Error at “ ++ $callsite ++ “: remote access”; } procedure slow_routine(first, second) { when (condition) replace-with%{ quick_check($first); fast_routine($first, $second); }% }
Why does it work? • Validates our hypothesis • Different clients have different precision requirements • Amount of extra precision is small
Validation • Optimization experiments • Cons: One library – three applications • Pros: Complex library – consistent results • Error checking experiments • Cons: Quibble about different errors • Pros: We set the standard for experiments • Overall • Same system designed for optimizations is among the best for detecting errors and security vulnerabilities
Flow values Types Transfer functions Inference rules Type Theory • Equivalent to dataflow analysis (heresy?) • Different in practice • Dataflow: flow-sensitive problems, iterative analysis • Types: flow-insensitive problems, constraint solver • Commonality • No magic bullet: same cost for the same precision • Extracting the store model is a primary concern Remember Phil Wadler’s talk?
Generators • Direct support for domain-specific programming • Language extensions or new language • Generate implementation from specification • Our ideas are complementary • Provides a way to analyze component compositions • Unifies common algorithms: • Redundancy elimination • Dependence-based optimizations
Is it correct? Three separate questions: • Are Sam Guyer’s experiments correct? • Yes, to the best of our knowledge • Checked PLAPACK results • Checked detected errors against known errors • Is our compiler implemented correctly? • Flip answer: who’s is? • Better answer: testing suites • How do we validate a set of annotations?
Annotation correctness Not addressed in my dissertation, but... • Theoretical approach • Does the library implement the domain? • Formally verify annotations against implementation • Practical approach • Annotation debugger: interactive • Automated assistance in early stages of development • Middle approach • Basic consistency checks
Optimistic False positives allowed It can even be unsound Tend to be “may” analyses Correctness is absolute “Black and white” Certify programs bug-free Cost tolerant Explore costly analysis Pessimistic Must preserve semantics Soundness mandatory Tend to be “must” analyses Performance is relative Spectrum of results No guarantees Cost sensitive Compile-time is a factor Error Checking vs Optimization
Complexity • Pointer analysis • Address taken: linear • Steensgaard: almost linear (log log n factor) • Anderson: polynomial (cubic) • Shape analysis: double exponential • Dataflow analysis • Intraprocedural: polynomial (height of lattice) • Context-sensitivity: exponential (call graph) • Rarely see worst-case
Optimization • Overall strategy • Exploit layers and modularity • Customize lower-level layers in context • Compiler strategy: Top-down layer processing • Preserve high-level semantics as long as possible • Systematically dissolve layer boundaries • Annotation strategy • General-purpose specialization • Idiomatic code substitutions
Processor grid PLA_Gemm( , , ); PLA_Local_gemm PLA_Gemm( , , ); PLA_Rankk PLAPACK Optimizations • PLAPACK matrices are distributed • Optimizations exploit special cases • Example: Matrix multiply
Find the error – part 3 • State-of-the-art compiler struct __sue_23 * var_72; struct __sue_25 * new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25)); _IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0)); (& new_f->fp)->vtable = & _IO_file_jumps; _IO_file_init(& new_f->fp); if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) { var_72 = & new_f->fp.file; if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) { if (var_72->_mode <= 0) ((struct __sue_23 *) var_72)->vtable = & _IO_file_jumps_maybe_mmap; else ((struct __sue_23 *) var_72)->vtable = & _IO_wfile_jumps_maybe_mmap; var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap; } } if (var_72->_flags & 8192U) _IO_un_link((struct __sue_23 *) var_72); if (var_72->_flags & 8192U) status = _IO_file_close_it(var_72); else status = var_72->_flags & 32U ? - 1 : 0; ((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable) + (var_72)->_vtable_offset))->__finish)(var_72, 0); if (var_72->_mode <= 0) if (((var_72)->_IO_save_base != ((void *) 0))) _IO_free_backup_area(var_72); if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_)) && var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_)) && var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) { var_72->_flags = 0; free(var_72); } bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);