1 / 51

Incorporating Domain-Specific Information into the Compilation Process

Explore the disparity between compiler's and programmer's views of software, errors in code compilation, and the need for a library-level compiler to enhance efficiency and correctness. Learn about Broadway Compiler and domain-specific optimizations.

ebock
Download Presentation

Incorporating Domain-Specific Information into the Compilation Process

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Incorporating Domain-Specific Information into the Compilation Process Samuel Z. Guyer Supervisor: Calvin Lin April 14, 2003

  2. Motivation Two different views of software: • Compiler’s view • Abstractions: numbers, pointers, loops • Operators: +, -, *, ->, [] • Programmer’s view • Abstractions: files, matrices, locks, graphics • Operators: read, factor, lock, draw This discrepancy is a problem...

  3. Find the error – part 1 • Example: • Error: case outside of switch statement • Part of the language definition • Error reported at compile time • Compiler indicates the location and nature of error switch (var_83) { case 0: func_24(); break; case 1: func_29(); break; } case 2: func_78(); !

  4. Find the error – part 2 • Example: • Improper call to libfunc_38 • Syntax is correct – no compiler message • Fails at run-time • Problem: what does libfunc_38 do? This is how compilers view reusables struct __sue_23 * var_72; char var_81[100]; var_72 = libfunc_84(__str_14, __str_65); libfunc_44(var_72); libfunc_38(var_81, 100, 1, var_72); !

  5. Find the error – part 3 • Example: • Improper call to fread() after fclose() • The names reveal the mistake No traditional compiler reports this error • Run-time system: how does the code fail? • Code review: rarely this easy to spot FILE * my_file; char buffer[100]; my_file = fopen(“my_data”, “r”); fclose(my_file); fread(buffer, 100, 1, my_file); !

  6. Problem • Compilers are unaware of library semantics • Library calls have no special meaning • The compiler cannot provide any assistance • Burden is on the programmer: • Use library routines correctly • Use library routines efficiently and effectively These are difficult manual tasks • Tedious and error-prone • Can require considerable expertise

  7. Solution • A library-level compiler • Compiler support for software libraries • Treat library routines more like built-in operators • Compile at the library interface level • Check programs for library-level errors • Improve performance with library-level optimizations Key: Libraries represent domains • Capture domain-specific semantics and expertise • Encode in a form that the compiler can use

  8. Application Broadway Source code Error reports Analyzer Library-specific messages Library Annotations Header files Source code Optimizer Application+Library Integrated source code The Broadway Compiler • Broadway – source-to-source C compiler Domain-independent compiler mechanisms • Annotations – lightweight specification language Domain-specific analyses and transformations Many libraries, one compiler

  9. Benefits • Improves capabilities of the compiler • Adds many new error checks and optimizations • Qualitatively different • Works with existing systems • Domain-specific compilation without recoding • For us: more thorough and convincing validation • Improve productivity • Less time spent on manual tasks • All users benefit from one set of annotations

  10. Outline • Motivation • The Broadway Compiler • Recent work on scalable program analysis • Problem: Error checking demands powerful analysis • Solution: Client-driven analysis algorithm • Example: Detecting security vulnerabilities • Contributions • Related work • Conclusions and future work

  11. Security vulnerabilities • How does remote hacking work? • Most are not direct attacks (e.g., cracking passwords) • Idea: trick a program into unintended behavior • Automated vulnerability detection: • How do we define “intended”? • Difficult to formalize and check application logic Libraries control all critical system services • Communication, file access, process control • Analyze routines to approximate vulnerability

  12. Remote access vulnerability • Example: • Vulnerability: executes any remote command • What if this program runs as root? • Clearly domain-specific: sockets, processes, etc. • Requirement: • Why is detecting this vulnerability hard? int sock; char buffer[100]; sock = socket(AF_INET, SOCK_STREAM, 0); read(sock, buffer, 100); execl(buffer); ! Data from an Internet socket should not specify a program to execute

  13. Challenge 1: Pointers • Example: • Still contains a vulnerability • Only one buffer • Variables buffer and ref are aliases We need an accurate model of memory int sock; char buffer[100]; char * ref = buffer; sock = socket(AF_INET, SOCK_STREAM, 0); read(sock, buffer, 100); execl(ref); !

  14. main Challenge 2: Scope • Call graph: • Objects flow throughout program • No scoping constraints • Objects referenced through pointers We need whole-program analysis ! sock = (AF_INET, SOCK_STREAM, 0); (sock, buffer, 100); (ref); socket read execl

  15. Challenge 3: Precision • Static analysis is always an approximation • Precision: level of detail or sensitivity • Multiple calls to a procedure • Context-sensitive: analyze each call separately • Context-insensitive: merge information from all calls • Multiple assignments to a variable • Flow-sensitive: record each value separately • Flow-insensitive: merge values from all assignments Lower precision reduces the cost of analysis Exponential polynomial ~linear

  16. main stdin ? ? ! execl socket ^ execl read ^ Insufficient precision • Example: Context-insensitivity • Information merged at call • Analyzer reports 2 possible errors • Only 1 real error Imprecision leads to false positives

  17. Cost versus precision • Problem: A tradeoff • Precise analysis prohibitively expensive • Cheap analysis too many false positives • Idea: Mixed precision analysis • Focus effort on the parts of the program that matter • Don’t waste time over-analyzing the rest Key: Let error detection problem drive precision Client-driven program analysis

  18. Precision Policy Error Reports Information Loss Dependence Graph Monitor Adaptor Client-Driven Algorithm • Client: Error detection analysis problem • Algorithm: • Start with fast cheap analysis – monitor imprecision • Determine extra precision – reanalyze Pointer Analyzer Client Analysis Memory Model

  19. ? Algorithm components • Monitor • Runs alongside main analysis • Records imprecision • Adaptor • Start at the locations of reported errors • Trace back to the cause and diagnose

  20. Multiple procedure calls Multiple assignments Conditions if(cond) x = foo( ) foo( ) x = x = x = foo( ) = f( , ) x Pointer dereference Polluted target Polluted pointer ptr or ptr (*ptr) Sources of imprecision Polluting assignments

  21. main ? ? stdin execl socket execl read read read In action... • Monitor analysis • Polluting assignments • Diagnose and apply “fix” • In this case: one procedure context-sensitive • Reanalyze !

  22. Methodology • Compare with commonly-used fixed precision • Metrics • Accuracy – number of errors reported Includes false positives – fewer is better • Performance – only when accuracy is the same

  23. Programs • 18 real C programs • Unmodified source – all the issues of production code • Many are system tools – run in privileged mode • Representative examples:

  24. Error detection problems • File access: • Remote access vulnerabillity: • Format string vulnerability (FSV): • Remote FSV: • FTP behavior: Files must be open when accessed Data from an Internet socket should not specify a program to execute Format string may not contain untrusted data Check if FSV is remotely exploitable Can this program be tricked into reading and transmitting arbitrary files

  25. Full (CS-FS) Medium (CI-FS) Slow (CS-FI) Fast (CI-FI) Client-Driven ? ? ? ? ? ? ? ? ? ? ? ? 0 100X 29 26 15 7 85 0 31 5 0 0 41 1 0 28 89 0 0 7 0 5 Increasing number of CFG nodes 1 41 6 4 7 15 26 0 0 1 18 88 0 0 6 0 26 0 0 7 15 4 18 nn fcron make named SQLite muh (I) pfinger pureftp apache stunnel privoxy muh (II) cfengine wu-ftp (I) wu-ftp (II) blackhole ssh client ssh server 0 7 0 0 29 0 6 0 85 28 2 0 31 4 5 93 0 41 Results Remote access vulnerability 1000X Normalized performance 10X

  26. Overall results • 90 test cases: 18 programs, 5 problems • Test case: 1 program and 1 error detection problem • Compare algorithms: client-driven vs. fixed precision As accurate as any other algorithm: 87 out of 90 Runs faster than best fixed algorithm: 64 out of 87 Performance not an issue: 19 of 23 Both most accurate and fastest: 29 out of 64

  27. Why does it work? • Validates our hypothesis • Different errors have different precision requirements • Amount of extra precision is small

  28. Outline • Motivation • The Broadway Compiler • Recent work on scalable program analysis • Problem: Error checking demands powerful analysis • Solution: Client-driven analysis algorithm • Example: Detecting security vulnerabilities • Contributions • Related work • Conclusions and future work

  29. Central contribution • Library-level compilation • Opportunity: library interfaces make domains explicit in existing programming practice • Key: a separate language for codifying domain-specific knowledge • Result: our compiler can automate previously manual error checks and performance improvements Knowledge representation Applying knowledge Results Old way: Informal Manual Difficult, unpredictable Broadway: Codified Compiler Easy, automatic, reliable

  30. Specific contributions • Broadway compiler implementation • Working system (43K C-Breeze, 23K pointers, 30K Broadway) • Client-driven pointer analysis algorithm [SAS’03] • Precise and scalable whole-program analysis • Library-level error checking experiments [CSTR’01] • No false positives for format string vulnerability • Library-level optimization experiments [LCPC’00] • Solid improvements for PLAPACK programs • Annotation language [DSL’99] • Balance expressive power and ease of use

  31. Related work • Configurable compilers • Power versus usability – who is the user? • Active libraries • Previous work focusing on specific domains • Few complete, working systems • Error detection • Partial program verification – paucity of results • Scalable pointer analysis • Many studies of cost/precision tradeoff • Few mixed-precision approaches

  32. Future work • Language • More analysis capabilities • Optimization • We have only scratched the surface • Error checking • Resource leaks • Path sensitivity – conditional transfer functions • Scalable analysis • Start with cheaper analysis – unification-based • Refine to more expensive analysis – shape analysis

  33. Thank You

  34. Annotations (I) • Dependence and pointer information • Describe pointer structures • Indicate which objects are accessed and modified procedure fopen(pathname, mode) { on_entry { pathname --> path_string mode --> mode_string } access { path_string, mode_string } on_exit { return --> new file_stream } }

  35. Annotations (II) • Library-specific properties Dataflow lattices property State : { Open, Closed} initially Open property Kind : { File, Socket { Local, Remote } } ^ ^ Remote Local Closed Open File Socket ^ ^

  36. Annotations (III) • Library routine effects Dataflow transfer functions procedure socket(domain, type, protocol) { analyze Kind { if (domain == AF_UNIX) IOHandle <- Local if (domain == AF_INET) IOHandle <- Remote } analyze State { IOHandle <- Open } on_exit { return --> new IOHandle } }

  37. Annotations (IV) • Reports and transformations procedure execl(path, args) { on_entry { path --> path_string } reportif (Kind : path_string could-be Remote) “Error at “ ++ $callsite ++ “: remote access”; } procedure slow_routine(first, second) { when (condition) replace-with%{ quick_check($first); fast_routine($first, $second); }% }

  38. Why does it work? • Validates our hypothesis • Different clients have different precision requirements • Amount of extra precision is small

  39. Time

  40. Validation • Optimization experiments • Cons: One library – three applications • Pros: Complex library – consistent results • Error checking experiments • Cons: Quibble about different errors • Pros: We set the standard for experiments • Overall • Same system designed for optimizations is among the best for detecting errors and security vulnerabilities

  41. Flow values Types Transfer functions Inference rules Type Theory • Equivalent to dataflow analysis (heresy?) • Different in practice • Dataflow: flow-sensitive problems, iterative analysis • Types: flow-insensitive problems, constraint solver • Commonality • No magic bullet: same cost for the same precision • Extracting the store model is a primary concern Remember Phil Wadler’s talk?

  42. Generators • Direct support for domain-specific programming • Language extensions or new language • Generate implementation from specification • Our ideas are complementary • Provides a way to analyze component compositions • Unifies common algorithms: • Redundancy elimination • Dependence-based optimizations

  43. Is it correct? Three separate questions: • Are Sam Guyer’s experiments correct? • Yes, to the best of our knowledge • Checked PLAPACK results • Checked detected errors against known errors • Is our compiler implemented correctly? • Flip answer: who’s is? • Better answer: testing suites • How do we validate a set of annotations?

  44. Annotation correctness Not addressed in my dissertation, but... • Theoretical approach • Does the library implement the domain? • Formally verify annotations against implementation • Practical approach • Annotation debugger: interactive • Automated assistance in early stages of development • Middle approach • Basic consistency checks

  45. Optimistic False positives allowed It can even be unsound Tend to be “may” analyses Correctness is absolute “Black and white” Certify programs bug-free Cost tolerant Explore costly analysis Pessimistic Must preserve semantics Soundness mandatory Tend to be “must” analyses Performance is relative Spectrum of results No guarantees Cost sensitive Compile-time is a factor Error Checking vs Optimization

  46. Complexity • Pointer analysis • Address taken: linear • Steensgaard: almost linear (log log n factor) • Anderson: polynomial (cubic) • Shape analysis: double exponential • Dataflow analysis • Intraprocedural: polynomial (height of lattice) • Context-sensitivity: exponential (call graph) • Rarely see worst-case

  47. Optimization • Overall strategy • Exploit layers and modularity • Customize lower-level layers in context • Compiler strategy: Top-down layer processing • Preserve high-level semantics as long as possible • Systematically dissolve layer boundaries • Annotation strategy • General-purpose specialization • Idiomatic code substitutions

  48. Processor grid PLA_Gemm( , , );  PLA_Local_gemm PLA_Gemm( , , );  PLA_Rankk PLAPACK Optimizations • PLAPACK matrices are distributed • Optimizations exploit special cases • Example: Matrix multiply

  49. Results

  50. Find the error – part 3 • State-of-the-art compiler struct __sue_23 * var_72; struct __sue_25 * new_f = (struct __sue_25 *) malloc(sizeof (struct __sue_25)); _IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0)); (& new_f->fp)->vtable = & _IO_file_jumps; _IO_file_init(& new_f->fp); if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32) != ((void *) 0)) {   var_72 = & new_f->fp.file;   if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {     if (var_72->_mode <= 0) ((struct __sue_23 *) var_72)->vtable = & _IO_file_jumps_maybe_mmap;     else ((struct __sue_23 *) var_72)->vtable = & _IO_wfile_jumps_maybe_mmap;     var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap;   } } if (var_72->_flags & 8192U) _IO_un_link((struct __sue_23 *) var_72); if (var_72->_flags & 8192U) status = _IO_file_close_it(var_72);   else status = var_72->_flags & 32U ? - 1 : 0; ((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable) +                              (var_72)->_vtable_offset))->__finish)(var_72, 0); if (var_72->_mode <= 0)   if (((var_72)->_IO_save_base != ((void *) 0))) _IO_free_backup_area(var_72); if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_)) && var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_)) &&     var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) { var_72->_flags = 0;   free(var_72); } bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);

More Related