Incorporating Domain-Specific Information into the Compilation Process
Samuel Z. Guyer
Supervisor: Calvin Lin
April 14, 2003
Motivation Two different views of software: • Compiler’s view • Abstractions: numbers, pointers, loops • Operators: +, -, *, ->, [] • Programmer’s view • Abstractions: files, matrices, locks, graphics • Operators: read, factor, lock, draw This discrepancy is a problem...
Find the error – part 1 • Example (code below) • Error: case outside of switch statement • Part of the language definition • Error reported at compile time • Compiler indicates the location and nature of the error

    switch (var_83) {
        case 0: func_24(); break;
        case 1: func_29(); break;
    }
    case 2: func_78();    /* ! error: case label outside the switch */
Find the error – part 2 • Example (code below) • Improper call to libfunc_38 • Syntax is correct – no compiler message • Fails at run-time • Problem: what does libfunc_38 do? • This is how compilers view reusables

    struct __sue_23 * var_72;
    char var_81[100];
    var_72 = libfunc_84(__str_14, __str_65);
    libfunc_44(var_72);
    libfunc_38(var_81, 100, 1, var_72);    /* ! fails at run time */
Find the error – part 3 • Example (code below) • Improper call to fread() after fclose() • The names reveal the mistake • No traditional compiler reports this error • Run-time system: how does the code fail? • Code review: rarely this easy to spot

    FILE * my_file;
    char buffer[100];
    my_file = fopen("my_data", "r");
    fclose(my_file);
    fread(buffer, 100, 1, my_file);    /* ! read after close */
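For contrast, a minimal corrected version (my sketch, not from the slides): the read must happen while the stream is still open, and the fopen result should be checked.

    #include <stdio.h>

    int main(void) {
        FILE * my_file;
        char buffer[100];
        my_file = fopen("my_data", "r");
        if (my_file != NULL) {
            fread(buffer, 100, 1, my_file);  /* read first, while open */
            fclose(my_file);                 /* then close             */
        }
        return 0;
    }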
Problem • Compilers are unaware of library semantics • Library calls have no special meaning • The compiler cannot provide any assistance • Burden is on the programmer: • Use library routines correctly • Use library routines efficiently and effectively These are difficult manual tasks • Tedious and error-prone • Can require considerable expertise
Solution • A library-level compiler • Compiler support for software libraries • Treat library routines more like built-in operators • Compile at the library interface level • Check programs for library-level errors • Improve performance with library-level optimizations Key: Libraries represent domains • Capture domain-specific semantics and expertise • Encode in a form that the compiler can use
The Broadway Compiler • Broadway – source-to-source C compiler: domain-independent compiler mechanisms • Annotations – lightweight specification language: domain-specific analyses and transformations • Many libraries, one compiler

[Diagram: the application's source code, plus the library's annotations, header files, and source code, feed into Broadway; the Analyzer produces error reports and library-specific messages, and the Optimizer produces integrated application+library source code.]
Benefits • Improves the capabilities of the compiler • Adds many new error checks and optimizations • Qualitatively different • Works with existing systems • Domain-specific compilation without recoding • For us: more thorough and convincing validation • Improves productivity • Less time spent on manual tasks • All users benefit from one set of annotations
Outline • Motivation • The Broadway Compiler • Recent work on scalable program analysis • Problem: Error checking demands powerful analysis • Solution: Client-driven analysis algorithm • Example: Detecting security vulnerabilities • Contributions • Related work • Conclusions and future work
Security vulnerabilities • How does remote hacking work? • Most are not direct attacks (e.g., cracking passwords) • Idea: trick a program into unintended behavior • Automated vulnerability detection: • How do we define “intended”? • Difficult to formalize and check application logic • Libraries control all critical system services • Communication, file access, process control • Analyze uses of these routines to approximate vulnerability detection
Remote access vulnerability • Example (code below) • Vulnerability: executes any remote command • What if this program runs as root? • Clearly domain-specific: sockets, processes, etc. • Requirement: data from an Internet socket should not specify a program to execute • Why is detecting this vulnerability hard?

    int sock;
    char buffer[100];
    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(buffer);    /* ! executes whatever the remote side sent */
Challenge 1: Pointers • Example (code below) • Still contains a vulnerability • Only one buffer • Variables buffer and ref are aliases • We need an accurate model of memory

    int sock;
    char buffer[100];
    char * ref = buffer;
    sock = socket(AF_INET, SOCK_STREAM, 0);
    read(sock, buffer, 100);
    execl(ref);    /* ! same vulnerability, reached through the alias */
Challenge 2: Scope • Call graph: objects flow throughout the program • No scoping constraints • Objects referenced through pointers • We need whole-program analysis

[Call-graph diagram: from main, the calls sock = socket(AF_INET, SOCK_STREAM, 0), read(sock, buffer, 100), and execl(ref) are spread across different procedures, so the vulnerability spans the whole program.]
Challenge 3: Precision • Static analysis is always an approximation • Precision: level of detail or sensitivity • Multiple calls to a procedure • Context-sensitive: analyze each call separately • Context-insensitive: merge information from all calls • Multiple assignments to a variable • Flow-sensitive: record each value separately • Flow-insensitive: merge values from all assignments • Lower precision reduces the cost of analysis: from exponential (most precise) through polynomial to roughly linear
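A small C illustration (mine, not from the talk) of the two precision dimensions:

    #include <stdio.h>

    static int id(int v) { return v; }   /* one procedure, two call sites */

    int main(void) {
        /* Flow sensitivity: x has two assignments. A flow-sensitive
           analysis knows x == 1 at the first printf and x == 2 at the
           second; a flow-insensitive one sees x in {1, 2} at both. */
        int x = 1;
        printf("%d\n", x);
        x = 2;
        printf("%d\n", x);

        /* Context sensitivity: a context-sensitive analysis derives
           a == 1 and b == 2; a context-insensitive one merges the two
           calls, so a and b are each in {1, 2}. */
        int a = id(1);
        int b = id(2);
        printf("%d %d\n", a, b);
        return 0;
    }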
Insufficient precision • Example: context-insensitivity • Information merged at the call • Analyzer reports 2 possible errors • Only 1 real error • Imprecision leads to false positives

[Diagram: in main, data from both stdin and a socket flows through a shared read into two execl calls; with the merged information both execl sites are flagged, though only the socket-fed one is a real error.]
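The shape of program behind this diagram, as a hedged C sketch (the wrapper and names are mine): both call paths funnel through one read, so a context-insensitive analysis thinks either execl might receive socket data.

    #include <unistd.h>
    #include <sys/socket.h>

    /* Shared wrapper: a context-insensitive analysis merges the two
       file descriptors (stdin and the socket) that reach fd here. */
    static void read_command(int fd, char *buf, size_t n) {
        read(fd, buf, n);
    }

    void local_case(char buf[100]) {
        read_command(0, buf, 100);        /* stdin: not remote        */
        execl(buf, buf, (char *) 0);      /* false positive           */
    }

    void remote_case(char buf[100]) {
        int sock = socket(AF_INET, SOCK_STREAM, 0);
        read_command(sock, buf, 100);     /* Internet socket          */
        execl(buf, buf, (char *) 0);      /* the one real error       */
    }
    /* Making read_command context-sensitive separates the two callers
       and eliminates the false report in local_case. */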
Cost versus precision • Problem: a tradeoff • Precise analysis is prohibitively expensive • Cheap analysis produces too many false positives • Idea: mixed-precision analysis • Focus effort on the parts of the program that matter • Don’t waste time over-analyzing the rest • Key: let the error detection problem drive precision • Client-driven program analysis
Client-Driven Algorithm • Client: the error detection analysis problem • Algorithm: • Start with a fast, cheap analysis – monitor imprecision • Determine where extra precision is needed – reanalyze

[Diagram: the Pointer Analyzer and the Client Analysis share a Memory Model; a Monitor tracks information loss in a dependence graph; an Adaptor turns that record into a Precision Policy that feeds back into the analyzer, which also produces the Error Reports.]
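In outline, the loop looks something like this toy C sketch (all names and the termination stubs are mine, not Broadway's code):

    #include <stdbool.h>
    #include <stdio.h>

    typedef struct { int upgrades; } Policy;   /* where to add CS/FS  */
    typedef struct { int imprecise; } Log;     /* monitor's findings  */

    /* Stub: pretend each precision upgrade removes one false positive. */
    static Log analyze(const Policy *policy) {
        Log log = { 2 - policy->upgrades };
        if (log.imprecise < 0) log.imprecise = 0;
        return log;
    }

    /* Stub adaptor: trace reported errors back to polluting assignments
       and request more precision there; give up when nothing helps. */
    static bool diagnose(const Log *log, Policy *policy) {
        (void) log;
        if (policy->upgrades >= 2) return false;
        policy->upgrades++;
        return true;
    }

    int main(void) {
        Policy policy = { 0 };               /* start cheap: CI-FI     */
        for (;;) {
            Log log = analyze(&policy);      /* monitor runs alongside */
            if (log.imprecise == 0) break;   /* client satisfied       */
            if (!diagnose(&log, &policy))    /* no useful fix found:   */
                break;                       /* report errors as-is    */
            /* otherwise reanalyze with the refined precision policy   */
        }
        printf("precision upgrades: %d\n", policy.upgrades);
        return 0;
    }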
Algorithm components • Monitor • Runs alongside the main analysis • Records imprecision • Adaptor • Starts at the locations of reported errors • Traces back to the cause and diagnoses it
Sources of imprecision • Polluting assignments: • Multiple procedure calls: x = foo( ), where foo( ) is reached from several call sites • Multiple assignments: x = …; x = … • Conditions: if (cond) x = foo( ) • Pointer dereference (*ptr): • Polluted target: an assignment through ptr may affect every object ptr might point to • Polluted pointer: ptr itself has an imprecise points-to set
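Concrete C fragments for each source (illustrative; the code is mine):

    static int foo(void) { return 42; }
    static int cond, a, b, x;
    static int *ptr = &a;

    void pollution_examples(void) {
        x = foo();             /* multiple calls: if foo() is reached from
                                  several contexts, CI analysis merges them */
        x = a;
        x = b;                 /* multiple assignments: FI analysis merges
                                  both values into x                        */
        if (cond) x = foo();   /* condition: both branches merge below      */
        x = *ptr;              /* polluted pointer: if pts(ptr) is imprecise,
                                  *ptr could be any of several objects      */
        *ptr = x;              /* polluted target: the store may modify
                                  every object ptr might point to           */
    }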
In action... • Monitor the analysis • Find the polluting assignments • Diagnose and apply a “fix” • In this case: make one procedure context-sensitive • Reanalyze

[Diagram: after read is analyzed context-sensitively, the stdin and socket paths through main are separated and only the real execl error remains.]
Methodology • Compare with commonly-used fixed-precision analyses • Metrics: • Accuracy – number of errors reported; includes false positives, so fewer is better • Performance – compared only when accuracy is the same
Programs • 18 real C programs • Unmodified source – all the issues of production code • Many are system tools – run in privileged mode • Representative examples: apache, named, SQLite, wu-ftp, and the ssh client and server
Error detection problems • File access: files must be open when accessed • Remote access vulnerability: data from an Internet socket should not specify a program to execute • Format string vulnerability (FSV): the format string may not contain untrusted data • Remote FSV: check whether an FSV is remotely exploitable • FTP behavior: can this program be tricked into reading and transmitting arbitrary files?
Results • Remote access vulnerability

[Bar chart: normalized analysis performance on a log scale (10X, 100X, 1000X) for five precision policies – Full (CS-FS), Medium (CI-FS), Slow (CS-FI), Fast (CI-FI), and Client-Driven – over 18 programs ordered by increasing number of CFG nodes: nn, fcron, make, named, SQLite, muh (I), pfinger, pureftp, apache, stunnel, privoxy, muh (II), cfengine, wu-ftp (I), wu-ftp (II), blackhole, ssh client, ssh server. “?” marks configurations with no reported result.]
Overall results • 90 test cases: 18 programs, 5 problems • Test case: 1 program and 1 error detection problem • Compare algorithms: client-driven vs. fixed precision • As accurate as any other algorithm: 87 out of 90 • Runs faster than the best fixed algorithm: 64 out of 87 • Performance not an issue: 19 of 23 • Both most accurate and fastest: 29 out of 64
Why does it work? • Validates our hypothesis • Different errors have different precision requirements • Amount of extra precision is small
Outline • Motivation • The Broadway Compiler • Recent work on scalable program analysis • Problem: Error checking demands powerful analysis • Solution: Client-driven analysis algorithm • Example: Detecting security vulnerabilities • Contributions • Related work • Conclusions and future work
Central contribution • Library-level compilation • Opportunity: library interfaces make domains explicit in existing programming practice • Key: a separate language for codifying domain-specific knowledge • Result: our compiler can automate previously manual error checks and performance improvements

                Knowledge representation    Applying knowledge    Results
    Old way:    Informal                    Manual                Difficult, unpredictable
    Broadway:   Codified                    Compiler              Easy, automatic, reliable
Specific contributions • Broadway compiler implementation • Working system (43K lines of C-Breeze, 23K of pointer analysis, 30K of Broadway) • Client-driven pointer analysis algorithm [SAS’03] • Precise and scalable whole-program analysis • Library-level error checking experiments [CSTR’01] • No false positives for format string vulnerability • Library-level optimization experiments [LCPC’00] • Solid improvements for PLAPACK programs • Annotation language [DSL’99] • Balance expressive power and ease of use
Related work • Configurable compilers • Power versus usability – who is the user? • Active libraries • Previous work focusing on specific domains • Few complete, working systems • Error detection • Partial program verification – paucity of results • Scalable pointer analysis • Many studies of cost/precision tradeoff • Few mixed-precision approaches
Future work • Language • More analysis capabilities • Optimization • We have only scratched the surface • Error checking • Resource leaks • Path sensitivity – conditional transfer functions • Scalable analysis • Start with cheaper analysis – unification-based • Refine to more expensive analysis – shape analysis
Annotations (I) • Dependence and pointer information • Describe pointer structures • Indicate which objects are accessed and modified

    procedure fopen(pathname, mode)
    {
        on_entry { pathname --> path_string
                   mode --> mode_string }
        access   { path_string, mode_string }
        on_exit  { return --> new file_stream }
    }
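Read against a concrete call site (my example), the fopen annotation gives the analyzer a memory model for the call:

    #include <stdio.h>

    int main(void) {
        FILE *f = fopen("my_data", "r");
        /* Per the annotation: pathname --> path_string ("my_data") and
           mode --> mode_string ("r"); both strings are only accessed,
           not modified; on exit, f points to a brand-new object,
           file_stream, distinct from everything else in memory. */
        if (f) fclose(f);
        return 0;
    }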
Annotations (II) • Library-specific properties: dataflow lattices

    property State : { Open, Closed } initially Open
    property Kind  : { File, Socket { Local, Remote } }

[Lattice diagrams: Open and Closed sit above a shared bottom element (^); File and Socket sit above a bottom, with Local and Remote refining Socket above a bottom of their own.]
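One way to picture a property lattice in C (my rendering, not Broadway's representation): each property value is a lattice element, and merging two different values at a control-flow join collapses to the bottom element (^), meaning "don't know":

    /* Sketch of the State lattice { Open, Closed } with bottom (^). */
    typedef enum { STATE_BOTTOM, STATE_OPEN, STATE_CLOSED } State;

    /* Merge at a control-flow join: agreeing values survive;
       disagreeing values collapse to ^ ("don't know"). */
    static State state_merge(State s1, State s2) {
        return (s1 == s2) ? s1 : STATE_BOTTOM;
    }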
Annotations (III) • Library routine effects: dataflow transfer functions

    procedure socket(domain, type, protocol)
    {
        analyze Kind {
            if (domain == AF_UNIX) IOHandle <- Local
            if (domain == AF_INET) IOHandle <- Remote
        }
        analyze State { IOHandle <- Open }
        on_exit { return --> new IOHandle }
    }
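Conceptually, the analyze Kind block is a transfer function: given the call's arguments, it produces the new flow value for the returned handle. A hedged C rendering (mine; Broadway derives the real one from the annotation):

    #include <sys/socket.h>

    /* Flow values for the Kind property, with ^ as "don't know". */
    typedef enum { KIND_BOTTOM, KIND_FILE, KIND_LOCAL, KIND_REMOTE } Kind;

    /* Transfer function for socket(): the annotation's two `if`s. */
    static Kind socket_kind(int domain) {
        if (domain == AF_UNIX) return KIND_LOCAL;   /* IOHandle <- Local  */
        if (domain == AF_INET) return KIND_REMOTE;  /* IOHandle <- Remote */
        return KIND_BOTTOM;                         /* anything else: ^   */
    }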
Annotations (IV) • Reports and transformations

    procedure execl(path, args)
    {
        on_entry { path --> path_string }
        reportif (Kind : path_string could-be Remote)
            "Error at " ++ $callsite ++ ": remote access";
    }

    procedure slow_routine(first, second)
    {
        when (condition) replace-with %{
            quick_check($first);
            fast_routine($first, $second);
        }%
    }
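What the replace-with pattern does at a call site, shown as before/after C (the stubs are mine; the routine names come from the annotation above):

    /* Stub declarations standing in for the library's routines. */
    static void quick_check(int first)              { (void) first; }
    static void fast_routine(int first, int second) { (void) first; (void) second; }
    static void slow_routine(int first, int second) { (void) first; (void) second; }

    /* Before optimization: */
    void caller_before(int a, int b) { slow_routine(a, b); }

    /* After Broadway proves `condition` holds at this call site: */
    void caller_after(int a, int b)  { quick_check(a); fast_routine(a, b); }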
Why does it work? • Validates our hypothesis • Different clients have different precision requirements • Amount of extra precision is small
Validation • Optimization experiments • Cons: one library – three applications • Pros: complex library – consistent results • Error checking experiments • Cons: one can quibble about the choice of errors • Pros: we set the standard for experiments • Overall: the same system designed for optimizations is among the best for detecting errors and security vulnerabilities
Type Theory • Equivalent to dataflow analysis (heresy?) • Correspondence: flow values ↔ types, transfer functions ↔ inference rules • Different in practice • Dataflow: flow-sensitive problems, iterative analysis • Types: flow-insensitive problems, constraint solver • Commonality • No magic bullet: same cost for the same precision • Extracting the store model is a primary concern • Remember Phil Wadler’s talk?
Generators • Direct support for domain-specific programming • Language extensions or new language • Generate implementation from specification • Our ideas are complementary • Provides a way to analyze component compositions • Unifies common algorithms: • Redundancy elimination • Dependence-based optimizations
Is it correct? Three separate questions: • Are Sam Guyer’s experiments correct? • Yes, to the best of our knowledge • Checked PLAPACK results • Checked detected errors against known errors • Is our compiler implemented correctly? • Flip answer: whose is? • Better answer: testing suites • How do we validate a set of annotations?
Annotation correctness Not addressed in my dissertation, but... • Theoretical approach • Does the library implement the domain? • Formally verify annotations against implementation • Practical approach • Annotation debugger: interactive • Automated assistance in early stages of development • Middle approach • Basic consistency checks
Error Checking vs Optimization
Error checking is optimistic: • False positives allowed – it can even be unsound • Tend to be “may” analyses • Correctness is absolute – “black and white”: certify programs bug-free • Cost tolerant: explore costly analysis
Optimization is pessimistic: • Must preserve semantics – soundness mandatory • Tend to be “must” analyses • Performance is relative – a spectrum of results, no guarantees • Cost sensitive: compile-time is a factor
Complexity • Pointer analysis • Address-taken: linear • Steensgaard: almost linear (log log n factor) • Andersen: polynomial (cubic) • Shape analysis: double exponential • Dataflow analysis • Intraprocedural: polynomial (height of lattice) • Context-sensitivity: exponential (call graph) • Worst-case behavior is rarely seen in practice
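For intuition on why Andersen-style analysis is cubic (my illustration): each pointer statement induces a set-inclusion constraint, and propagating all inclusions to a fixed point takes O(n^3) time in the worst case. Steensgaard instead unifies the two sides of each assignment, which is why it runs in almost linear time but is less precise.

    /* The base statements of Andersen-style analysis; the comments give
       the inclusion constraint each one induces. */
    void andersen_constraints(void) {
        int a;
        int *p, *q;
        int **r;

        p = &a;      /* address-of:  {a} subset-of pts(p)                     */
        q = p;       /* copy:        pts(p) subset-of pts(q)                  */
        r = &p;      /*              {p} subset-of pts(r)                     */
        *r = q;      /* store:  for each x in pts(r), pts(q) subset-of pts(x) */
        q = *r;      /* load:   for each x in pts(r), pts(x) subset-of pts(q) */
    }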
Optimization • Overall strategy • Exploit layers and modularity • Customize lower-level layers in context • Compiler strategy: Top-down layer processing • Preserve high-level semantics as long as possible • Systematically dissolve layer boundaries • Annotation strategy • General-purpose specialization • Idiomatic code substitutions
PLAPACK Optimizations • PLAPACK matrices are distributed over a processor grid • Optimizations exploit special cases • Example: matrix multiply

[Diagram: depending on how the operands are laid out on the processor grid, a general PLA_Gemm( , , ) call is specialized to PLA_Local_gemm or to PLA_Rankk.]
Find the error – part 3 • State-of-the-art compiler

    struct __sue_23 * var_72;
    struct __sue_25 * new_f =
        (struct __sue_25 *) malloc(sizeof (struct __sue_25));
    _IO_no_init(& new_f->fp.file, 1, 0, ((void *) 0), ((void *) 0));
    (& new_f->fp)->vtable = & _IO_file_jumps;
    _IO_file_init(& new_f->fp);
    if (_IO_file_fopen((struct __sue_23 *) new_f, filename, mode, is32)
            != ((void *) 0)) {
        var_72 = & new_f->fp.file;
        if ((var_72->_flags2 & 1) && (var_72->_flags & 8)) {
            if (var_72->_mode <= 0)
                ((struct __sue_23 *) var_72)->vtable =
                    & _IO_file_jumps_maybe_mmap;
            else
                ((struct __sue_23 *) var_72)->vtable =
                    & _IO_wfile_jumps_maybe_mmap;
            var_72->_wide_data->_wide_vtable = & _IO_wfile_jumps_maybe_mmap;
        }
    }
    if (var_72->_flags & 8192U)
        _IO_un_link((struct __sue_23 *) var_72);
    if (var_72->_flags & 8192U)
        status = _IO_file_close_it(var_72);
    else
        status = var_72->_flags & 32U ? - 1 : 0;
    ((* (struct _IO_jump_t * *) ((void *) (& ((struct __sue_23 *) (var_72))->vtable)
            + (var_72)->_vtable_offset))->__finish)(var_72, 0);
    if (var_72->_mode <= 0)
        if (((var_72)->_IO_save_base != ((void *) 0)))
            _IO_free_backup_area(var_72);
    if (var_72 != ((struct __sue_23 *) (& _IO_2_1_stdin_))
            && var_72 != ((struct __sue_23 *) (& _IO_2_1_stdout_))
            && var_72 != ((struct __sue_23 *) (& _IO_2_1_stderr_))) {
        var_72->_flags = 0;
        free(var_72);
    }
    bytes_read = _IO_sgetn(var_72, (char *) var_81, bytes_requested);