Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C. K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan). Overview. Problems: Imprecision in C tools. High development cost of C tools.
Binary-Level Lightweight Data Integration to Develop Program Understanding Tools for Embedded Software in C K. Gondow (Titech, Japan) T. Suzuki (Elmic System Inc, Japan) H. Kawashima (JAIST, Japan) APSEC@BUSAN
Overview
• Problems:
  • Imprecision in C tools.
  • High development cost of C tools.
• Our solution:
  • Binary-level lightweight data integration.
  • As a testbed, DWARF2 is used to develop:
    • dxref, rxref: cross-referencers
    • bscg: a call-graph extractor
Imprecision in C tools (1/3)
• e.g., GNU GLOBAL cannot distinguish a variable 'foo' from a label 'foo'.
• Users must pick the right one from a candidate list.
• This is because GNU GLOBAL only partially analyzes source code in order to run very fast.
    int main (void)
    {
        int foo;
        foo: goto foo;
    }
• Candidate list shown after clicking 'foo':
    foo  3  test.c  int foo;
    foo  4  test.c  foo: goto foo;
Imprecision in C tools (2/3)
• e.g., Murphy's study:
  • "An Empirical Study of Static Call Graph Extractors", by Murphy, et al., ICSE, 1996.
  • It reports that "call graphs extracted by several broadly distributed tools vary significantly enough to surprise many experienced software engineers."
Imprecision in C tools (3/3)
• Quantitative results from mosaic, quoted from Murphy's paper.
[Table columns: cflow ∩ Field, cflow − Field, Field − cflow]
Why imprecision? (1/2)
• Reason #1: many tools only partially parse source code, resulting in incomplete analysis.
  • e.g., GNU GLOBAL, cxref, LXR, cscope, cflow...
• At first glance, full parsing seems to solve this problem, but...
Why imprecision? (2/2)
• Reason #2: C source code is difficult to fully analyze because of:
  • Compiler-specific extensions.
    • e.g., asm for inline assembly code.
  • Ambiguous behaviors in the C standards.
    • undefined, unspecified, implementation-defined.
    • e.g., padding in a structure.
Compiler-specific extensions
• Essential in C and embedded software.
  • e.g., asm is used to obtain a H/W error code.
  • e.g., long long is used in C89's <stdio.h>.
• They make it hard to analyze source code.
  • Different compilers give them different semantics.
    void page_fault_handler (uint32_t error)
    {
        uint32_t cr2;
        asm volatile ("movl %%cr2,%0":"=r"(cr2));  /* IA-32 control register #2 */
        ...
    }
Ambiguous behaviors in C (1/2)
• Intentional and essential to keep C compilers fast and simple.
  • e.g., padding in a structure is an implementation-defined behavior.
• This makes pointer analysis hard.
  • "Pointer analysis for programs with structures and casts", by Suan Hsi Yong, et al., PLDI'99.
Ambiguous behaviors in C (2/2)
    struct S {char c; int *ip; } *p;
    struct T {char c; int i; } t;
    t.i = 0x1234;
    p = (struct S *)&t;
    printf ("%p\n", p->ip);
• Padding differs from platform to platform.
• To obtain precise dataflow, tools need to know the padding that the compiler inserts.
• But this is hard...
[Figure: layouts of struct S and struct T (c, padding, i / ip) on Solaris8 (32bit) and Solaris8 (64bit)]
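The padding question on this slide can be checked for one particular compiler with offsetof. The following is a minimal sketch, not part of the original slides; the helper names offset_of_ip, offset_of_i and ip_starts_at_i are hypothetical, and the values they return depend on the platform and compiler in use.

```c
#include <stddef.h>

/* The two structures from the slide. */
struct S { char c; int *ip; };
struct T { char c; int i; };

/* Byte offset of the pointer member in struct S on this platform. */
size_t offset_of_ip(void) { return offsetof(struct S, ip); }

/* Byte offset of the int member in struct T on this platform. */
size_t offset_of_i(void) { return offsetof(struct T, i); }

/* Nonzero when S.ip starts at the same byte offset as T.i, i.e. when
   reading p->ip after the cast starts at the bytes written through t.i.
   Zero means the two members sit at different offsets on this platform. */
int ip_starts_at_i(void) { return offset_of_ip() == offset_of_i(); }
```

A source-level tool cannot evaluate these offsets without reimplementing the compiler's layout rules, whereas the compiler has already fixed them by the time it emits debugging information.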
Possible solutions
• Modify compilers (e.g., GCC) to emit their internal analysis data.
  • The development cost appears to be high.
  • Many compilers would have to be modified.
• Use the binary information in executables emitted by compilers.
  • Relatively easy, although some information is missing, e.g., statements.
Our solution and result
• Our solution:
  • Uses DWARF2 debugging information as the binary information.
• Preliminary experiment:
  • Good results for our cross-referencers and call-graph extractor.
  • Better precision, although:
    • the number of false negatives increased somewhat.
    • quantitative results have not yet been obtained.
Demonstration
• Using DWARF2, we implemented:
  • two cross-referencers:
    • dxref: only uses DWARF2
      • Sample output: dxref
    • rxref: hybrid of dxref and GNU GLOBAL
      • Sample output: dxref
  • a static call-graph extractor:
    • bscg: uses DWARF2 and a disassembler.
      • Sample outputs: fact, dxref, bash, bash
DWARF2-XML
[Figure: C code is compiled into an ELF/DWARF2 binary (text, data, symbol info., relocation info., debug info.); data integration extracts the common format DWARF2-XML, which is used by dxref, rxref (cross-referencers) and bscg (call graph extractor)]
How bscg works
(1) Extract call instructions by disassembling the text section.
(2) Convert addresses to symbols using DWARF2.
    e.g., "1234: call 5678" becomes "main: call fact"
(3) Trim call graphs according to options.
(4) Output the graph topology in DOT of Graphviz.
    digraph G { main -> fact; fact -> fact; }
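Steps (2) and (4) above can be sketched as follows. This is an illustration under stated assumptions, not bscg's actual code: the struct sym and struct call records and the functions addr_to_symbol and emit_dot are hypothetical, the address ranges are assumed to have been read from DWARF2 already (DW_AT_low_pc inclusive, DW_AT_high_pc exclusive), and the addresses in the slide are treated as hexadecimal.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical symbol record: address range and name of one function. */
struct sym { unsigned long low_pc; unsigned long high_pc; const char *name; };

/* Hypothetical record from step (1): address of a call instruction and its target. */
struct call { unsigned long site; unsigned long target; };

/* Step (2): map an address to the function whose [low_pc, high_pc) range contains it. */
const char *addr_to_symbol(const struct sym *syms, size_t n, unsigned long addr)
{
    for (size_t i = 0; i < n; i++)
        if (addr >= syms[i].low_pc && addr < syms[i].high_pc)
            return syms[i].name;
    return NULL;
}

/* Append text to buf if it fits; returns the new length. */
static size_t append(char *buf, size_t bufsize, size_t used, const char *text)
{
    size_t len = strlen(text);
    if (used + len + 1 > bufsize)
        return used;                 /* no room: leave the buffer unchanged */
    memcpy(buf + used, text, len + 1);
    return used + len;
}

/* Step (4): write the resolved edges as a DOT digraph into buf.
   Returns the number of resolved edges; calls whose caller or callee
   cannot be resolved are skipped. Assumes buf is large enough. */
size_t emit_dot(const struct sym *syms, size_t nsyms,
                const struct call *calls, size_t ncalls,
                char *buf, size_t bufsize)
{
    size_t used = 0, edges = 0;
    if (bufsize == 0)
        return 0;
    buf[0] = '\0';
    used = append(buf, bufsize, used, "digraph G {");
    for (size_t i = 0; i < ncalls; i++) {
        const char *caller = addr_to_symbol(syms, nsyms, calls[i].site);
        const char *callee = addr_to_symbol(syms, nsyms, calls[i].target);
        if (caller == NULL || callee == NULL)
            continue;
        used = append(buf, bufsize, used, " ");
        used = append(buf, bufsize, used, caller);
        used = append(buf, bufsize, used, " -> ");
        used = append(buf, bufsize, used, callee);
        used = append(buf, bufsize, used, ";");
        edges++;
    }
    append(buf, bufsize, used, " }");
    return edges;
}
```

A linear scan keeps the sketch short; a real tool handling thousands of functions would more likely sort the ranges and use binary search.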
Advantages of bscg
• Advantages of binary-level DI (explained later).
  • e.g., high applicability and few false positives.
• Can identify inlined functions.
• Can extract a call from asm ("call fact");
• Can exclude:
  • library functions: e.g., printf
  • system calls: e.g., open, fork
  • functions in runtime systems: _start, _fini
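The exclusion of library functions, system calls and runtime functions can be pictured as a filter over call-graph edges. The sketch below is only an assumption about how such trimming might look; the slides do not say how bscg decides what to exclude, and the fixed name list, struct edge, is_excluded and trim_edges are all hypothetical.

```c
#include <string.h>

/* Hypothetical exclusion list, built from the examples on the slide. */
static const char *const excluded[] = { "printf", "open", "fork", "_start", "_fini" };

/* Nonzero if name is on the exclusion list. */
int is_excluded(const char *name)
{
    for (size_t i = 0; i < sizeof excluded / sizeof excluded[0]; i++)
        if (strcmp(name, excluded[i]) == 0)
            return 1;
    return 0;
}

/* One call-graph edge, identified by function names. */
struct edge { const char *caller; const char *callee; };

/* Compact the edge array in place, dropping every edge that touches an
   excluded function. Returns the number of edges kept. */
size_t trim_edges(struct edge *edges, size_t n)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!is_excluded(edges[i].caller) && !is_excluded(edges[i].callee))
            edges[kept++] = edges[i];
    return kept;
}
```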
Disadvantages of bscg
• No support for macro calls, signals, function pointers, or optimization.
  • gprof-callgraph.pl can handle function pointers, since it uses dynamic information.
  • Source-level tools (e.g., cflow) do not suffer from the optimization problem.
So, is bscg good?
• Yes! (not the best, of course)
• Not easy to compare.
What is binary-level DI?
• Provides common formats by extracting information from binary code.
[Figure: source code (*.c) is compiled into binary code (a.out); source DI analyzes the source code, binary DI analyzes the binary code; both yield common formats (e.g., DWARF2-XML) that are used by tools]
Why binary-level DI?
• Many advantages:
  • High applicability.
  • Few false positives.
  • More true positives for low-level information.
  • Low development cost.
• Can improve the precision of C tools.
What is lightweight DI?
• Allows several common formats.
• To be practical! Perfect integration is hard.
[Figure: heavyweight DI vs. lightweight DI (DWARF2-XML)]
Summary
• Imprecision in C tools.
• Our solution:
  • Binary-level lightweight data integration.
  • As a testbed, DWARF2 is used to develop:
    • dxref, rxref: cross-referencers
    • bscg: call-graph extractor
Future work
• Apply our technique to other tools:
  • e.g., memory profilers, slicers, test coverage tools, ...
• Develop new binary formats suitable for lower CASE tools.
  • tool-information carrying code.
  • cf. proof-carrying code, model-carrying code, schedule-carrying code.
Taxonomy of cross-referencers
• Source-level
  • Partial-parsing: GNU GLOBAL, LXR, ...
  • Full-parsing: Sapid, ACML
• Binary-level
  • Symbol tables: Visual Studio .NET(?)
  • Debug info.: dxref
• Hybrid: rxref
What is DWARF2?
• A binary format for debugging information.
• Primary target languages:
  • C, C++, Fortran, Modula2, Pascal.
• Includes:
  • types, nested blocks, line numbers, function/object names, addresses, stack frame information, ...
DWARF2-XML
• Our common format in XML for DWARF2.
  • A testbed of binary-level lightweight DI.
• Makes it easier to process DWARF2.
  • cf. libdwarf
• About 15 times larger than DWARF2.
DWARF2-XML example
C source: { int i; ... }
    <section name=".debug_info">
      <tag name="DW_TAG_lexical_block" offset="id:27">
        <attribute name="DW_AT_low_pc" value="67328"/>
        <attribute name="DW_AT_high_pc" value="67356"/>
        ...
        <tag name="DW_TAG_variable" offset="id:27">
          <attribute name="DW_AT_name" value="i"/>
          <attribute name="DW_AT_type" value_ref="id:161">
          <attribute name="DW_AT_location">
            <description>DW_OP_fbreg: -24</description></></></></>
      ...
      <tag name="DW_TAG_base_type" offset="id:161">
        <attribute name="DW_AT_name" value="int"/>
        <attribute name="DW_AT_byte_size" value="4"/>
        <attribute name="DW_AT_encoding" value="5">
          <description>signed</description></></></>
Annotations in the figure: address range (DW_AT_low_pc, DW_AT_high_pc), variable name (DW_AT_name), ID/IDREF link (value_ref="id:161" to offset="id:161"), offset to base ptr. (DW_OP_fbreg: -24).
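The ID/IDREF link in this example (the variable's value_ref pointing at the base type's offset) is what a tool follows to find the type of a variable. A minimal sketch of that lookup follows, assuming the XML has already been parsed into a flat array. The struct die layout and the functions find_by_id and type_name_of are hypothetical and are not taken from dxref or any XML library.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of one <tag> element; only the fields used here. */
struct die {
    const char *id;        /* value of the offset attribute, e.g. "id:161" */
    const char *tag;       /* e.g. "DW_TAG_variable" */
    const char *name;      /* DW_AT_name, or NULL if absent */
    const char *type_ref;  /* value_ref of DW_AT_type, or NULL if absent */
};

/* Follow an IDREF: return the first entry whose id matches, or NULL. */
const struct die *find_by_id(const struct die *dies, size_t n, const char *id)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(dies[i].id, id) == 0)
            return &dies[i];
    return NULL;
}

/* Name of the type that the variable var_name refers to, or NULL when the
   variable, its type reference, or the referenced entry is missing. */
const char *type_name_of(const struct die *dies, size_t n, const char *var_name)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(dies[i].tag, "DW_TAG_variable") != 0)
            continue;
        if (dies[i].name == NULL || strcmp(dies[i].name, var_name) != 0)
            continue;
        if (dies[i].type_ref == NULL)
            return NULL;
        const struct die *type = find_by_id(dies, n, dies[i].type_ref);
        return type != NULL ? type->name : NULL;
    }
    return NULL;
}
```

With entries for the variable i (type_ref "id:161") and the base type int (id "id:161"), type_name_of returns "int" for "i".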
DWARF2-XML file sizes
• About 15 times larger than DWARF2.
  • The size increase is almost cancelled by gzip.
• Consumes a lot of memory when using DOM.
  • e.g., we cannot build a DOM tree for gdb in our environment (gdb's LOC is about 400,000).
• There is a tradeoff between memory consumption and low development cost.
Execution speed
• bscg is slower than the other tools, but acceptable for practical use.
  • 12000 lines in 8.8 sec.
  • but it is too slow in the case of bash-2.03.
• bscg has a scalability problem due to the heavy overhead of the DOM library.
Why XML?
• Highly readable, portable, interoperable.
  • Plain text and self-descriptive.
• Powerful enough to describe complex structures and relations in programs.
  • Nested tags and ID/IDREF links.
  • DTD for checking XML documents.
  • Flexibility to process semi-structured documents.
• Easy to query/display/modify.
  • XML parsers, DOM/SAX, XPath.
  • An XPath expression is much shorter than hand-written tree traversal code.
Drawbacks in API integration (e.g., libdwarf)
• Insufficient abstraction.
  • The many and varied data structures and access patterns make it hard to encapsulate them well in a fixed API.
  • e.g., libdwarf offers only a poor API for traversing a wide range of the data tree (only dwarf_siblingof and dwarf_child are provided).
• High cost to implement the API in many languages.
• High cost to learn how to use the API.
false/true positive/negative
• false positives
  • tool's incorrect output.
• true positives
  • tool's correct output.
• false negatives
  • tool's incorrect silence.
  • the tool should have produced output, but did not.
• true negatives
  • tool's correct silence.
  • the tool should not have produced output, and did not.
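For a call-graph extractor, these definitions amount to comparing the set of edges a tool reports against the set of edges that really exist. The following sketch counts the first three categories. The types struct edge and struct counts and the function classify are hypothetical, both edge sets are assumed to contain no duplicates, and true negatives are left out because counting them needs the universe of all possible edges.

```c
#include <stddef.h>
#include <string.h>

/* One call-graph edge, identified by function names. */
struct edge { const char *caller; const char *callee; };

/* Counts of true positives, false positives and false negatives. */
struct counts { size_t true_pos; size_t false_pos; size_t false_neg; };

/* Nonzero if edge e occurs in set. */
static int contains(const struct edge *set, size_t n, struct edge e)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(set[i].caller, e.caller) == 0 &&
            strcmp(set[i].callee, e.callee) == 0)
            return 1;
    return 0;
}

/* Compare the edges a tool reported against the edges that actually exist. */
struct counts classify(const struct edge *reported, size_t nrep,
                       const struct edge *actual, size_t nact)
{
    struct counts c = {0, 0, 0};
    for (size_t i = 0; i < nrep; i++) {
        if (contains(actual, nact, reported[i]))
            c.true_pos++;      /* correct output */
        else
            c.false_pos++;     /* incorrect output */
    }
    for (size_t i = 0; i < nact; i++)
        if (!contains(reported, nrep, actual[i]))
            c.false_neg++;     /* incorrect silence */
    return c;
}
```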
bscg's graph trimming options
Why lightweight DI?
• To be practical! Perfect integration is hard.
• Supported by the fact that most technologies have given up on perfect integration/definition.
  • e.g., undefined behaviors in C.
  • e.g., GNU BFD provides an API that integrates different binary formats.
    • useful, but not perfect.
    • cannot convert ELF/DWARF2 into Windows PE.
Why is function pointer analysis difficult in C?
• Pointer arithmetic and casting.
  • e.g., (int (*)())(base + offset)
• Dynamic libraries.
  • e.g.,
    handle = dlopen (libname, RTLD_LAZY);
    func = dlsym (handle, funcname);
    func ();
• Inline assembly code.
  • e.g., asm ("call foo");
CASE tools development cost
• Generally very high.
  • individual parsers & analyzers.
  • internal data is less interoperable and portable.
• IBM Eclipse
  • $40,000,000 (?)
E.g., function pointer
• Cflow
  • apply calls f (false positive)
• gprof-callgraph.pl
  • apply calls add5 (true positive)
• Other tools (bscg)
  • apply calls ? (false negative)
    int add5 (int x) { return x + 5; }
    int apply (int (*f)(int), int x) { return f (x); }
    int main (void) { return apply (add5, 10); }
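The reason a tool using dynamic information can report "apply calls add5" is that the target of f is known once the program runs. The sketch below imitates this with hand-written instrumentation. It is a simplified stand-in and not how gprof or gprof-callgraph.pl works; int_fn, last_indirect_callee, run_example and observed_callee are hypothetical names.

```c
#include <stddef.h>

typedef int (*int_fn)(int);

/* Last function reached through a function pointer; a stand-in for the
   run-time data that a profiler would collect. */
static int_fn last_indirect_callee = NULL;

int add5(int x) { return x + 5; }

int apply(int_fn f, int x)
{
    last_indirect_callee = f;   /* instrumentation: record the run-time target */
    return f(x);
}

/* Same call as main in the slide: apply (add5, 10). */
int run_example(void) { return apply(add5, 10); }

/* The callee observed during execution, or NULL if apply has not run yet. */
int_fn observed_callee(void) { return last_indirect_callee; }
```

Before run_example is called, nothing is known about the target of f, which mirrors the position of a static tool; afterwards observed_callee returns add5.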
Our homepage
• http://www.sde.cs.titech.ac.jp/~gondow/dwarf2-xml/
  • DTD for DWARF2-XML
  • Source code of readelf+, dxref, rxref, bscg
  • Some sample outputs