
Static Code Analysis

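Static Code Analysis. 梁广泰 (Liang Guangtai), 2011-05-25. Outline: motivation; static program analysis (concepts and examples); program defect analysis (research work). Motivation: on a cloud platform, applications are deployed directly on cloud servers, which creates security risks: an application can directly manipulate and damage the server's file system, and a security vulnerability in it can give attackers an entry point. Resources are shared and dynamically allocated, so a single poorly performing application can take resources away from other applications. One solution is to run static code analysis on an application before deploying it: Does it make forbidden calls (illegal file access)? Does it contain inefficient code (heavy String concatenation without StringBuilder)? Does it contain security vulnerabilities (SQL injection, cross-site scripting, denial of service)?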



  1. Static Code Analysis. 梁广泰 (Liang Guangtai), 2011-05-25

  2. Outline • Motivation • Static program analysis (concepts and examples) • Program defect analysis (research work)

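  3. Motivation • Characteristics of cloud platforms • Applications are deployed directly on cloud servers, which creates security risks • An application can directly manipulate and damage the server's file system • A security vulnerability in an application can give attackers an entry point • Resources are shared and dynamically allocated • A single poorly performing application can take resources away from other applications • One solution: • Run static code analysis on an application before deploying it: • Does it make forbidden calls? (illegal file access) • Does it contain inefficient code? (heavy String concatenation without StringBuilder) • Does it contain security vulnerabilities? (SQL injection, cross-site scripting, denial of service) • Does it contain malware? • ……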

  4. Outline • Motivation • Static program analysis (concepts and examples) • Program defect analysis (research work)

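  5. Static Code Analysis • Definition: • Static program analysis, or static analysis for short, is the analysis of a program without executing it. • Compared with: • Dynamic program analysis: requires actually executing the program • Program comprehension: the term "static analysis" usually refers to analysis done by automated tools, whereas manual analysis is usually called program comprehension • Uses: • Program translation/compilation (compilers), program optimization and refactoring, software defect detection, etc. • Process: • In most cases the input to static analysis is source code or intermediate code (such as Java bytecode); object code is used only rarely. The analysis results are output in a specific form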

  6. Static Code Analysis • Basic Blocks • Control Flow Graph • Dataflow Analysis • Live Variable Analysis • Reaching Definition Analysis • Lattice Theory

  7. Basic Blocks • A basic block is a maximal sequence of consecutive three-address instructions with the following properties: • The flow of control can only enter the basic block through the first instruction. • Control will leave the block without halting or branching, except possibly at the last instruction. • Basic blocks become the nodes of a flow graph, with edges indicating the order.

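  8. Basic Block Example (leaders are marked with *; A to F are the resulting basic blocks)
      A   * (1)  i = 1
      B   * (2)  j = 1
      C   * (3)  t1 = 10 * i
            (4)  t2 = t1 + j
            (5)  t3 = 8 * t2
            (6)  t4 = t3 - 88
            (7)  a[t4] = 0.0
            (8)  j = j + 1
            (9)  if j <= 10 goto (3)
      D   * (10) i = i + 1
            (11) if i <= 10 goto (2)
      E   * (12) i = 1
      F   * (13) t5 = i - 1
            (14) t6 = 88 * t5
            (15) a[t6] = 1.0
            (16) i = i + 1
            (17) if i <= 10 goto (13)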
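
The leader rules behind this example can be sketched in Python. This is only an illustration under assumptions: the instruction strings, the `goto (n)` pattern, and the helper names `find_leaders` and `basic_blocks` are made up for this sketch and do not come from the slides. An instruction is treated as a leader if it is the first instruction, the target of a jump, or the instruction right after a jump.

```python
import re

# The 17 three-address instructions of the example; list index 0 holds instruction (1).
CODE = [
    "i = 1",
    "j = 1",
    "t1 = 10 * i",
    "t2 = t1 + j",
    "t3 = 8 * t2",
    "t4 = t3 - 88",
    "a[t4] = 0.0",
    "j = j + 1",
    "if j <= 10 goto (3)",
    "i = i + 1",
    "if i <= 10 goto (2)",
    "i = 1",
    "t5 = i - 1",
    "t6 = 88 * t5",
    "a[t6] = 1.0",
    "i = i + 1",
    "if i <= 10 goto (13)",
]

def find_leaders(code):
    """Return the sorted 1-based numbers of all leader instructions."""
    leaders = {1}  # the first instruction is always a leader
    for number, instr in enumerate(code, start=1):
        match = re.search(r"goto \((\d+)\)", instr)
        if match:
            leaders.add(int(match.group(1)))  # the target of a jump is a leader
            if number < len(code):
                leaders.add(number + 1)       # so is the instruction after a jump
    return sorted(leaders)

def basic_blocks(code):
    """Split the code into (first, last) instruction ranges, one per basic block."""
    leaders = find_leaders(code)
    ends = [nxt - 1 for nxt in leaders[1:]] + [len(code)]
    return list(zip(leaders, ends))

print(find_leaders(CODE))
print(basic_blocks(CODE))
```

Each block runs from one leader up to the instruction just before the next leader, which gives the six ranges corresponding to blocks A to F.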

  9. Control-Flow Graphs • Control-flow graph: • Node: an instruction or sequence of instructions (a basic block) • Two instructions i, j are in the same basic block iff execution of i guarantees execution of j • Directed edge: potential flow of control • Distinguished Entry and Exit nodes • First and last instruction in the program

  10. Control-Flow Edges • Basic blocks = nodes • Edges: • Add a directed edge from B1 to B2 if: • There is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or • B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto) • Definition of predecessor and successor: if there is an edge from B1 to B2, then • B1 is a predecessor of B2 • B2 is a successor of B1
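
The two edge rules can be sketched as follows for the six blocks A to F of the basic block example. The `BRANCHES` table is an assumed encoding invented for this sketch (block name mapped to its branch target and whether the branch is unconditional), and the Entry and Exit nodes are left out to keep it short.

```python
# Blocks of the A-F example in program order.
ORDER = ["A", "B", "C", "D", "E", "F"]

# Assumed encoding of each block's final branch: block -> (target block, unconditional?).
# C ends with "if j <= 10 goto (3)", D with "if i <= 10 goto (2)",
# F with "if i <= 10 goto (13)"; A, B and E do not end with a branch.
BRANCHES = {"C": ("C", False), "D": ("B", False), "F": ("F", False)}

def cfg_edges(order, branches):
    """Build the edge set from the branch rule and the fall-through rule."""
    edges = set()
    for index, block in enumerate(order):
        branch = branches.get(block)
        if branch is not None:
            edges.add((block, branch[0]))            # rule 1: branch to a leader
        unconditional = branch is not None and branch[1]
        if not unconditional and index + 1 < len(order):
            edges.add((block, order[index + 1]))     # rule 2: fall through
    return edges

def successors(edges, block):
    return sorted(dst for src, dst in edges if src == block)

def predecessors(edges, block):
    return sorted(src for src, dst in edges if dst == block)

EDGES = cfg_edges(ORDER, BRANCHES)
print(sorted(EDGES))
print(predecessors(EDGES, "B"), successors(EDGES, "C"))
```

Block B ends up with predecessors A (fall-through) and D (the outer loop's back edge), and block C is its own successor because of the inner loop.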

  11. CFG Example

  12. Static Code Analysis • Basic Blocks • Control Flow Graph • Dataflow Analysis • Live Variable Analysis • Reaching Definition Analysis • Lattice Theory

  13. Dataflow Analysis • Compile-Time Reasoning About • Run-Time Values of Variables or Expressions • At Different Program Points • Which assignment statements produced value of variable at this point? • Which variables contain values that are no longer used after this program point? • What is the range of possible values of variable at this program point? • ……

  14. Program Points • One program point before each node • One program point after each node • Join point – point with multiple predecessors • Split point – point with multiple successors

  15. Live Variable Analysis • A variable v is live at point p if • v is used along some path starting at p, and • there is no definition of v along that path before the use. • When is a variable v dead at point p? • There is no use of v on any path from p to the exit node, or • all paths from p redefine v before using v.

  16. What Use is Liveness Information? • Register allocation. • If a variable is dead, can reassign its register • Dead code elimination. • Eliminate assignments to variables not read later. • But must not eliminate last assignment to variable (such as instance variable) visible outside CFG. • Can eliminate other dead assignments. • Handle by making all externally visible variables live on exit from CFG

  17. Conceptual Idea of Analysis • start from exit and go backwards in CFG • Compute liveness information from end to beginning of basic blocks

  18. Liveness Example
      • Assume a, b, c are visible outside the method, so they are live on exit
      • Assume x, y, z, t are not visible
      • Represent liveness using a bit vector; the order is a b c x y z t
      Block                                live at entry   live at exit
      a = x+y; t = a; c = a+x; x == 0      0101110         1100111
      b = t+z;                             1000111         1100100
      c = y+1;                             1100100         1110000
      (The first block branches on x == 0 to either of the other two blocks; b = t+z; falls through to c = y+1;.)

  19. Formalizing Analysis • Each basic block has • IN - set of variables live at start of block • OUT - set of variables live at end of block • USE - set of variables with upwards exposed uses in block (use prior to definition) • DEF - set of variables defined in block prior to use • USE[x = z; x = x+1;] = { z } (x not in USE) • DEF[x = z; x = x+1; y = 1;] = {x, y} • Compiler scans each basic block to derive USE and DEF sets
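
A minimal sketch of how USE and DEF could be derived by scanning a block once, front to back. The statement encoding (a defined variable paired with the list of variables it reads) and the function name `use_def` are assumptions made for this illustration.

```python
def use_def(block):
    """Compute (USE, DEF) for a block given as [(defined_var, [used_vars]), ...]."""
    use, defs = set(), set()
    for target, operands in block:
        for var in operands:
            if var not in defs:
                use.add(var)     # read before any definition in this block
        if target not in use:
            defs.add(target)     # defined before any use in this block
    return use, defs

# The two examples from the slide: x = z; x = x+1;  and  x = z; x = x+1; y = 1;
print(use_def([("x", ["z"]), ("x", ["x"])]))
print(use_def([("x", ["z"]), ("x", ["x"]), ("y", [])]))
```

In the first example x is read in `x = x+1`, but only after `x = z` has defined it, so x lands in DEF rather than USE.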

  20. Algorithm
      for all nodes n in N - { Exit }
          IN[n] = emptyset;
      OUT[Exit] = emptyset;
      IN[Exit] = USE[Exit];
      Changed = N - { Exit };
      while (Changed != emptyset)
          choose a node n in Changed;
          Changed = Changed - { n };
          OUT[n] = emptyset;
          for all nodes s in successors(n)
              OUT[n] = OUT[n] U IN[s];
          IN[n] = USE[n] U (OUT[n] - DEF[n]);
          if (IN[n] changed)
              for all nodes p in predecessors(n)
                  Changed = Changed U { p };
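
The worklist algorithm can be tried out on the three-block liveness example from slide 18. The following is a sketch, not the author's implementation: the block names B1 to B3, the dictionary layout, and the helper `bits` are assumptions, and the CFG edges (B1 branches to B2 or B3, B2 falls through to B3) are inferred from the bit vectors on that slide.

```python
def liveness(succ, use, defs, exit_node="Exit"):
    """Backward worklist liveness analysis; returns (IN, OUT) as dicts of sets."""
    preds = {n: [] for n in succ}
    for node, targets in succ.items():
        for target in targets:
            preds[target].append(node)
    live_in = {n: set() for n in succ}
    live_out = {n: set() for n in succ}
    live_in[exit_node] = set(use[exit_node])
    changed = [n for n in succ if n != exit_node]
    while changed:
        n = changed.pop()
        # OUT[n] is the union of IN[s] over all successors s of n
        live_out[n] = set().union(*(live_in[s] for s in succ[n]))
        new_in = use[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            for p in preds[n]:
                if p not in changed:
                    changed.append(p)
    return live_in, live_out

def bits(variables, order="abcxyzt"):
    """Render a set of variable names as a bit vector string."""
    return "".join("1" if v in variables else "0" for v in order)

# B1: a = x+y; t = a; c = a+x; x == 0    B2: b = t+z;    B3: c = y+1;
SUCC = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": ["Exit"], "Exit": []}
USE = {"B1": {"x", "y"}, "B2": {"t", "z"}, "B3": {"y"}, "Exit": {"a", "b", "c"}}
DEF = {"B1": {"a", "t", "c"}, "B2": {"b"}, "B3": {"c"}, "Exit": set()}

LIVE_IN, LIVE_OUT = liveness(SUCC, USE, DEF)
for block in ("B1", "B2", "B3"):
    print(block, bits(LIVE_IN[block]), bits(LIVE_OUT[block]))
```

Making a, b, c the USE set of Exit is one way to apply the earlier rule that externally visible variables are treated as live on exit from the CFG.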

  21. Static Code Analysis - Concepts • Basic Blocks • Control Flow Graph • Dataflow Analysis • Live Variable Analysis • Reaching Definition Analysis • Lattice Theory

  22. Reaching Definitions • Concept of definition and use • a = x+y is a definition of a and a use of x and y • A definition reaches a use if the value written by the definition may be read by the use

  23. Reaching Definitions • Example CFG: [s = 0; a = 4; i = 0; k == 0] → [b = 1;] or [b = 2;] → [i < n] → [s = s + a*b; i = i + 1;] (loops back to i < n) or [return s]

  24. Reaching Definitions and Constant Propagation • Is a use of a variable a constant? • Check all reaching definitions • If all assign variable to same constant • Then use is in fact a constant • Can replace variable with constant
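
The check described here can be sketched in a few lines. The data layout (each reaching definition as a pair of variable name and assigned constant, with `None` for a non-constant right-hand side) and the name `constant_at_use` are assumptions for illustration. The list below holds the definitions that reach the statement s = s + a*b in the example CFG, including the two definitions inside the loop itself.

```python
def constant_at_use(var, reaching_defs):
    """Return the constant value of var at a use, or None if it is not a known constant.

    reaching_defs: iterable of (variable, constant_or_None) pairs for every
    definition that reaches the use.
    """
    values = {value for name, value in reaching_defs if name == var}
    if len(values) == 1:
        (value,) = values
        return value   # still None if the only reaching definition is non-constant
    return None        # no definition, or definitions that disagree

# Definitions reaching "s = s + a*b" (None marks a non-constant right-hand side).
REACHING = [
    ("s", 0), ("a", 4), ("i", 0),   # s = 0; a = 4; i = 0;
    ("b", 1), ("b", 2),             # b = 1; and b = 2; on the two branches
    ("s", None), ("i", None),       # s = s + a*b; i = i + 1; via the loop back edge
]

print(constant_at_use("a", REACHING))
print(constant_at_use("b", REACHING))
print(constant_at_use("s", REACHING))
```

Only a qualifies: b has two reaching definitions with different constants, and s is also reached by the non-constant definition inside the loop.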

  25. Is a Constant in s = s+a*b? • CFG: [s = 0; a = 4; i = 0; k == 0] → [b = 1;] or [b = 2;] → [i < n] → [s = s + a*b; i = i + 1;] (loops back to i < n) or [return s] • Yes! On all reaching definitions a = 4

  26. Constant Propagation Transform • CFG after the transform: [s = 0; a = 4; i = 0; k == 0] → [b = 1;] or [b = 2;] → [i < n] → [s = s + 4*b; i = i + 1;] (loops back to i < n) or [return s] • Yes! On all reaching definitions a = 4

  27. Computing Reaching Definitions • Compute with sets of definitions • represent sets using bit vectors • each definition has a position in the bit vector • At each basic block, compute • definitions that reach the start of the block • definitions that reach the end of the block • Do the computation by simulating execution of the program until a fixed point is reached

  28. Reaching Definitions Example (bit positions 1 2 3 4 5 6 7, one per numbered definition)
      Block                                   reaches start   reaches end
      1: s = 0; 2: a = 4; 3: i = 0; k == 0    0000000         1110000
      4: b = 1;                               1110000         1111000
      5: b = 2;                               1110000         1110100
      i < n                                   1111111         1111111
      6: s = s + a*b; 7: i = i + 1;           1111111         0101111
      return s                                1111111         1111111
      (On the first pass, before the loop's back edge has contributed definitions 6 and 7, each 1111111 above is still 1111100.)

  29. Formalizing Reaching Definitions • Each basic block has • IN - set of definitions that reach beginning of block • OUT - set of definitions that reach end of block • GEN - set of definitions generated in block • KILL - set of definitions killed in block • GEN[s = s + a*b; i = i + 1;] = 0000011 • KILL[s = s + a*b; i = i + 1;] = 1010000 • Compiler scans each basic block to derive GEN and KILL sets
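
A sketch of the forward computation on the seven-definition example, using Python sets of definition numbers in place of bit vectors. The block names B1 to B6, the predecessor table, and the function names are assumptions made for this illustration; the update rule used is OUT = GEN ∪ (IN − KILL), with IN taken as the union of OUT over all predecessors.

```python
# Definition number -> variable it defines.
DEFS = {1: "s", 2: "a", 3: "i", 4: "b", 5: "b", 6: "s", 7: "i"}

# Definitions in each block. B4 is the "i < n" test and B6 is "return s".
BLOCK_DEFS = {"B1": [1, 2, 3], "B2": [4], "B3": [5], "B4": [], "B5": [6, 7], "B6": []}
PREDS = {"B1": [], "B2": ["B1"], "B3": ["B1"], "B4": ["B2", "B3", "B5"],
         "B5": ["B4"], "B6": ["B4"]}

def gen_kill(block_defs, defs):
    """GEN: definitions in the block. KILL: other definitions of the same variables.

    Assumes each variable is defined at most once per block, as in this example.
    """
    gen, kill = {}, {}
    for block, numbers in block_defs.items():
        gen[block] = set(numbers)
        variables = {defs[d] for d in numbers}
        kill[block] = {d for d, var in defs.items() if var in variables} - gen[block]
    return gen, kill

def reaching_definitions(preds, gen, kill):
    """Iterate the dataflow equations until no OUT set changes."""
    reach_in = {b: set() for b in preds}
    reach_out = {b: set() for b in preds}
    changed = True
    while changed:
        changed = False
        for block in preds:
            reach_in[block] = set().union(*(reach_out[p] for p in preds[block]))
            new_out = gen[block] | (reach_in[block] - kill[block])
            if new_out != reach_out[block]:
                reach_out[block] = new_out
                changed = True
    return reach_in, reach_out

def bits(definitions):
    return "".join("1" if d in definitions else "0" for d in range(1, 8))

GEN, KILL = gen_kill(BLOCK_DEFS, DEFS)
REACH_IN, REACH_OUT = reaching_definitions(PREDS, GEN, KILL)
print(bits(GEN["B5"]), bits(KILL["B5"]))
for block in PREDS:
    print(block, bits(REACH_IN[block]), bits(REACH_OUT[block]))
```

For the loop body B5 this reproduces GEN = 0000011 and KILL = 1010000 from the slide.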

  30. Example

  31. Forwards vs. backwards • A forwards analysis is one that for each program point computes information about the past behavior. • Examples of this are available expressions and reaching definitions. • Calculation: predecessors of CFG nodes. • A backwards analysis is one that for each program point computes information about the future behavior. • Examples of this are liveness and very busy expressions. • Calculation: successors of CFG nodes.

  32. May vs. Must • A may analysis is one that describes information that may possibly be true and, thus, computes an upper approximation. • Examples of this are liveness and reaching definitions. • Calculation: union operator. • A must analysis is one that describes information that must definitely be true and, thus, computes a lower approximation. • Examples of this are available expressions and very busy expressions. • Calculation: intersection operator.

  33. Static Code Analysis - Concepts • Basic Blocks • Control Flow Graph • Dataflow Analysis • Live Variable Analysis • Reaching Definition Analysis • Lattice Theory

  34. Basic Idea • Information about program represented using values from algebraic structure called lattice • Analysis produces lattice value for each program point • Two flavors of analysis • Forward dataflow analysis • Backward dataflow analysis

  35. Partial Orders • Set $P$ • Partial order $\le$ such that $\forall x, y, z \in P$: • $x \le x$ (reflexive) • $x \le y$ and $y \le x$ implies $x = y$ (antisymmetric) • $x \le y$ and $y \le z$ implies $x \le z$ (transitive) • Can use partial order to define • Upper and lower bounds • Least upper bound • Greatest lower bound

  36. Upper Bounds • If $S \subseteq P$ then • $x \in P$ is an upper bound of $S$ if $\forall y \in S.\ y \le x$ • $x \in P$ is the least upper bound of $S$ if • $x$ is an upper bound of $S$, and • $x \le y$ for all upper bounds $y$ of $S$ • $\vee$ - join, least upper bound (lub), supremum, sup • $\bigvee S$ is the least upper bound of $S$ • $x \vee y$ is the least upper bound of $\{x, y\}$

  37. Lower Bounds • If $S \subseteq P$ then • $x \in P$ is a lower bound of $S$ if $\forall y \in S.\ x \le y$ • $x \in P$ is the greatest lower bound of $S$ if • $x$ is a lower bound of $S$, and • $y \le x$ for all lower bounds $y$ of $S$ • $\wedge$ - meet, greatest lower bound (glb), infimum, inf • $\bigwedge S$ is the greatest lower bound of $S$ • $x \wedge y$ is the greatest lower bound of $\{x, y\}$

  38. Covering • $x < y$ if $x \le y$ and $x \ne y$ • $x$ is covered by $y$ ($y$ covers $x$) if • $x < y$, and • $x \le z < y$ implies $x = z$ • Conceptually, $y$ covers $x$ if there are no elements between $x$ and $y$

  39. Example • $P = \{000, 001, 010, 011, 100, 101, 110, 111\}$ (standard Boolean lattice, also called hypercube) • $x \le y$ if (x bitwise and y) = x • Hasse Diagram • If y covers x • Line from y to x • y above x in diagram
              111
        011   110   101
        010   001   100
              000
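
The Boolean lattice example is small enough to check exhaustively. In this sketch the bit vectors are represented as the integers 0 to 7; the function names `leq`, `join`, `meet`, `covers`, `is_lub` and `is_glb` are assumptions for illustration. Taking bitwise OR as join and bitwise AND as meet is also an assumption here, and the code verifies it against the definitions of least upper bound and greatest lower bound.

```python
P = range(8)  # the eight bit vectors 000 .. 111 as integers

def leq(x, y):
    """x <= y iff (x bitwise and y) = x."""
    return (x & y) == x

def join(x, y):
    return x | y

def meet(x, y):
    return x & y

def covers(y, x):
    """True if y covers x: x < y with no element strictly between them."""
    if x == y or not leq(x, y):
        return False
    return not any(z != x and z != y and leq(x, z) and leq(z, y) for z in P)

def is_lub(candidate, x, y):
    uppers = [u for u in P if leq(x, u) and leq(y, u)]
    return candidate in uppers and all(leq(candidate, u) for u in uppers)

def is_glb(candidate, x, y):
    lowers = [l for l in P if leq(l, x) and leq(l, y)]
    return candidate in lowers and all(leq(l, candidate) for l in lowers)

# Lines of the Hasse diagram, written as (upper element, lower element).
HASSE = [(format(y, "03b"), format(x, "03b")) for y in P for x in P if covers(y, x)]
print(len(HASSE))
print(all(is_lub(join(x, y), x, y) and is_glb(meet(x, y), x, y) for x in P for y in P))
```

The Hasse diagram has 12 lines, one for each edge of the cube, and each line connects two vectors that differ in exactly one bit.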

  40. Lattices • If $x \wedge y$ and $x \vee y$ exist for all $x, y \in P$, then $P$ is a lattice. • If $\bigwedge S$ and $\bigvee S$ exist for all $S \subseteq P$, then $P$ is a complete lattice. • All finite lattices are complete

  41. Lattices • If $x \wedge y$ and $x \vee y$ exist for all $x, y \in P$, then $P$ is a lattice. • If $\bigwedge S$ and $\bigvee S$ exist for all $S \subseteq P$, then $P$ is a complete lattice. • All finite lattices are complete • Example of a lattice that is not complete • Integers $I$ • For any $x, y \in I$, $x \vee y = \max(x, y)$, $x \wedge y = \min(x, y)$ • But $\bigvee I$ and $\bigwedge I$ do not exist • $I \cup \{+\infty, -\infty\}$ is a complete lattice

  42. Lattice Examples • Lattices • Non-lattices

  43. Semi-Lattice • Only one of the two binary operations (meet or join) exists • Meet-semilattice: $x \wedge y$ exists for all $x, y \in P$ • Join-semilattice: $x \vee y$ exists for all $x, y \in P$

  44. Monotonic Function & Fixed point • Let $L$ be a lattice. A function $f : L \to L$ is monotonic if $\forall x, y \in L : x \le y \Rightarrow f(x) \le f(y)$ • Let $A$ be a set, $f : A \to A$ a function, $a \in A$. If $f(a) = a$, then $a$ is called a fixed point of $f$ on $A$

  45. Existence of Fixed Points • The height of a lattice is defined to be the length of the longest path from $\bot$ to $\top$ • In a complete lattice $L$ with finite height, every monotonic function $f : L \to L$ has a unique least fixed point: $\mathrm{fix}(f) = \bigvee_{i \ge 0} f^i(\bot)$

  46. Knaster-Tarski Fixed Point Theorem • Suppose $(L, \le)$ is a complete lattice and $f : L \to L$ is a monotonic function. • Then the least fixed point $m$ of $f$ can be defined as $m = \bigwedge \{ x \in L \mid f(x) \le x \}$

  47. Calculating Fixed Point • The time complexity of computing a fixed point depends on three factors: • The height of the lattice, since this provides a bound for $i$, the number of times $f$ has to be applied; • The cost of computing $f$; • The cost of testing equality. • The computation of a fixed point can be illustrated as a walk up the lattice starting at $\bot$: $\bot \le f(\bot) \le f^2(\bot) \le \dots$ until $f^{i+1}(\bot) = f^i(\bot)$
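
The walk up the lattice can be sketched as a loop that applies f starting from the bottom element until the value stops changing. The lattice here is the set of subsets of {1, ..., 5} ordered by inclusion, with the empty set as bottom; the function `grow` is a hypothetical monotone function chosen only for this illustration.

```python
def least_fixed_point(f, bottom):
    """Apply f repeatedly from bottom; return (fixed point, number of upward steps)."""
    current, steps = bottom, 0
    while True:
        following = f(current)
        if following == current:      # the equality test ends the walk
            return current, steps
        current, steps = following, steps + 1

def grow(s):
    """Hypothetical monotone function on subsets of {1, ..., 5}."""
    return frozenset({1}) | frozenset(x + 1 for x in s if x < 5)

FIXED, STEPS = least_fixed_point(grow, frozenset())
print(sorted(FIXED), STEPS)
```

The walk visits {}, {1}, {1, 2}, and so on up to {1, 2, 3, 4, 5}, taking 5 upward steps, which matches the height of this lattice. All three cost factors from the slide show up in the loop: the number of iterations, the call to `f`, and the equality comparison.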

  48. Application to Dataflow Analysis • Dataflow information will be lattice values • Transfer functions operate on lattice values • Solution algorithm will generate an increasing sequence of values at each program point • Ascending chain condition will ensure termination • Will use $\vee$ to combine values at control-flow join points

  49. Transfer Functions • Transfer function $f : P \to P$ for each node in the control flow graph • $f$ models the effect of the node on the program information

  50. Transfer Functions • Each dataflow analysis problem has a set $F$ of transfer functions $f : P \to P$ • Identity function $i \in F$ • $F$ must be closed under composition: $\forall f, g \in F$, the function $h = \lambda x.\, f(g(x)) \in F$ • Each $f \in F$ must be monotone: $x \le y$ implies $f(x) \le f(y)$ • Sometimes all $f \in F$ are distributive: $f(x \vee y) = f(x) \vee f(y)$ • Distributivity implies monotonicity
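
These properties can be checked by brute force on a small lattice. The sketch below uses the subsets of {1, 2, 3} ordered by inclusion, with union as join. It assumes gen/kill style transfer functions of the form f(x) = gen ∪ (x − kill), as in the reaching definitions slides; `make_transfer`, `saturate` and the specific gen and kill sets are hypothetical choices for illustration.

```python
from itertools import combinations

UNIVERSE = (1, 2, 3)
ELEMENTS = [frozenset(c) for r in range(len(UNIVERSE) + 1)
            for c in combinations(UNIVERSE, r)]

def make_transfer(gen, kill):
    """Gen/kill transfer function f(x) = gen U (x - kill)."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda x: gen | (x - kill)

def is_monotone(f):
    # for frozensets, "<=" is the subset test, i.e. the lattice order
    return all(f(x) <= f(y) for x in ELEMENTS for y in ELEMENTS if x <= y)

def is_distributive(f):
    return all(f(x | y) == f(x) | f(y) for x in ELEMENTS for y in ELEMENTS)

f1 = make_transfer(gen={1}, kill={2})
f2 = make_transfer(gen={3}, kill={1})
composed = lambda x: f1(f2(x))    # h = lambda x. f(g(x))

def saturate(x):
    """Hypothetical function that is monotone but not distributive."""
    return frozenset(UNIVERSE) if len(x) >= 2 else x

print(is_monotone(f1), is_distributive(f1))
print(is_monotone(composed), is_distributive(composed))
print(is_monotone(saturate), is_distributive(saturate))
```

The gen/kill functions and their composition pass both checks. `saturate` is monotone but fails distributivity, for example on {1} and {2}, which shows that the converse of the last bullet does not hold.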
