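Static Code Analysis. 梁广泰 (Liang Guangtai), 2011-05-25. Outline: motivation; static program analysis (concepts + examples); program defect analysis (research work). Motivation: on a cloud platform, applications are deployed directly on the cloud servers, which creates security risks: an application can directly manipulate and damage the server file system, and any security vulnerability gives attackers a way in. Resources are shared and dynamically allocated, so a single poorly performing application takes resources away from other applications. One solution is to run static code analysis on an application before deploying it: Does it make prohibited calls (illegal file access)? Does it contain inefficient code (heavy String concatenation without StringBuilder)? Does it contain security vulnerabilities (SQL injection, cross-site scripting, denial of service)?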
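Static Code Analysis
梁广泰 (Liang Guangtai), 2011-05-25

Outline
• Motivation
• Static program analysis (concepts + examples)
• Program defect analysis (research work)

Motivation
• Characteristics of cloud platforms
  • Applications are deployed directly on the cloud servers, which creates security risks
    • An application can directly manipulate and damage the server file system
    • A security vulnerability gives attackers a way in
  • Resources are shared and dynamically allocated
    • A single poorly performing application takes resources away from other applications
• One solution:
  • Run static code analysis on an application before deploying it:
    • Does it make prohibited calls? (illegal file access)
    • Does it contain inefficient code? (heavy String concatenation without StringBuilder)
    • Does it contain security vulnerabilities? (SQL injection, cross-site scripting, denial of service)
    • Does it contain malicious code such as viruses?
    • ……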
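Static Code Analysis
• Definition:
  • Static program analysis (static analysis for short) is the technique of analyzing a program without executing it.
• Compared with:
  • Dynamic program analysis: requires actually executing the program
  • Program comprehension: the term "static analysis" usually refers to analysis performed by automated tools, whereas manual analysis is usually called program comprehension
• Uses:
  • Program translation/compilation (compilers), program optimization and refactoring, software defect detection, etc.
• Process:
  • In most cases the input to static analysis is source code or intermediate code (such as Java bytecode); object code is used only rarely. The analysis results are output in a specific form.

Static Code Analysis
• Basic Blocks
• Control Flow Graph
• Dataflow Analysis
  • Live Variable Analysis
  • Reaching Definition Analysis
• Lattice Theory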
Basic Blocks
• A basic block is a maximal sequence of consecutive three-address instructions with the following properties:
  • The flow of control can only enter the basic block through the first instruction.
  • Control leaves the block without halting or branching, except possibly at the last instruction.
• Basic blocks become the nodes of a flow graph, with edges indicating the order of execution.
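Basic Block Example (an arrow marks each leader; A to F are the resulting basic blocks)
    (1)  i = 1                    <- leader   block A
    (2)  j = 1                    <- leader   block B
    (3)  t1 = 10 * i              <- leader   block C
    (4)  t2 = t1 + j
    (5)  t3 = 8 * t2
    (6)  t4 = t3 - 88
    (7)  a[t4] = 0.0
    (8)  j = j + 1
    (9)  if j <= 10 goto (3)
    (10) i = i + 1                <- leader   block D
    (11) if i <= 10 goto (2)
    (12) i = 1                    <- leader   block E
    (13) t5 = i - 1               <- leader   block F
    (14) t6 = 88 * t5
    (15) a[t6] = 1.0
    (16) i = i + 1
    (17) if i <= 10 goto (13)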
Control-Flow Graphs
• Control-flow graph:
  • Node: an instruction or a sequence of instructions (a basic block)
    • Two instructions i, j are in the same basic block iff execution of i guarantees execution of j
  • Directed edge: potential flow of control
  • Distinguished Entry and Exit nodes
    • The first and last instruction in the program

Control-Flow Edges
• Basic blocks = nodes
• Edges:
  • Add a directed edge from B1 to B2 if:
    • There is a branch from the last statement of B1 to the first statement of B2 (B2 is a leader), or
    • B2 immediately follows B1 in program order and B1 does not end with an unconditional branch (goto)
• Definition of predecessor and successor (for an edge from B1 to B2)
  • B1 is a predecessor of B2
  • B2 is a successor of B1
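The leader and edge rules above can be tried out on the 17-instruction example. The following is a minimal sketch, not the slides' own code: the string encoding of the instructions and the helper names `jump_target`, `find_leaders`, `build_blocks`, and `build_edges` are assumptions made for illustration. Blocks are numbered 0 to 5 in program order, which corresponds to the labels A to F.

```python
import re

# The 17 three-address instructions of the example, kept 1-indexed via enumerate.
CODE = [
    "i = 1", "j = 1", "t1 = 10 * i", "t2 = t1 + j", "t3 = 8 * t2",
    "t4 = t3 - 88", "a[t4] = 0.0", "j = j + 1", "if j <= 10 goto (3)",
    "i = i + 1", "if i <= 10 goto (2)", "i = 1", "t5 = i - 1",
    "t6 = 88 * t5", "a[t6] = 1.0", "i = i + 1", "if i <= 10 goto (13)",
]

def jump_target(instr):
    """Return the 1-based target of a 'goto (k)' instruction, or None."""
    m = re.search(r"goto \((\d+)\)", instr)
    return int(m.group(1)) if m else None

def find_leaders(code):
    leaders = {1}                       # the first instruction is a leader
    for idx, instr in enumerate(code, start=1):
        target = jump_target(instr)
        if target is not None:
            leaders.add(target)         # the target of a jump is a leader
            if idx < len(code):
                leaders.add(idx + 1)    # so is the instruction after a jump
    return sorted(leaders)

def build_blocks(code):
    """Return inclusive (first, last) instruction ranges, one per basic block."""
    leaders = find_leaders(code)
    ends = [nxt - 1 for nxt in leaders[1:]] + [len(code)]
    return list(zip(leaders, ends))

def build_edges(code, blocks):
    """Apply the two edge rules: branch to a leader, or fall through."""
    start_to_block = {first: b for b, (first, _) in enumerate(blocks)}
    edges = set()
    for b, (_, last) in enumerate(blocks):
        instr = code[last - 1]
        target = jump_target(instr)
        if target is not None:
            edges.add((b, start_to_block[target]))
        unconditional = instr.startswith("goto")
        if not unconditional and b + 1 < len(blocks):
            edges.add((b, b + 1))
    return sorted(edges)

blocks = build_blocks(CODE)
edges = build_edges(CODE, blocks)
print(blocks)
print(edges)
```

The sketch treats every instruction that starts with `goto` as an unconditional branch; the example contains only conditional jumps, so every block except the last also falls through to its successor in program order.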
Dataflow Analysis
• Compile-time reasoning about
• run-time values of variables or expressions
• at different program points
  • Which assignment statements produced the value of a variable at this point?
  • Which variables contain values that are no longer used after this program point?
  • What is the range of possible values of a variable at this program point?
  • ……

Program Points
• One program point before each node
• One program point after each node
• Join point: a point with multiple predecessors
• Split point: a point with multiple successors

Live Variable Analysis
• A variable v is live at point p if
  • v is used along some path starting at p, and
  • there is no definition of v along that path before the use.
• When is a variable v dead at point p?
  • There is no use of v on any path from p to the exit node, or
  • all paths from p redefine v before using v.

What Use is Liveness Information?
• Register allocation.
  • If a variable is dead, its register can be reassigned
• Dead code elimination.
  • Eliminate assignments to variables that are not read later.
  • But do not eliminate the last assignment to a variable (such as an instance variable) that is visible outside the CFG.
  • Other dead assignments can be eliminated.
  • Handle this by making all externally visible variables live on exit from the CFG

Conceptual Idea of Analysis
• Start from the exit and go backwards in the CFG
• Compute liveness information from the end to the beginning of basic blocks
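Liveness Example
• Assume a, b, c are visible outside the method
  • So they are live on exit
• Assume x, y, z, t are not visible
• Represent liveness using a bit vector
  • The order is abcxyzt

    Block                                live at entry   live at exit
    a = x+y; t = a; c = a+x; x == 0      0101110         1100111
    b = t+z;                             1000111         1100100
    c = y+1;                             1100100         1110000

• Control flow: after the test x == 0, one branch executes b = t+z and then continues to c = y+1; the other branch goes directly to c = y+1.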
Formalizing Analysis
• Each basic block has
  • IN: the set of variables live at the start of the block
  • OUT: the set of variables live at the end of the block
  • USE: the set of variables with upwards exposed uses in the block (use prior to definition)
  • DEF: the set of variables defined in the block prior to use
• USE[x = z; x = x+1;] = { z } (x is not in USE)
• DEF[x = z; x = x+1; y = 1;] = { x, y }
• The compiler scans each basic block to derive its USE and DEF sets
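A single forward scan over a block is enough to derive USE and DEF as defined above. The sketch below is illustrative only: the `(target, operands)` encoding of an assignment and the function name `use_def` are assumptions, not part of the slides.

```python
def use_def(block):
    """Compute (USE, DEF) for one basic block.

    `block` is a list of (target, operands) pairs, one per assignment,
    in program order. This encoding is an assumption made for the sketch.
    """
    use, defs = set(), set()
    for target, operands in block:
        for var in operands:
            if var not in defs:
                use.add(var)       # read before any definition in the block
        if target not in use:
            defs.add(target)       # defined before any use in the block
    return use, defs

# x = z; x = x+1;
first = use_def([("x", ["z"]), ("x", ["x"])])
# x = z; x = x+1; y = 1;
second = use_def([("x", ["z"]), ("x", ["x"]), ("y", [])])
print(first, second)
```

On the two blocks from the slide this yields USE = { z } for the first and DEF = { x, y } for the second.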
Algorithm
    for all nodes n in N - { Exit }
        IN[n] = emptyset;
    OUT[Exit] = emptyset;
    IN[Exit] = USE[Exit];
    Changed = N - { Exit };
    while (Changed != emptyset)
        choose a node n in Changed;
        Changed = Changed - { n };
        OUT[n] = emptyset;
        for all nodes s in successors(n)
            OUT[n] = OUT[n] U IN[s];
        IN[n] = USE[n] U (OUT[n] - DEF[n]);
        if (IN[n] changed)
            for all nodes p in predecessors(n)
                Changed = Changed U { p };
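The worklist algorithm translates almost line for line into Python. This is a sketch under stated assumptions: the block names B1 to B3, the dictionary encoding of the CFG, and the helper `bits` are hypothetical, and the CFG is the liveness example from the earlier slide (B1 branches to B2 and B3, B2 continues to B3, and a, b, c are live on exit).

```python
def liveness(nodes, succ, use, defs, exit_node="Exit"):
    """Backward worklist liveness analysis; returns (IN, OUT) as dicts of sets."""
    pred = {n: [] for n in nodes}
    for n in nodes:
        for s in succ[n]:
            pred[s].append(n)

    live_in = {n: set() for n in nodes}
    live_out = {n: set() for n in nodes}
    live_in[exit_node] = set(use[exit_node])
    worklist = [n for n in nodes if n != exit_node]

    while worklist:
        n = worklist.pop()
        # OUT[n] is the union of IN[s] over all successors s of n.
        live_out[n] = set().union(*(live_in[s] for s in succ[n]))
        new_in = use[n] | (live_out[n] - defs[n])
        if new_in != live_in[n]:
            live_in[n] = new_in
            for p in pred[n]:
                if p not in worklist:
                    worklist.append(p)
    return live_in, live_out

def bits(variables, order="abcxyzt"):
    """Render a set of variables as a bit vector in the order abcxyzt."""
    return "".join("1" if v in variables else "0" for v in order)

NODES = ["B1", "B2", "B3", "Exit"]
SUCC = {"B1": ["B2", "B3"], "B2": ["B3"], "B3": ["Exit"], "Exit": []}
USE = {"B1": {"x", "y"}, "B2": {"t", "z"}, "B3": {"y"}, "Exit": {"a", "b", "c"}}
DEFS = {"B1": {"a", "t", "c"}, "B2": {"b"}, "B3": {"c"}, "Exit": set()}

IN, OUT = liveness(NODES, SUCC, USE, DEFS)
for block in ["B1", "B2", "B3"]:
    print(block, bits(IN[block]), bits(OUT[block]))
```

Externally visible variables are modelled here by putting them in the USE set of the Exit node, which is one way to make them live on exit from the CFG.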
Reaching Definitions
• Concept of definition and use
  • a = x+y is a definition of a and a use of x and y
• A definition reaches a use if the value written by the definition may be read by the use
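Reaching Definitions (example control-flow graph)
    s = 0; a = 4; i = 0;
    test k == 0
        one branch:    b = 1;
        other branch:  b = 2;
    loop test i < n
        loop body:  s = s + a*b; i = i + 1;   (then back to the test i < n)
        loop exit:  return s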
Reaching Definitions and Constant Propagation
• Is a use of a variable a constant?
  • Check all reaching definitions
  • If all of them assign the variable the same constant
  • Then the use is in fact a constant
• The variable can then be replaced with the constant
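Is a Constant in s = s + a*b?
• Same control-flow graph: s = 0; a = 4; i = 0; then k == 0 selects b = 1 or b = 2; loop test i < n; loop body s = s + a*b; i = i + 1; loop exit return s
• Yes! On all reaching definitions a = 4

Constant Propagation Transform
• Because a = 4 on all reaching definitions, the loop body becomes s = s + 4*b; i = i + 1;
• The rest of the control-flow graph is unchanged: s = 0; a = 4; i = 0; then k == 0 selects b = 1 or b = 2; loop test i < n; loop exit return s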
Computing Reaching Definitions
• Compute with sets of definitions
  • Represent sets using bit vectors
  • Each definition has a position in the bit vector
• At each basic block, compute
  • the definitions that reach the start of the block
  • the definitions that reach the end of the block
• Do the computation by simulating execution of the program until a fixed point is reached
Reaching Definitions Example (bit positions 1 2 3 4 5 6 7 correspond to definitions 1 to 7; where two values were shown, the value in parentheses is the one from the first pass)

    Block                                     reaching start       reaching end
    1: s = 0; 2: a = 4; 3: i = 0; k == 0      0000000              1110000
    4: b = 1;                                 1110000              1111000
    5: b = 2;                                 1110000              1110100
    i < n                                     1111111 (1111100)    1111111 (1111100)
    6: s = s + a*b; 7: i = i + 1;             1111111 (1111100)    0101111
    return s                                  1111111 (1111100)    1111111 (1111100)
Formalizing Reaching Definitions
• Each basic block has
  • IN: the set of definitions that reach the beginning of the block
  • OUT: the set of definitions that reach the end of the block
  • GEN: the set of definitions generated in the block
  • KILL: the set of definitions killed in the block
• GEN[s = s + a*b; i = i + 1;] = 0000011
• KILL[s = s + a*b; i = i + 1;] = 1010000
• The compiler scans each basic block to derive its GEN and KILL sets
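With GEN and KILL in hand, reaching definitions can be computed by iterating OUT = GEN U (IN - KILL) until nothing changes. The sketch below runs on the seven-definition example; the block names B1 to B6, the tables `DEF_VAR`, `BLOCK_DEFS`, and `PRED`, and the simple round-robin iteration order are assumptions made for illustration rather than the slides' own implementation.

```python
# Which variable each numbered definition assigns (from the example).
DEF_VAR = {1: "s", 2: "a", 3: "i", 4: "b", 5: "b", 6: "s", 7: "i"}
# Definitions contained in each block; B4 is the test i < n, B6 is return s.
BLOCK_DEFS = {"B1": [1, 2, 3], "B2": [4], "B3": [5], "B4": [], "B5": [6, 7], "B6": []}
PRED = {"B1": [], "B2": ["B1"], "B3": ["B1"],
        "B4": ["B2", "B3", "B5"], "B5": ["B4"], "B6": ["B4"]}

def gen_kill(defs_in_block):
    """GEN: last definition of each variable in the block.
    KILL: every other definition of those variables in the program."""
    last = {}
    for d in defs_in_block:
        last[DEF_VAR[d]] = d
    gen = set(last.values())
    kill = {d for d, var in DEF_VAR.items() if var in last and d not in gen}
    return gen, kill

def reaching_definitions():
    gen, kill = {}, {}
    for block, ds in BLOCK_DEFS.items():
        gen[block], kill[block] = gen_kill(ds)
    reach_in = {b: set() for b in BLOCK_DEFS}
    reach_out = {b: set(gen[b]) for b in BLOCK_DEFS}
    changed = True
    while changed:                       # iterate until a fixed point is reached
        changed = False
        for b in BLOCK_DEFS:
            # Forward, may analysis: union over predecessors.
            reach_in[b] = set().union(*(reach_out[p] for p in PRED[b]))
            new_out = gen[b] | (reach_in[b] - kill[b])
            if new_out != reach_out[b]:
                reach_out[b] = new_out
                changed = True
    return reach_in, reach_out, gen, kill

def bits(defs):
    return "".join("1" if d in defs else "0" for d in range(1, 8))

IN, OUT, GEN, KILL = reaching_definitions()
print(bits(GEN["B5"]), bits(KILL["B5"]))
print(bits(IN["B4"]), bits(OUT["B5"]))

# Constant propagation check: which definitions of `a` reach the loop body?
a_defs = {d for d in IN["B5"] if DEF_VAR[d] == "a"}
print(a_defs)
```

Only definition 2 (a = 4) reaches the use of a in the loop body, which is the condition constant propagation needs before replacing a with 4.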
Forwards vs. Backwards
• A forwards analysis is one that, for each program point, computes information about past behavior.
  • Examples are available expressions and reaching definitions.
  • Calculation: the information at a CFG node is computed from the predecessors of that node.
• A backwards analysis is one that, for each program point, computes information about future behavior.
  • Examples are liveness and very busy expressions.
  • Calculation: the information at a CFG node is computed from the successors of that node.

May vs. Must
• A may analysis describes information that may possibly be true and thus computes an upper approximation.
  • Examples are liveness and reaching definitions.
  • Calculation: values are combined with the union operator.
• A must analysis describes information that must definitely be true and thus computes a lower approximation.
  • Examples are available expressions and very busy expressions.
  • Calculation: values are combined with the intersection operator.
Basic Idea
• Information about the program is represented using values from an algebraic structure called a lattice
• The analysis produces a lattice value for each program point
• Two flavors of analysis
  • Forward dataflow analysis
  • Backward dataflow analysis
Partial Orders
• Set P
• Partial order ≤ such that ∀x, y, z ∈ P
  • x ≤ x (reflexive)
  • x ≤ y and y ≤ x implies x = y (antisymmetric)
  • x ≤ y and y ≤ z implies x ≤ z (transitive)
• Can use the partial order to define
  • Upper and lower bounds
  • Least upper bound
  • Greatest lower bound
Upper Bounds
• If S ⊆ P then
  • x ∈ P is an upper bound of S if ∀y ∈ S. y ≤ x
  • x ∈ P is the least upper bound of S if
    • x is an upper bound of S, and
    • x ≤ y for all upper bounds y of S
• ∨: join, least upper bound (lub), supremum, sup
  • ∨S is the least upper bound of S
  • x ∨ y is the least upper bound of {x, y}
Lower Bounds
• If S ⊆ P then
  • x ∈ P is a lower bound of S if ∀y ∈ S. x ≤ y
  • x ∈ P is the greatest lower bound of S if
    • x is a lower bound of S, and
    • y ≤ x for all lower bounds y of S
• ∧: meet, greatest lower bound (glb), infimum, inf
  • ∧S is the greatest lower bound of S
  • x ∧ y is the greatest lower bound of {x, y}
Covering
• x < y if x ≤ y and x ≠ y
• x is covered by y (y covers x) if
  • x < y, and
  • x ≤ z < y implies x = z
• Conceptually, y covers x if there are no elements between x and y
Example
• P = { 000, 001, 010, 011, 100, 101, 110, 111 } (standard Boolean lattice, also called hypercube)
• x ≤ y if (x bitwise and y) = x
• Hasse diagram
  • If y covers x
    • Line from y to x
    • y above x in the diagram
  • Levels of the diagram, top to bottom:
              111
        011   110   101
        010   001   100
              000
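The Boolean lattice example is small enough to enumerate. The sketch below encodes elements as 3-character bit strings and derives the order, join, meet, and covering relation from the definitions given on the preceding slides; the function names `leq`, `join`, `meet`, and `covers` are chosen for illustration.

```python
from itertools import product

# The eight elements 000 .. 111 as bit strings.
P = ["".join(b) for b in product("01", repeat=3)]

def leq(x, y):
    """x <= y iff (x bitwise-and y) == x."""
    return (int(x, 2) & int(y, 2)) == int(x, 2)

def join(x, y):
    """Least upper bound: bitwise or."""
    return format(int(x, 2) | int(y, 2), "03b")

def meet(x, y):
    """Greatest lower bound: bitwise and."""
    return format(int(x, 2) & int(y, 2), "03b")

def covers(y, x):
    """y covers x: x < y and no element lies strictly between them."""
    if x == y or not leq(x, y):
        return False
    return not any(z not in (x, y) and leq(x, z) and leq(z, y) for z in P)

# Every line of the Hasse diagram is a pair (x, y) where y covers x.
hasse_edges = sorted((x, y) for x in P for y in P if covers(y, x))
print(len(hasse_edges))
print(join("001", "100"), meet("011", "110"))
```

The covering pairs are exactly the pairs of bit strings that differ in one position, which is why the Hasse diagram is a cube.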
Lattices
• If x ∧ y and x ∨ y exist for all x, y ∈ P, then P is a lattice.
• If ∧S and ∨S exist for all S ⊆ P, then P is a complete lattice.
• All finite lattices are complete
• Example of a lattice that is not complete
  • Integers I
  • For any x, y ∈ I, x ∨ y = max(x, y), x ∧ y = min(x, y)
  • But ∨I and ∧I do not exist
  • I ∪ { +∞, −∞ } is a complete lattice
Lattice Examples
[Figure: Hasse diagrams in two groups, "Lattices" and "Non-lattices"]
Semi-Lattice
• Only one of the two binary operations (meet or join) exists
  • Meet-semilattice: x ∧ y exists for all x, y ∈ P
  • Join-semilattice: x ∨ y exists for all x, y ∈ P
Monotonic Function & Fixed Point
• Let L be a lattice. A function f : L → L is monotonic if ∀x, y ∈ L : x ≤ y ⇒ f(x) ≤ f(y)
• Let A be a set, f : A → A a function, and a ∈ A. If f(a) = a, then a is called a fixed point of f on A
Existence of Fixed Points
• The height of a lattice is defined to be the length of the longest path from ⊥ to ⊤
• In a complete lattice L with finite height, every monotonic function f : L → L has a unique least fixed point:
  fix(f) = ∨ { f^i(⊥) | i ≥ 0 }
Knaster-Tarski Fixed Point Theorem
• Suppose (L, ≤) is a complete lattice and f : L → L is a monotonic function.
• Then the least fixed point m of f can be defined as
  m = ∧ { x ∈ L | f(x) ≤ x }
Calculating Fixed Point
• The time complexity of computing a fixed point depends on three factors:
  • The height of the lattice, since this provides a bound for the number of iterations i;
  • The cost of computing f;
  • The cost of testing equality.
• The computation of a fixed point can be illustrated as a walk up the lattice starting at ⊥:
  ⊥ ≤ f(⊥) ≤ f(f(⊥)) ≤ … until two consecutive values are equal
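The walk up the lattice is a short loop: apply f repeatedly, starting at ⊥, until the value stops changing. The sketch below uses the powerset of {0, ..., 4} ordered by inclusion, so ⊥ is the empty set and the height is 5. The function `f` is a hypothetical monotonic function chosen only to make the walk visible, and `least_fixed_point` is an illustrative name.

```python
def least_fixed_point(f, bottom):
    """Iterate f from bottom until f(x) == x; return (fixed point, steps taken)."""
    x = bottom
    steps = 0
    while True:
        nxt = f(x)
        if nxt == x:        # equality test: one of the three cost factors
            return x, steps
        x = nxt
        steps += 1

LIMIT = 4

def f(s):
    """Monotonic on subsets of {0..LIMIT}: always contains 0, and n+1 for each n in s."""
    return frozenset({0}) | frozenset(n + 1 for n in s if n + 1 <= LIMIT)

result, steps = least_fixed_point(f, frozenset())
print(sorted(result), steps)
```

Here the walk visits {}, {0}, {0,1}, and so on, adding one element per step, so the number of steps equals the height of the lattice, which is the worst case the bound describes.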
Application to Dataflow Analysis
• Dataflow information will be lattice values
  • Transfer functions operate on lattice values
  • The solution algorithm will generate an increasing sequence of values at each program point
  • The ascending chain condition will ensure termination
• Will use ∨ to combine values at control-flow join points

Transfer Functions
• Transfer function f : P → P for each node in the control flow graph
• f models the effect of the node on the program information

Transfer Functions
Each dataflow analysis problem has a set F of transfer functions f : P → P
• Identity function i ∈ F
• F must be closed under composition: ∀f, g ∈ F. the function h = λx. f(g(x)) ∈ F
• Each f ∈ F must be monotone: x ≤ y implies f(x) ≤ f(y)
• Sometimes all f ∈ F are distributive: f(x ∨ y) = f(x) ∨ f(y)
• Distributivity implies monotonicity
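These properties can be checked exhaustively on a small lattice. The sketch below uses the powerset of {1, 2, 3} with union as join and transfer functions of the gen/kill form f(x) = gen U (x - kill), as in reaching definitions. The helper names and the function `saturate` are hypothetical; `saturate` is included only to show a monotone function that is not distributive.

```python
from itertools import combinations

UNIVERSE = frozenset({1, 2, 3})

def powerset(universe):
    items = sorted(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

POINTS = powerset(UNIVERSE)

def make_transfer(gen, kill):
    """Gen/kill transfer function: f(x) = gen U (x - kill)."""
    gen, kill = frozenset(gen), frozenset(kill)
    return lambda x: gen | (x - kill)

def compose(f, g):
    return lambda x: f(g(x))

def is_monotone(f):
    # For frozensets, <= is the subset test, i.e. the lattice order.
    return all(f(x) <= f(y) for x in POINTS for y in POINTS if x <= y)

def is_distributive(f):
    return all(f(x | y) == f(x) | f(y) for x in POINTS for y in POINTS)

f = make_transfer(gen={1}, kill={2})
g = make_transfer(gen={3}, kill={1})
h = compose(f, g)                       # again of gen/kill form

def saturate(x):
    """Hypothetical function: jumps to the top element once x has two elements."""
    return UNIVERSE if len(x) >= 2 else x

print(is_monotone(h), is_distributive(h))
print(is_monotone(saturate), is_distributive(saturate))
```

The gen/kill functions and their composition pass both checks, while `saturate` is monotone but not distributive, so the implication on the slide does not run in the other direction.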