420 likes | 430 Views
This article explores the elaboration and semantic analysis of programming languages, including concepts such as type-checking, context-sensitive analysis, and symbol tables. Examples and rules are provided to demonstrate these concepts.
E N D
Elaborationor: Semantic Analysis Compiler Baojian Hua bjhua@ustc.edu.cn
Front End lexical analyzer source code tokens abstract syntax tree parser semantic analyzer IR
Elaboration • Also known as type-checking, or semantic analysis • context-sensitive analysis • Checking the context-sensitive property of programs (AST): • every variable is declared before use • every expression has a proper type • function calls conform to definitions • all other possible context-sensitive info’ (highly language-dependent) • …
Elaboration Example // Sample C code: void f (int *p) { x += 4; p (23); “hello” + “world”; } int main () { f () + 5; break; } What errors can be detected here?
Conceptually Elaborator AST Intermediate Code Language Semantics
Semantics • Traditionally, semantics takes the form of natural language specification • e.g., for the “+” operator, both the left and right operands should be of “integer” type • refer to various specifications • But recent research has revealed that semantics can also be addressed via math • rigorous and clean
Semantics • Now let’s turn to Macqueen’s note… • How to implement these rules?
Language T-SLP // Let’s make the SLP typed: T-SLP P -> DS S DS -> T id; DS | T -> bool | int S -> S ; S | id := E | print (E) | printBool (E) E -> id | num | E+E | E&&E | true | false variable declarations followed by statements two types: “bool” and “int” print an “integer” value print a “bool” value both the two sub-expressions must be booleans
Symbol Tables • In order to keep track of the types and other infos’ we’d maintain a finite map of program symbols to info’ • symbols: variables, function names, etc. • Such a mapping is called a symbol table, or sometimes an environment • Notation: {x1: b1, x2: b2, …, xn: bn} • where bi (1≤i ≤n) is called a binding
Type System • Next, we write the symbol table as ∑ • ∑=T1 x1; T2 x2; T3 x3; … • a list of (T id) tuples • may be empty • Each rule takes the form of … ∑ P1: T1 ∑ Pn: Tn ∑ C : T
Type System: exp T id ∈ ∑ ∑ num: int ∑ id: T ∑ true: bool ∑ false: bool ∑ E1: int ∑ E2: int ∑ E1+E2: int ∑ E1: bool ∑ E2: bool ∑ E1&&E2: bool
Type System: stm ∑ id: T ∑ E: T ∑|- id:=E: OK ∑ E: int ∑ print(E): OK ∑ E: bool ∑ printBool(E): OK
Type System: dec, prog id ∈dom(∑) ∑; T id DS: ∑’ ∑ T id; DS : ∑’ ∑ : ∑ ∑ S: OK DS: ∑ DS S: OK
Example // Whether or not the following program is // well-typed? int x; int y; print (x+y); int x ∈ ∑ int y ∈ ∑ ∑ x: int ∑ y: int int x; int y : ∑ int x int y: ∑ ∑ x+y: int int x; int y: ∑ ∑ print(x+y): OK int x; int y; print(x+y): OK
Elaboration of Expressions T elab_exp (sigma, num) = return int ∑ num: int
Elaboration of Expressions T elab_exp (sigma, true) = return bool ∑ true: bool
Elaboration of Expressions T elab_exp (sigma, false) = return bool ∑ false: bool
Elaboration of Expressions T elab_exp (sigma, id) = T ty = Table_lookup (sigma, id); if (ty==NULL) error (“variable not declared”); return ty; T id ∈ ∑ ∑ id : T
∑ e1: int ∑ e2: int ∑ e1+e2: int Elaboration of Expressions T elab_exp (sigma, e1+e2) = type t1 = elab_exp (sigma, e1) type t2 = elab_exp (sigma, e2) switch (t1, t2){ case (Int, Int): return Int; case (Int, _): error (“e2 should be int”) case(_, Int): error (“e1 should be int”) default: error (“should both be int”) }
Elaboration of Expressions type elab_exp (sigma, e1&&e2) = type t1 = elab_exp (sigma, e1) type t2 = elab_exp (sigma, e2) switch (t1, t2){ case (Bool, Bool): return Bool; case (Bool, _): error(“e2 should be bool”) case(_, Bool): error(“e1 should be bool”) default: error (“should both be bool”) } ∑ e1: bool ∑ e2: bool ∑ e1&&e2: bool
Elaboration of Statements void elab_stm (sigma, x=e) = type t1 = elab_exp (sigma, x); type t2 = elab_exp (sigma, e); if (t1 != t2) error (“different types in assigment”); ∑ x: ty ∑ e: ty ∑ x:=e: OK
Elaboration of Statements void elab_stm (sigma, print(e)) = type ty = elab_exp (sigma, e) if (ty != INT) error (“type should be INT”); ∑ e: int ∑ print(e): OK
Elaboration of Statements void elab_stm (sigma, printBool(e)) = type ty = elab_exp (sigma, e) if (ty != BOOL) error (“type should be BOOL”); ∑ e: bool ∑ printBool(e): OK
Elaboration of Declarations Sigma elab_decs (sigma, decs) = if (decs==[]) return sigma; // decs = type ID; decs’ if (ID\in sigma) error (“duplicated decl”); new_sigma = enter_table (sigma, type ID) return elab_decs(new_sigma, decs’); ID ∈dom(∑) ∑; type ID decs: ∑’ ∑ type ID; decs: ∑’ ∑ : ∑
Elaboration of Programs void elab_prog (decs stm) = sigma = elab_decs (decs); elab_stm (sigma, stm) decs: ∑ ∑stm: OK ∑ decs stm: OK
Moral • There may be other information associated with identifiers, not just types, say: • Scope • Storage class • Access control info’ • … • All these details are handled by symbol tables (∑)!
Implementation • Must be efficient! • lots of variables, functions, etc • Two basic approaches: • Functional • symbol table is implemented as a functional data structure (e.g., red-black tree), with no tables ever destroyed or modified • Imperative • a single table, modified for every binding added or removed • This choice is largely independent of the implementation language
Functional Symbol Table • Basic idea: • when implementing σ2 = σ1 + {x:t} • creating a new table σ2, instead of modifyingσ1 • when deleting, restore to the old table • A good data structure for this is BST or red-black tree
BST Symbol Table ’ c: int c: int e: int a: char b: double
Possible Functional Interface signature SYMBOL_TABLE = sig type ‘a t type key val empty: ‘a t val insert: ‘a t * key * ‘a -> ‘a t val lookup: ‘a t * key -> ‘a option end
Imperative Symbol Tables • The imperative approach almost always involves the use of hash tables • Need to delete entries to revert to previous environment • made simpler because deletes follow a stack discipline • can maintain a stack of entered symbols, so that they can be later popped and removed from the hash table
Possible Imperative Interface signature SYMBOL_TABLE = sig type ‘a t type key val insert: ‘a t * key * ‘a -> unit val lookup: ‘a t * key -> ‘a option val delete: ‘a t * key -> unit val beginScope: unit -> unit val endScope: unit -> unit end
Implementation of Symbols • For several reasons, it will be useful at some point to represent symbols as elements of a small, densely packed set of identities • fast comparisons (equality) • for dataflow analysis, we will want sets of variables and fast set operations • It will be critically important to use bit strings to represent the sets • For example, your liveness analysis algorithm • More on this later
Scope • How to handle lexical scope? • Many choices: • One table + insert and remove bindings during elaboration, as we enters and leaves a local scope • Stack of tables + insertion and removal always operated on stack-top • dragon compiler makes use of this
One-table approach int x; σ={x:int} int f () σ1 = σ + {f:…} = {x:int, f:…} { if (4) { int x; σ2 = σ1 + {x:int} = {x:…, f:…, x:…} x = 6; } σ1 else { int x; σ4 = σ1 + {x:int} = {x:…, f:…, x:…} x = 5; } σ1 x = 8; } σ1 Shadowing: “+” is not commutative!
Name Space struct list { int x; struct list *list; } *list; void walk (struct list *list) { list: printf (“%d\n”, list->x); if (list = list->list) goto list; }
Name Space • It’s trivial to handle name space • one symbol table for each name space • Take C as an example: • Several different name spaces • labels • tags • variables • So …
Types • The representation of types is highly language-dependent • Some key considerations: • name vs. structural equivalence • mutually recursive type definitions • errors handling
Name vs. Structural Equivalence struct A { int i; } x; struct B { int i; } y; x = y; • In a language with structural equivalence, this program is legal • But not in a language with name equivalence (e.g., C) • For name equivalence, can generate a unique symbol for each defined type • For structural equivalence, need to recursively compare the types
Mutually recursive type definitions • To process recursive and mutually recursive type definitions, need a placeholder • in ML, an option ref • in C, a pointer • in Java, bind method (read Appel) struct A { int data; struct A *next; struct B *b; }; struct B {…};
Error Diagnostic • To recover from errors, it is useful to have an “any” type • makes it possible to continue more type-checking • In practice, use “int” or guess one • Similarly, a “void” type can be used for expressions that return no value • Source locations are annotated in AST!
Summary • Elaboration checks the context-sensitive properties of programs • must take care of semantics of source programs • and may translate into more low-level forms • Usually the most big (complex) part in a compiler!