Elaboration or: Semantic Analysis

Compiler
Baojian Hua
bjhua@ustc.edu.cn

Front End
source code → lexical analyzer → tokens → parser → abstract syntax tree → semantic analyzer → IR

Elaboration
• Also known as type checking or semantic analysis
  • context-sensitive analysis
• Checks the context-sensitive properties of programs (ASTs):
  • every variable is declared before use
  • every expression has a proper type
  • function calls conform to the function definitions
  • any other context-sensitive information (highly language-dependent)
  • …

Elaboration Example
    // Sample C code:
    void f (int *p)
    {
      x += 4;
      p (23);
      "hello" + "world";
    }
    int main ()
    {
      f () + 5;
      break;
    }
What errors can be detected here?

Conceptually
AST → Elaborator → Intermediate Code
(the Elaborator is guided by the Language Semantics)

Semantics
• Traditionally, semantics takes the form of a natural-language specification
  • e.g., for the “+” operator, both the left and right operands should be of “integer” type
  • refer to the various language specifications
• But recent research has shown that semantics can also be specified mathematically
  • rigorous and clean

Semantics
• Now let’s turn to MacQueen’s note…
• How do we implement these rules?

Symbol Tables
• To keep track of types and other information, we maintain a finite map from program symbols to that information
  • symbols: variables, function names, etc.
• Such a mapping is called a symbol table, or sometimes an environment
• Notation: {x1: b1, x2: b2, …, xn: bn}
  • where each bi (1 ≤ i ≤ n) is called a binding

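Type System
• Next, we write the symbol table as ∑ (\Sigma in the rules)
  • ∑ = ty x1; ty x2; ty x3; …
  • a list of (ty var) tuples
  • may be empty
• Each rule takes the form of
$$\frac{\Sigma \vdash P_1 : \mathit{ty} \qquad \cdots \qquad \Sigma \vdash P_n : \mathit{ty}}{\Sigma \vdash C : \mathit{ty}}$$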

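Type System: exp
$$\frac{}{\Sigma \vdash n : \mathtt{int}}$$
$$\frac{\mathit{ty}\ x \in \Sigma}{\Sigma \vdash x : \mathit{ty}}$$
$$\frac{}{\Sigma \vdash \mathtt{true} : \mathtt{bool}}$$
$$\frac{}{\Sigma \vdash \mathtt{false} : \mathtt{bool}}$$
$$\frac{\Sigma \vdash e_1 : \mathtt{int} \qquad \Sigma \vdash e_2 : \mathtt{int}}{\Sigma \vdash e_1 + e_2 : \mathtt{int}}$$
$$\frac{\Sigma \vdash e_1 : \mathtt{bool} \qquad \Sigma \vdash e_2 : \mathtt{bool}}{\Sigma \vdash e_1\ \&\&\ e_2 : \mathtt{bool}}$$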

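Type System: stm
$$\frac{\Sigma \vdash x : \mathit{ty} \qquad \Sigma \vdash e : \mathit{ty}}{\Sigma \vdash x := e : \mathtt{OK}}$$
$$\frac{\Sigma \vdash e : \mathtt{int}}{\Sigma \vdash \mathtt{print}(e) : \mathtt{OK}}$$
$$\frac{\Sigma \vdash e : \mathtt{bool}}{\Sigma \vdash \mathtt{printBool}(e) : \mathtt{OK}}$$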

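Type System: dec, prog
$$\frac{\mathit{id} \notin \mathrm{dom}(\Sigma) \qquad \Sigma;\ \mathit{type}\ \mathit{id} \vdash \mathit{decs} : \Sigma'}{\Sigma \vdash \mathit{type}\ \mathit{id};\ \mathit{decs} : \Sigma'}$$
$$\frac{}{\Sigma \vdash \epsilon : \Sigma}$$
$$\frac{\emptyset \vdash \mathit{decs} : \Sigma \qquad \Sigma \vdash \mathit{stm} : \mathtt{OK}}{\vdash \mathit{decs}\ \mathit{stm} : \mathtt{OK}}$$
(ε is the empty declaration list; ∅ is the empty symbol table)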

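Example
    // Is the following program well-typed?
    int x;
    int y;
    print (x+y);
Let ∑ = int x; int y (\Sigma in the derivations; ε is the empty declaration list, ∅ the empty symbol table).
Declarations:
$$\frac{x \notin \mathrm{dom}(\emptyset) \qquad \dfrac{y \notin \mathrm{dom}(\mathtt{int}\ x) \qquad \mathtt{int}\ x;\ \mathtt{int}\ y \vdash \epsilon : \Sigma}{\mathtt{int}\ x \vdash \mathtt{int}\ y : \Sigma}}{\emptyset \vdash \mathtt{int}\ x;\ \mathtt{int}\ y : \Sigma}$$
Statement:
$$\frac{\dfrac{\dfrac{\mathtt{int}\ x \in \Sigma}{\Sigma \vdash x : \mathtt{int}} \qquad \dfrac{\mathtt{int}\ y \in \Sigma}{\Sigma \vdash y : \mathtt{int}}}{\Sigma \vdash x + y : \mathtt{int}}}{\Sigma \vdash \mathtt{print}(x+y) : \mathtt{OK}}$$
Program:
$$\frac{\emptyset \vdash \mathtt{int}\ x;\ \mathtt{int}\ y : \Sigma \qquad \Sigma \vdash \mathtt{print}(x+y) : \mathtt{OK}}{\vdash \mathtt{int}\ x;\ \mathtt{int}\ y;\ \mathtt{print}(x+y) : \mathtt{OK}}$$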

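Elaboration of Expressions
    type elab_exp (sigma, n) =
      return Int;
$$\frac{}{\Sigma \vdash n : \mathtt{int}}$$

Elaboration of Expressions
    type elab_exp (sigma, true) =
      return Bool;
$$\frac{}{\Sigma \vdash \mathtt{true} : \mathtt{bool}}$$

Elaboration of Expressions
    type elab_exp (sigma, false) =
      return Bool;
$$\frac{}{\Sigma \vdash \mathtt{false} : \mathtt{bool}}$$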

Elaboration of Expressions
    type elab_exp (sigma, x) =
      type ty = Table_lookup (sigma, x);
      if (ty == NULL)
        error ("variable not declared");
      return ty;
$$\frac{\mathit{ty}\ x \in \Sigma}{\Sigma \vdash x : \mathit{ty}}$$

Elaboration of Expressions
    type elab_exp (sigma, e1+e2) =
      type t1 = elab_exp (sigma, e1);
      type t2 = elab_exp (sigma, e2);
      switch (t1, t2) {
        case (Int, Int): return Int;
        case (Int, _):   error ("e2 should be int");
        case (_, Int):   error ("e1 should be int");
        default:         error ("should both be int");
      }
$$\frac{\Sigma \vdash e_1 : \mathtt{int} \qquad \Sigma \vdash e_2 : \mathtt{int}}{\Sigma \vdash e_1 + e_2 : \mathtt{int}}$$

Elaboration of Expressions
    type elab_exp (sigma, e1&&e2) =
      type t1 = elab_exp (sigma, e1);
      type t2 = elab_exp (sigma, e2);
      switch (t1, t2) {
        case (Bool, Bool): return Bool;
        case (Bool, _):    error ("e2 should be bool");
        case (_, Bool):    error ("e1 should be bool");
        default:           error ("should both be bool");
      }
$$\frac{\Sigma \vdash e_1 : \mathtt{bool} \qquad \Sigma \vdash e_2 : \mathtt{bool}}{\Sigma \vdash e_1\ \&\&\ e_2 : \mathtt{bool}}$$

Elaboration of Statements
    void elab_stm (sigma, x := e) =
      type t1 = elab_exp (sigma, x);
      type t2 = elab_exp (sigma, e);
      if (t1 != t2)
        error ("different types in assignment");
$$\frac{\Sigma \vdash x : \mathit{ty} \qquad \Sigma \vdash e : \mathit{ty}}{\Sigma \vdash x := e : \mathtt{OK}}$$

Elaboration of Statements
    void elab_stm (sigma, print(e)) =
      type ty = elab_exp (sigma, e);
      if (ty != Int)
        error ("type should be int");
$$\frac{\Sigma \vdash e : \mathtt{int}}{\Sigma \vdash \mathtt{print}(e) : \mathtt{OK}}$$

Elaboration of Statements
    void elab_stm (sigma, printBool(e)) =
      type ty = elab_exp (sigma, e);
      if (ty != Bool)
        error ("type should be bool");
$$\frac{\Sigma \vdash e : \mathtt{bool}}{\Sigma \vdash \mathtt{printBool}(e) : \mathtt{OK}}$$

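Elaboration of Declarations
    Sigma elab_decs (sigma, decs) =
      if (decs == [])
        return sigma;
      // otherwise decs = type ID; decs'
      if (ID ∈ dom(sigma))
        error ("duplicated decl");
      new_sigma = enter_table (sigma, type ID);
      return elab_decs (new_sigma, decs');
$$\frac{\mathit{ID} \notin \mathrm{dom}(\Sigma) \qquad \Sigma;\ \mathit{type}\ \mathit{ID} \vdash \mathit{decs} : \Sigma'}{\Sigma \vdash \mathit{type}\ \mathit{ID};\ \mathit{decs} : \Sigma'}$$
$$\frac{}{\Sigma \vdash \epsilon : \Sigma}$$
(ε is the empty declaration list)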

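Elaboration of Programs
    void elab_prog (decs stm) =
      sigma = elab_decs (empty, decs);   // start from the empty symbol table
      elab_stm (sigma, stm);
$$\frac{\emptyset \vdash \mathit{decs} : \Sigma \qquad \Sigma \vdash \mathit{stm} : \mathtt{OK}}{\vdash \mathit{decs}\ \mathit{stm} : \mathtt{OK}}$$
(∅ is the empty symbol table)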
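
The elaboration functions on the preceding slides can be collected into one runnable sketch. This is only an illustration under assumptions that are not in the slides: the AST is encoded as Python values (integers and booleans for literals, strings for variables, tuples such as `("+", e1, e2)` and `("print", e)` for everything else), the symbol table is a plain `dict`, and the names `ElabError` and `OPERATORS` are made up for the example.

```python
INT, BOOL = "int", "bool"

class ElabError(Exception):
    """Raised when a program breaks one of the typing rules."""

# Operand (and result) type required by each binary operator.
OPERATORS = {"+": INT, "&&": BOOL}

def elab_exp(sigma, e):
    """Return the type of expression e under the symbol table sigma."""
    if isinstance(e, bool):          # true / false (tested first: bool is a subclass of int)
        return BOOL
    if isinstance(e, int):           # integer literal n
        return INT
    if isinstance(e, str):           # variable x
        if e not in sigma:
            raise ElabError(f"variable {e} not declared")
        return sigma[e]
    op, e1, e2 = e                   # binary expression (op, e1, e2)
    if op not in OPERATORS:
        raise ElabError(f"unknown operator {op}")
    want = OPERATORS[op]
    for side, sub in (("e1", e1), ("e2", e2)):
        if elab_exp(sigma, sub) != want:
            raise ElabError(f"{side} should be {want}")
    return want

def elab_stm(sigma, s):
    """Check one statement; returns None when the statement is OK."""
    kind = s[0]
    if kind == "assign":             # ("assign", x, e)
        if elab_exp(sigma, s[1]) != elab_exp(sigma, s[2]):
            raise ElabError("different types in assignment")
    elif kind == "print":            # ("print", e)
        if elab_exp(sigma, s[1]) != INT:
            raise ElabError("print expects int")
    elif kind == "printBool":        # ("printBool", e)
        if elab_exp(sigma, s[1]) != BOOL:
            raise ElabError("printBool expects bool")
    else:
        raise ElabError(f"unknown statement {kind}")

def elab_decs(sigma, decs):
    """Extend sigma with a list of (type, name) declarations."""
    for ty, name in decs:
        if name in sigma:
            raise ElabError(f"duplicated declaration of {name}")
        sigma = {**sigma, name: ty}  # build a new table; the old one is untouched
    return sigma

def elab_prog(decs, stm):
    """Elaborate a program: declarations first, then the statement."""
    sigma = elab_decs({}, decs)
    elab_stm(sigma, stm)

# The example program from the slides: int x; int y; print(x+y);
elab_prog([(INT, "x"), (INT, "y")], ("print", ("+", "x", "y")))
```

Each `if` branch corresponds to one typing rule, and an exception plays the role of `error (...)` in the pseudocode.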

Moral
• There may be other information associated with identifiers, not just types, say:
  • Scope
  • Storage class
  • Access control information
  • …
• All these details are handled by symbol tables (∑)!

Implementation
• Must be efficient!
  • lots of variables, functions, etc.
• Two basic approaches:
  • Functional
    • the symbol table is implemented as a functional data structure (e.g., a red-black tree), with no tables ever destroyed or modified
  • Imperative
    • a single table, modified for every binding added or removed
• This choice is largely independent of the implementation language

Functional Symbol Table
• Basic idea:
  • when implementing σ2 = σ1 + {x:t}
  • create a new table σ2, instead of modifying σ1
  • when deleting, restore the old table
• A good data structure for this is a BST or a red-black tree

BST Symbol Table
[Figure: BSTs for σ and σ’ with node labels c: int, c: int, e: int, a: char, b: double; the root c: int appears in both trees]

Possible Functional Interface
    signature SYMBOL_TABLE =
    sig
      type 'a t
      type key
      val empty: 'a t
      val insert: 'a t * key * 'a -> 'a t
      val lookup: 'a t * key -> 'a option
    end
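
One way to realize this interface is an unbalanced persistent BST, where `insert` copies only the nodes on the search path and shares the rest. The sketch below is in Python rather than ML, skips rebalancing (so it is not a red-black tree), and uses the hypothetical names `Node` and `EMPTY`; `lookup` returns `None` in place of ML's `NONE`.

```python
from typing import Any, NamedTuple, Optional

class Node(NamedTuple):
    key: str
    value: Any
    left: Optional["Node"]
    right: Optional["Node"]

EMPTY = None  # the empty table

def insert(t, key, value):
    """Return a new table with key bound to value; t itself is never modified."""
    if t is None:
        return Node(key, value, None, None)
    if key < t.key:
        return t._replace(left=insert(t.left, key, value))
    if key > t.key:
        return t._replace(right=insert(t.right, key, value))
    return t._replace(value=value)   # same key: new node with the new binding

def lookup(t, key):
    """Return the value bound to key, or None if there is no binding."""
    while t is not None:
        if key == t.key:
            return t.value
        t = t.left if key < t.key else t.right
    return None

# sigma1 = {c: int, a: char, b: double}; sigma2 = sigma1 + {e: int}
sigma1 = insert(insert(insert(EMPTY, "c", "int"), "a", "char"), "b", "double")
sigma2 = insert(sigma1, "e", "int")
```

After the last line `sigma1` still has no binding for `e`, so "deleting" a binding just means going back to using `sigma1`. Only the root was copied: `sigma2.left` and `sigma1.left` are the same object.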

Imperative Symbol Tables
• The imperative approach almost always involves the use of hash tables
• Need to delete entries to revert to the previous environment
  • made simpler because deletes follow a stack discipline
  • can maintain a stack of entered symbols, so that they can later be popped and removed from the hash table

Possible Imperative Interface
    signature SYMBOL_TABLE =
    sig
      type 'a t
      type key
      val insert: 'a t * key * 'a -> unit
      val lookup: 'a t * key -> 'a option
      val delete: 'a t * key -> unit
      val beginScope: unit -> unit
      val endScope: unit -> unit
    end
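
A minimal sketch of the imperative approach, assuming a Python `dict` as the hash table and a list as the stack of entered symbols. The class name `SymbolTable` and the use of `None` as a scope marker are choices made for this example, not part of the interface above.

```python
class SymbolTable:
    def __init__(self):
        self.table = {}   # name -> list of bindings, innermost binding last
        self.undo = []    # stack of entered names; None marks the start of a scope

    def begin_scope(self):
        self.undo.append(None)

    def insert(self, key, value):
        self.table.setdefault(key, []).append(value)
        self.undo.append(key)

    def lookup(self, key):
        bindings = self.table.get(key)
        return bindings[-1] if bindings else None

    def end_scope(self):
        # Pop entered names back to the most recent scope marker,
        # removing each binding from the hash table as we go.
        while self.undo:
            key = self.undo.pop()
            if key is None:
                break
            self.table[key].pop()

table = SymbolTable()
table.insert("x", "int")      # outer x
table.begin_scope()
table.insert("x", "bool")     # inner x shadows the outer one
inner_x = table.lookup("x")   # the inner binding
table.end_scope()
outer_x = table.lookup("x")   # the outer binding is visible again
```

Because deletions follow a stack discipline, `end_scope` never needs to search the table: the undo stack records exactly which bindings to remove.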

Implementation of Symbols
• For several reasons, it will be useful at some point to represent symbols as elements of a small, densely packed set of identities
  • fast comparisons (equality)
  • for dataflow analysis, we will want sets of variables and fast set operations
    • It will be critically important to use bit strings to represent the sets
    • For example, your liveness analysis algorithm
• More on this later
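
The idea can be sketched as follows: intern each name to a small integer, then represent a set of symbols as an integer used as a bit string. The names `SymbolPool`, `bitset`, and `members` are hypothetical, and the final lines apply the standard liveness equation (in = use ∪ (out − def)) to made-up sets purely as an illustration.

```python
class SymbolPool:
    """Maps each distinct name to a dense integer id: 0, 1, 2, ..."""
    def __init__(self):
        self.ids = {}
        self.names = []

    def intern(self, name):
        if name not in self.ids:
            self.ids[name] = len(self.names)
            self.names.append(name)
        return self.ids[name]

def bitset(ids):
    """Build a bit string (a Python int) with bit i set for every id i."""
    s = 0
    for i in ids:
        s |= 1 << i
    return s

def members(pool, s):
    """List the names whose bits are set in s."""
    return [name for i, name in enumerate(pool.names) if (s >> i) & 1]

pool = SymbolPool()
a, b, c = pool.intern("a"), pool.intern("b"), pool.intern("c")

# Set operations become single bitwise operations.
live_out = bitset([a, b])
defs = bitset([a])
uses = bitset([c])
live_in = uses | (live_out & ~defs)
```

Equality of symbols is an integer comparison, and union, intersection, and difference are `|`, `&`, and `& ~`. Here `members(pool, live_in)` gives `["b", "c"]`.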

Scope
• How to handle lexical scope?
• Many choices:
  • One table + insert and remove bindings during elaboration, as we enter and leave a local scope
  • Stack of tables + insertion and removal always operate on the stack top
    • the dragon compiler makes use of this
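
The stack-of-tables choice can be sketched with Python's standard `collections.ChainMap`, which searches a list of dictionaries from front to back and writes only to the front one. The type strings and variable names below are made up for illustration.

```python
from collections import ChainMap

# The global scope: a stack holding one table.
outer = ChainMap()
outer["x"] = "int"
outer["f"] = "int ()"

# Entering a block pushes a fresh table on top of the stack.
inner = outer.new_child()
inner["x"] = "bool"          # inserted into the top table only; shadows the outer x
inner["y"] = "int"

x_inside = inner["x"]        # found in the top table
f_inside = inner["f"]        # not in the top table, found further down the stack

# Leaving the block pops the top table; the outer bindings were never touched.
after = inner.parents
x_after = after["x"]
```

Lookup walks the stack from the top, so shadowing needs no special handling, and leaving a scope is a single pop rather than a series of deletions.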

One-table approach
    int x;                 σ = {x:int}
    int f ()               σ1 = σ + {f:…} = {x:int, f:…}
    {
      if (4) {
        int x;             σ2 = σ1 + {x:int} = {x:…, f:…, x:…}
        x = 6;
      }                    σ1
      else {
        int x;             σ4 = σ1 + {x:int} = {x:…, f:…, x:…}
        x = 5;
      }                    σ1
      x = 8;
    }                      σ1
Shadowing: “+” is not commutative!

Name Space
    struct list { int x; struct list *list; } *list;
    void walk (struct list *list)
    {
    list:
      printf ("%d\n", list->x);
      if (list = list->list)
        goto list;
    }

Name Space
• It’s trivial to handle name spaces
  • one symbol table for each name space
• Take C as an example:
  • Several different name spaces
    • labels
    • tags
    • variables
• So …
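
"One symbol table for each name space" can be sketched as a dictionary of dictionaries. The class name `Env` and the info strings are hypothetical, and the sketch ignores scoping and the per-struct name space that C also has for members, so it covers only the three name spaces listed above.

```python
NAMESPACES = ("labels", "tags", "variables")

class Env:
    def __init__(self):
        # One independent symbol table per name space.
        self.tables = {ns: {} for ns in NAMESPACES}

    def enter(self, ns, name, info):
        table = self.tables[ns]
        if name in table:
            raise KeyError(f"duplicate name {name} in {ns}")
        table[name] = info

    def lookup(self, ns, name):
        return self.tables[ns].get(name)

# The identifier "list" from the C example, entered once per name space.
env = Env()
env.enter("tags", "list", "struct with field x")
env.enter("variables", "list", "pointer to struct list")
env.enter("labels", "list", "label in walk")
```

The three entries do not clash because the syntactic context (`struct list`, `goto list`, or a plain use of `list`) tells the elaborator which table to consult.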

Types
• The representation of types is highly language-dependent
• Some key considerations:
  • name vs. structural equivalence
  • mutually recursive type definitions
  • error handling

Name vs. Structural Equivalence
    struct A { int i; } x;
    struct B { int i; } y;
    x = y;
• In a language with structural equivalence, this program is legal
• But not in a language with name equivalence (e.g., C)
• For name equivalence, can generate a unique symbol for each defined type
• For structural equivalence, need to recursively compare the types
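
Both checks can be sketched on a toy representation of struct types. Everything here is an assumption made for illustration: the `Struct` class, the `uid` counter that stands in for the "unique symbol", and the decision that structural equivalence also requires matching field names (languages differ on this). The recursive comparison does not terminate on recursive types, which would need extra bookkeeping.

```python
import itertools
from dataclasses import dataclass, field

_fresh = itertools.count()   # source of unique symbols, one per type definition

@dataclass(frozen=True)
class Struct:
    fields: tuple            # ((field_name, field_type), ...); a type is a str or a Struct
    uid: int = field(default_factory=lambda: next(_fresh))

def name_equal(t1, t2):
    """Name equivalence: two struct types are equal only if they come from the same definition."""
    if isinstance(t1, Struct) and isinstance(t2, Struct):
        return t1.uid == t2.uid
    return t1 == t2

def struct_equal(t1, t2):
    """Structural equivalence: compare the two types recursively, field by field."""
    if isinstance(t1, Struct) and isinstance(t2, Struct):
        return len(t1.fields) == len(t2.fields) and all(
            f1 == f2 and struct_equal(ty1, ty2)
            for (f1, ty1), (f2, ty2) in zip(t1.fields, t2.fields)
        )
    return t1 == t2

A = Struct((("i", "int"),))   # struct A { int i; }
B = Struct((("i", "int"),))   # struct B { int i; }
```

With these definitions `name_equal(A, B)` is false while `struct_equal(A, B)` is true, which matches the two verdicts on `x = y` above.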

Mutually recursive type definitions
• To process recursive and mutually recursive type definitions, need a placeholder
  • in ML, an option ref
  • in C, a pointer
  • in Java, bind method (read Appel)
    struct A {
      int data;
      struct A *next;
      struct B *b;
    };
    struct B {…};
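
The placeholder technique can be sketched in two passes: first enter an empty placeholder for every declared type name, then resolve the field types, which may now refer to any of the placeholders. The class `TypeRef`, the function `elab_type_decs`, and the input encoding are hypothetical. The body of `struct B` is elided in the example above, so the field given to it here is invented for the demonstration, and pointer types are not modelled separately.

```python
class TypeRef:
    """Placeholder for a named type; target is filled in once the body is elaborated."""
    def __init__(self, name):
        self.name = name
        self.target = None

BUILTIN = {"int"}

def elab_type_decs(decs):
    """decs: list of (struct_name, [(field_name, type_name), ...])."""
    # Pass 1: a placeholder for every declared name, before looking at any body.
    env = {name: TypeRef(name) for name, _ in decs}
    # Pass 2: resolve field types; forward and self references hit the placeholders.
    for name, fields in decs:
        resolved = []
        for field_name, type_name in fields:
            if type_name in BUILTIN:
                resolved.append((field_name, type_name))
            elif type_name in env:
                resolved.append((field_name, env[type_name]))
            else:
                raise NameError(f"undefined type {type_name}")
        env[name].target = resolved
    return env

env = elab_type_decs([
    ("A", [("data", "int"), ("next", "A"), ("b", "B")]),
    ("B", [("a", "A")]),      # hypothetical body for struct B
])
```

After the call, the `next` field of `A` points back at the placeholder for `A` itself, and `A` and `B` refer to each other, even though `B` is declared after `A` uses it.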

Error Diagnostics
• To recover from errors, it is useful to have an “any” type
  • makes it possible to continue type checking after an error
  • In practice, use “int” or guess one
• Similarly, a “void” type can be used for expressions that return no value
• Source locations are annotated in the AST!
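
A sketch of recovery with an "any" type: an erroneous expression is reported once and given the type `ANY`, which is compatible with every expected type, so the error does not trigger a cascade of follow-up messages. The AST encoding (Python literals, strings for variables, tuples for binary operators), the `errors` list, and the helper `compatible` are assumptions made for this example.

```python
INT, BOOL, ANY = "int", "bool", "any"

def compatible(actual, expected):
    """ANY matches every expected type, which suppresses follow-up errors."""
    return actual == expected or actual == ANY

def elab_exp(sigma, e, errors):
    """Return the type of e, appending messages to errors instead of stopping."""
    if isinstance(e, bool):          # tested before int: bool is a subclass of int
        return BOOL
    if isinstance(e, int):
        return INT
    if isinstance(e, str):           # variable
        if e not in sigma:
            errors.append(f"variable {e} not declared")
            return ANY               # recover: keep checking the rest
        return sigma[e]
    op, e1, e2 = e                   # ("+", e1, e2) or ("&&", e1, e2)
    want = INT if op == "+" else BOOL
    t1 = elab_exp(sigma, e1, errors)
    t2 = elab_exp(sigma, e2, errors)
    for side, t in (("left", t1), ("right", t2)):
        if not compatible(t, want):
            errors.append(f"{side} operand of {op} should be {want}, got {t}")
    return want                      # the result type of the operator is known either way

# (x + 1) + true, with x undeclared
errors = []
result = elab_exp({}, ("+", ("+", "x", 1), True), errors)
```

The run reports two independent problems, the undeclared `x` and the boolean operand, and no spurious complaint that `x + 1` has a non-integer operand.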

Summary
• Elaboration checks the context-sensitive properties of programs
  • must take care of the semantics of source programs
  • and may translate them into lower-level forms
• Usually the biggest (most complex) part of a compiler!
