Abstract Syntax Trees

Abstract Syntax Trees Compiler Baojian Hua bjhua@ustc.edu.cn

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Recap
• Lexer
  • turns the program source into a token sequence
• Parser
  • consumes the token sequence and answers Y or N (is the program syntactically valid?)
• Today's topic:
  • abstract syntax trees

Abstract Syntax Trees
[Figure: parse tree for 15 * (3 + 4)]
• Parse trees encode the grammatical structure of the source program
• However, they contain a lot of unnecessary information
• What is essential here?

Abstract Syntax Trees
[Figure: parse tree for 15 * (3 + 4)]
• To understand an expression, the compiler only needs to know the operators and operands
  • punctuation, parentheses, etc. are not needed
• The same holds for statements, functions, etc.

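Abstract Syntax Trees
[Figure: the parse tree for 15 * (3 + 4) next to its abstract syntax tree; the abstract syntax tree has * at the root with children 15 and +, and + has children 3 and 4]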

Concrete vs Abstract Syntax
• Concrete syntax is needed for parsing
  • it includes punctuation symbols, reflects factoring and elimination of left recursion, and depends on the format of the input
• Abstract syntax is a simpler, more convenient internal representation
  • it gives a clean interface between the parser and the later phases of the compiler

Concrete and Abstract Syntax
Input: 2 + 3 * x
Concrete syntax:
  E ::= E + T | T
  T ::= T * F | F
  F ::= id | num | ( E )
[Figure: parse tree for 2 + 3 * x under this grammar, with interior nodes S, E, T, and F]

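Concrete and Abstract Syntax
Input: 2 + 3 * x
Abstract syntax:
  E ::= id | num | E + E | E * E | ( E )
[Figure: abstract syntax tree with + at the root; its children are 2 and *, and * has children 3 and x]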

AST Data Structures
• In the compiler, abstract syntax is represented with the data structures of the implementation language
• The design depends heavily on the target and implementation languages
• more art than science

AST in C

AST for "exp" in C
Grammar: E -> id | num | E + E | E * E | ( E )

/* data structures */
typedef struct E *E;
enum kind {ID, INT, ADD, TIMES};
struct E {
  enum kind kind;
  union {
    char *id;
    int num;
    struct {E e1; E e2;} add;
    struct {E e1; E e2;} times;
  } u;
};
// This technique is called a tagged union.

AST in C
Grammar: E ::= id | num | E + E | E * E | ( E )

/* sample program "2+3*x" */
#include <stdlib.h> /* for malloc */
E e1 = malloc (sizeof (*e1));
e1->kind = INT;
e1->u.num = 3;
E e2 = malloc (sizeof (*e2));
e2->kind = ID;
e2->u.id = "x";
E e3 = malloc (sizeof (*e3));
e3->kind = TIMES;
e3->u.times.e1 = e1;
e3->u.times.e2 = e2;
…
/* boring and error-prone :-( */

AST for "stm" in C
Grammar:
  SS -> S | SS ; S
  S  -> id := E | print (E)

/* data structures */
typedef struct S *S;
/* pseudocode: C has no List<S>; SS stands for some list of S */
typedef List<S> SS;
enum stm_kind {ASSIGN, PRINT}; /* named apart from the expression's enum kind */
struct S {
  enum stm_kind kind;
  union {
    struct {char *id; E e;} assign;
    E print;
  } u;
};
/* to encode "x:=3; print(x)" */
prog = …; // left to you…

Operations are tree walks
/* pretty printing */
void pp (E e){
  switch (e->kind) {
  case INT: printf ("%d", e->u.num); return;
  case ID: printf ("%s", e->u.id); return;
  case ADD:
    printf ("("); pp (e->u.add.e1); printf (")");
    printf (" + ");
    printf ("("); pp (e->u.add.e2); printf (")");
    return;
  case TIMES: /* similar */ return;
  default: error ("compiler bug");
  }
}

Operations are tree walks
/* number of nodes in an AST */
int numNodes (E e) {
  switch (e->kind) {
  case INT: return 1;
  case ID: return 1;
  case ADD:
  case TIMES:
    /* ADD and TIMES nodes have the same two-child layout */
    return 1 + numNodes (e->u.add.e1) + numNodes (e->u.add.e2);
  default: error ("compiler bug");
  }
}
C compiler is stupid!

AST in F#

AST for "stm" in F#
Grammar:
  SS -> S | SS ; S
  S  -> x := E | print (E)

(* data structures; exp is the expression type with constructors Int, Id, Add, Times *)
type S =
  | Assign of string * exp
  | Print of exp
type SS = S list
(* to encode "x:=3; print(x)" *)
let prog = [Assign ("x", Int 3); Print (Id "x")]

AST in F#
(* number of nodes *)
let rec numNodes e =
  match e with
  | Int _ -> 1
  | Id _ -> 1
  | Add (e1, e2) -> 1 + (numNodes e1) + (numNodes e2)
  | Times (e1, e2) -> 1 + (numNodes e1) + (numNodes e2)
(* Note this may be too inefficient, why? *)

AST in F#
(* accumulator-passing version: n caches the count so far, and the second recursive call is a tail call *)
let rec numNodes (e, n) =
  match e with
  | Int _ -> 1 + n
  | Id _ -> 1 + n
  | Add (e1, e2) ->
      let n' = numNodes (e1, n)
      numNodes (e2, 1 + n')
  | Times (e1, e2) -> … (* similar *)

AST in F#
(* yet another version using a reference *)
let nodes = ref 0
let bump (x: int ref) = x.Value <- x.Value + 1
let rec numNodes e =
  match e with
  | Int _ -> bump nodes
  | Id _ -> bump nodes
  | Add (e1, e2) -> (numNodes e1; bump nodes; numNodes e2)
  | Times (e1, e2) -> … (* similar *)

AST in Java

AST for "exp" in Java
Grammar: E ::= id | num | E + E | E * E | ( E )

/* data structures */
abstract class Exp {}
class IntExp extends Exp {
  int i;
  IntExp (int i){ this.i = i; }
}
// constructors omitted from the following classes
class IdExp extends Exp {String id;}
class AddExp extends Exp {Exp e1; Exp e2;}
class TimesExp extends Exp {Exp e1; Exp e2;}

Local Class Hierarchy
Grammar: E ::= id | num | E + E | E * E | ( E )
[Figure: class hierarchy with Exp at the root and subclasses IntExp, IdExp, AddExp, TimesExp]

/* to encode "2+3*x" */
Exp e = new AddExp(new IntExp(2),
                   new TimesExp(new IntExp(3),
                                new IdExp("x")));
/* Not as ugly as the C version, but still boring */

AST for "stm" in Java
Grammar:
  stms -> stm | stms ; stm
  stm  -> x := e | print (e)

/* data structures (needs java.util.Arrays and java.util.LinkedList) */
class Stms { LinkedList<Stm> stms; }
class Stm {}
class AssignStm extends Stm { String x; Exp e; }
class PrintStm extends Stm { Exp e; }
/* to encode "x:=3; print(x)" */
LinkedList<Stm> prog = new LinkedList<>(Arrays.<Stm>asList(new AssignStm (…), new PrintStm (…)));
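
One way the elided encoding of "x:=3; print(x)" might be completed is sketched below. The constructors, the `toString` methods, and the `StmDemo` class are assumptions added for illustration; the slides leave them out.

```java
import java.util.Arrays;
import java.util.LinkedList;

// Expression classes as on the "exp" slide, with constructors added (assumed).
abstract class Exp {}
class IntExp extends Exp {
    final int i;
    IntExp(int i) { this.i = i; }
    @Override public String toString() { return Integer.toString(i); }
}
class IdExp extends Exp {
    final String id;
    IdExp(String id) { this.id = id; }
    @Override public String toString() { return id; }
}

// Statement classes; constructors and toString are illustrative additions.
abstract class Stm {}
class AssignStm extends Stm {
    final String x;
    final Exp e;
    AssignStm(String x, Exp e) { this.x = x; this.e = e; }
    @Override public String toString() { return x + " := " + e; }
}
class PrintStm extends Stm {
    final Exp e;
    PrintStm(Exp e) { this.e = e; }
    @Override public String toString() { return "print(" + e + ")"; }
}

class StmDemo {
    // Builds the AST for: x := 3; print(x)
    static LinkedList<Stm> build() {
        return new LinkedList<>(Arrays.<Stm>asList(
            new AssignStm("x", new IntExp(3)),
            new PrintStm(new IdExp("x"))));
    }

    public static void main(String[] args) {
        for (Stm s : build()) {
            System.out.println(s); // one statement per line
        }
    }
}
```

The statement list is just a list of small objects, one per statement, each holding the expression trees it needs.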

AST in Java
/* number of nodes again */
int numNodes (Exp e) {
  if (e instanceof IntExp) return 1;
  else if (e instanceof IdExp) return 1;
  else if (e instanceof AddExp) {
    AddExp f = (AddExp) e;
    return 1 + numNodes(f.e1) + numNodes(f.e2);
  }
  …
}
But this breaks the modularity of Java. A better way is to use the so-called visitor pattern. Read Tiger chap. 4 and do lab 2.
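
The visitor pattern mentioned above can be sketched roughly as follows. This is a minimal illustration, not the Tiger book's code: the `Visitor` interface, the `accept` methods, and the `NodeCounter` class are names assumed here.

```java
// Hypothetical visitor interface: one method per AST node class.
interface Visitor<R> {
    R visitInt(IntExp e);
    R visitId(IdExp e);
    R visitAdd(AddExp e);
    R visitTimes(TimesExp e);
}

abstract class Exp {
    // Each node dispatches to the visitor method for its own class.
    abstract <R> R accept(Visitor<R> v);
}
class IntExp extends Exp {
    final int i;
    IntExp(int i) { this.i = i; }
    <R> R accept(Visitor<R> v) { return v.visitInt(this); }
}
class IdExp extends Exp {
    final String id;
    IdExp(String id) { this.id = id; }
    <R> R accept(Visitor<R> v) { return v.visitId(this); }
}
class AddExp extends Exp {
    final Exp e1, e2;
    AddExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
    <R> R accept(Visitor<R> v) { return v.visitAdd(this); }
}
class TimesExp extends Exp {
    final Exp e1, e2;
    TimesExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
    <R> R accept(Visitor<R> v) { return v.visitTimes(this); }
}

// The node-counting operation lives in one class; no instanceof tests or casts.
class NodeCounter implements Visitor<Integer> {
    public Integer visitInt(IntExp e) { return 1; }
    public Integer visitId(IdExp e) { return 1; }
    public Integer visitAdd(AddExp e) {
        return 1 + e.e1.accept(this) + e.e2.accept(this);
    }
    public Integer visitTimes(TimesExp e) {
        return 1 + e.e1.accept(this) + e.e2.accept(this);
    }
}

class VisitorDemo {
    public static void main(String[] args) {
        // AST for 2 + 3 * x
        Exp e = new AddExp(new IntExp(2),
                           new TimesExp(new IntExp(3), new IdExp("x")));
        System.out.println(e.accept(new NodeCounter())); // prints 5
    }
}
```

Adding another operation, such as pretty printing, then means writing another `Visitor` implementation rather than editing every node class.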

How to construct ASTs automatically?

AST Generation
• Attribute-grammar scheme
  • each nonterminal may have a semantic value v associated with it
  • when the parser recognizes a rule X -> s1 … sn
    • a semantic action is executed
    • it uses the semantic values of the symbols si
  • when parsing completes successfully
    • the parser returns the semantic value associated with the start symbol
    • usually an abstract syntax tree

AST Generation in Tools
• In a top-down parser, ASTs are returned (recursively) as values
• Yacc-like tools encode this strategy in semantic actions

/* AST generation in a recursive descent parser */
Grammar (with the AST type returned for each nonterminal):
  SS -> S | S ; SS                  (List<S>)
  S  -> id := E | print (E)         (S)
  E  -> id | num | E + E | E * E    (E)

List<S> parse_stms () {
  List<S> list = new List<S>();
  S stm = parse_stm ();
  list.addLast (stm);
  while (current_token == SEMI) {
    eat (SEMI);
    stm = parse_stm ();
    list.addLast (stm);
  }
  return list;
}

/* AST generation in a recursive descent parser */
S parse_stm () {
  switch (current_token) {
  case ID: {
    String x = current_token; // remember the "x"
    eat (ID);
    eat (ASSIGN);
    E exp = parse_exp ();
    S stm = new AssignStm (x, exp);
    return stm;
  }
  case PRINT: {
    eat (PRINT);
    eat (LPAREN);
    E exp = parse_exp ();
    eat (RPAREN);
    S stm = new PrintStm (exp);
    return stm;
  }
  }
}

/* AST generation in a recursive descent parser */
E parse_addexp () {
  E e1 = parse_timesexp ();
  while (current_token == PLUS) {
    eat (PLUS);
    E e2 = parse_timesexp ();
    e1 = new AddExp (e1, e2); // fold to the left: (a + b) + c
  }
  return e1;
}
E parse_timesexp () {
  E e1 = parse_atom ();
  while (current_token == TIMES) {
    eat (TIMES);
    E e2 = parse_atom ();
    e1 = new TimesExp (e1, e2);
  }
  return e1;
}

/* AST generation in a recursive descent parser */
E parse_atom () {
  switch (current_token) {
  case ID: {
    E e = new IdExp (current_token);
    eat (ID);
    return e;
  }
  case NUM: {
    E e = new IntExp (current_token);
    eat (NUM);
    return e;
  }
  default: error ("want ID or NUM, but got …");
  }
}
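
The expression part of the parser above can be sketched as runnable Java. This is a simplified illustration under stated assumptions: tokens arrive as a pre-split list of strings, and `ExpParser`, `current`, `eat`, and the `toString` methods are names invented here, not part of the slides.

```java
import java.util.Arrays;
import java.util.List;

// AST classes mirroring the Java slides, with toString added for inspection.
abstract class Exp {}
class IntExp extends Exp {
    final int i;
    IntExp(int i) { this.i = i; }
    @Override public String toString() { return Integer.toString(i); }
}
class IdExp extends Exp {
    final String id;
    IdExp(String id) { this.id = id; }
    @Override public String toString() { return id; }
}
class AddExp extends Exp {
    final Exp e1, e2;
    AddExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
    @Override public String toString() { return "(" + e1 + " + " + e2 + ")"; }
}
class TimesExp extends Exp {
    final Exp e1, e2;
    TimesExp(Exp e1, Exp e2) { this.e1 = e1; this.e2 = e2; }
    @Override public String toString() { return "(" + e1 + " * " + e2 + ")"; }
}

// Hypothetical recursive descent parser over pre-split tokens.
class ExpParser {
    private final List<String> tokens;
    private int pos = 0;

    ExpParser(List<String> tokens) { this.tokens = tokens; }

    private String current() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void eat(String expected) {
        if (!current().equals(expected)) {
            throw new IllegalStateException(
                "want " + expected + ", but got " + current());
        }
        pos++;
    }

    // addexp -> timesexp { + timesexp }, building a left-leaning tree
    Exp parseAddExp() {
        Exp e = parseTimesExp();
        while (current().equals("+")) {
            eat("+");
            e = new AddExp(e, parseTimesExp());
        }
        return e;
    }

    // timesexp -> atom { * atom }
    Exp parseTimesExp() {
        Exp e = parseAtom();
        while (current().equals("*")) {
            eat("*");
            e = new TimesExp(e, parseAtom());
        }
        return e;
    }

    // atom -> id | num
    Exp parseAtom() {
        String t = current();
        if (t.matches("[0-9]+")) {
            pos++;
            return new IntExp(Integer.parseInt(t));
        }
        if (t.matches("[A-Za-z_][A-Za-z0-9_]*")) {
            pos++;
            return new IdExp(t);
        }
        throw new IllegalStateException("want ID or NUM, but got " + t);
    }
}

class ParserDemo {
    public static void main(String[] args) {
        Exp e = new ExpParser(Arrays.asList("2", "+", "3", "*", "x")).parseAddExp();
        System.out.println(e); // prints (2 + (3 * x))
    }
}
```

Each parse function returns the tree for the phrase it recognized, so the AST is assembled bottom-up as the recursive calls return.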

AST generation in an LR parser
[Figure: step-by-step LR parse of 2 + 3 * 4, showing the stack and the remaining input after each shift and reduce, together with the partial tree built for each stack symbol]
Each nonterminal is associated with a tree.

AST generation in an LR parser
e -> e PLUS e    ($$ = Add ($1, $3))
   | e TIMES e   ($$ = Times ($1, $3))
   | ID          ($$ = Id ($1))
   | NUM         ($$ = Num ($1))
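
To make the `$$`, `$1`, `$3` notation concrete, here is a rough sketch of how semantic actions can operate on a stack of semantic values. It is hand-driven: the shift and reduce steps for 2 + 3 * 4 are written out by hand rather than chosen by a parsing table, and `ValueStack`, `LrDemo`, and the method names are assumptions made for this example, not the internals of any Yacc-like tool.

```java
import java.util.ArrayDeque;
import java.util.Deque;

abstract class Exp {}
class Num extends Exp {
    final int n;
    Num(int n) { this.n = n; }
    @Override public String toString() { return Integer.toString(n); }
}
class Add extends Exp {
    final Exp l, r;
    Add(Exp l, Exp r) { this.l = l; this.r = r; }
    @Override public String toString() { return "(" + l + " + " + r + ")"; }
}
class Times extends Exp {
    final Exp l, r;
    Times(Exp l, Exp r) { this.l = l; this.r = r; }
    @Override public String toString() { return "(" + l + " * " + r + ")"; }
}

// Hypothetical semantic value stack: one value per grammar symbol on the parse stack.
class ValueStack {
    private final Deque<Object> values = new ArrayDeque<>();

    // Shifting a token pushes its value (an Integer for NUM, a String for an operator).
    void shift(Object tokenValue) { values.push(tokenValue); }

    // e -> NUM   ($$ = Num ($1))
    void reduceNum() {
        Integer n = (Integer) values.pop(); // $1
        values.push(new Num(n));            // $$
    }

    // e -> e PLUS e   ($$ = Add ($1, $3))
    void reduceAdd() {
        Exp e3 = (Exp) values.pop(); // $3
        values.pop();                // $2, the PLUS token
        Exp e1 = (Exp) values.pop(); // $1
        values.push(new Add(e1, e3));
    }

    // e -> e TIMES e   ($$ = Times ($1, $3))
    void reduceTimes() {
        Exp e3 = (Exp) values.pop();
        values.pop();
        Exp e1 = (Exp) values.pop();
        values.push(new Times(e1, e3));
    }

    Exp result() { return (Exp) values.peek(); }
}

class LrDemo {
    // Replays the shift/reduce sequence for 2 + 3 * 4 by hand.
    static Exp parse() {
        ValueStack s = new ValueStack();
        s.shift(2);   s.reduceNum();
        s.shift("+");
        s.shift(3);   s.reduceNum();
        s.shift("*");
        s.shift(4);   s.reduceNum();
        s.reduceTimes(); // 3 * 4 is reduced first
        s.reduceAdd();   // then 2 + (3 * 4)
        return s.result();
    }

    public static void main(String[] args) {
        System.out.println(parse()); // prints (2 + (3 * 4))
    }
}
```

When the start symbol is finally reduced, the single value left on the stack is the AST for the whole input.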

Source Position
• In a one-pass compiler, error messages are precise
  • early compilers never worried about this
• But in a multi-pass compiler, source positions must be stored in the AST for later use

/* Example */
class AddExp {
  Exp left;
  Exp right;
  int lineNum;
  int columnNum;
}
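
A small sketch of why a later pass needs those positions: the check below runs after parsing and can still report where the problem was. The `ScopeCheck` class, its message format, and the idea of checking for undeclared identifiers are assumptions chosen for illustration; it uses `instanceof` only to stay short.

```java
import java.util.Collections;
import java.util.Set;

// Every node records where it came from in the source text.
abstract class Exp {
    final int lineNum;
    final int columnNum;
    Exp(int lineNum, int columnNum) {
        this.lineNum = lineNum;
        this.columnNum = columnNum;
    }
}
class IntExp extends Exp {
    final int i;
    IntExp(int i, int line, int col) { super(line, col); this.i = i; }
}
class IdExp extends Exp {
    final String id;
    IdExp(String id, int line, int col) { super(line, col); this.id = id; }
}
class AddExp extends Exp {
    final Exp left, right;
    AddExp(Exp left, Exp right, int line, int col) {
        super(line, col);
        this.left = left;
        this.right = right;
    }
}

// Hypothetical later pass that runs on the finished AST.
class ScopeCheck {
    // Returns a message for the first undeclared identifier, or null if there is none.
    static String check(Exp e, Set<String> declared) {
        if (e instanceof IdExp) {
            IdExp id = (IdExp) e;
            if (declared.contains(id.id)) return null;
            return id.lineNum + ":" + id.columnNum
                + ": undeclared identifier " + id.id;
        }
        if (e instanceof AddExp) {
            AddExp a = (AddExp) e;
            String msg = check(a.left, declared);
            return msg != null ? msg : check(a.right, declared);
        }
        return null; // IntExp: nothing to check
    }

    public static void main(String[] args) {
        // "x + y" on line 3: x at column 1, + at column 3, y at column 5
        Exp e = new AddExp(new IdExp("x", 3, 1), new IdExp("y", 3, 5), 3, 3);
        System.out.println(check(e, Collections.singleton("x")));
        // prints 3:5: undeclared identifier y
    }
}
```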

Summary
• Abstract syntax trees are the compiler's internal data structures for source programs
  • they are the interface between the front end and the later parts of the compiler
• The design of abstract syntax trees is language-dependent
  • more art than science
