Abstract Syntax Trees
Compiler
Baojian Hua
bjhua@ustc.edu.cn

Front End
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> semantic analyzer -> IR

Recap
• Lexer
  • turns the program source into a token sequence
• Parser
  • takes the token sequence and answers Y or N (is the program syntactically valid?)
• Today’s topic:
  • abstract syntax trees

Abstract Syntax Trees
[Figure: parse tree for 15 * (3 + 4), with root E -> E * E, right operand ( E ), and inner E -> E + E]
• Parse trees encode the grammatical structure of the source program
• However, they contain a lot of unnecessary information
• What is essential here?

Abstract Syntax Trees
[Figure: the same parse tree for 15 * (3 + 4)]
• For the compiler to understand an expression, it only needs to know the operators and operands
  • punctuation, parentheses, etc. are not needed
• The same holds for statements, functions, etc.

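Abstract Syntax Trees
[Figure: parse tree for 15 * (3 + 4) next to its abstract syntax tree]
Parse tree: E -> E * E, with operands 15 and ( E ), where the inner E -> E + E has operands 3 and 4
Abstract syntax tree: Times (Int 15, Plus (Int 3, Int 4))
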
Concrete and Abstract Syntax
• Concrete syntax is needed for parsing
  • it includes punctuation symbols and reflects factoring and elimination of left recursion
  • it depends on the format of the input
• Abstract syntax is a simpler, more convenient internal representation
  • it gives a clean interface between the parser and the later phases of the compiler

Concrete and Abstract Syntax
Input: 2 + 3 * x
Concrete grammar:
    E ::= E + T | T
    T ::= T * F | F
    F ::= id | num | ( E )
[Figure: parse tree for 2 + 3 * x under this grammar, rooted at S]

Concrete and Abstract Syntax
Input: 2 + 3 * x
Abstract grammar:
    E ::= id | num | E + E | E * E | ( E )
[Figure: abstract syntax tree for 2 + 3 * x]
Abstract syntax tree: Plus (Int 2, Times (Int 3, Id x))

AST Data Structures
• In the compiler, abstract syntax uses the data types of the implementation language to represent the grammatical structure
• The design depends heavily on both the target language and the implementation language
  • more art than science

AST in SML
Grammar: E ::= id | num | E + E | E * E | ( E )

    (* data structures *)
    datatype exp
      = Int of int
      | Id of string
      | Add of exp * exp
      | Times of exp * exp

    (* to encode "2+3*x" *)
    val prog = Add (Int 2, Times (Int 3, Id "x"))

    (* Compile "2+3*x". To be covered later… *)
    val x86 = compile (prog)

AST in SML

    (* calculate the number of nodes in an ast *)
    fun numNodes e =
      case e
       of Int _ => 1
        | Id _ => 1
        | Add (e1, e2) => 1 + numNodes e1 + numNodes e2
        | Times (e1, e2) => 1 + numNodes e1 + numNodes e2

    (* Note this may be too inefficient, why? *)

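AST in SML

    (* tail recursion with an accumulator: n is the number of nodes
       counted so far; only the call on e2 is a tail call *)
    fun numNodes (e, n) =
      case e
       of Int _ => 1 + n
        | Id _ => 1 + n
        | Add (e1, e2) =>
            let val n' = numNodes (e1, n)
            in  numNodes (e2, 1+n')
            end
        | Times (e1, e2) => … (* similar *)
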
AST in SML

    (* yet another version, using a reference *)
    val nodes = ref 0;
    val op ++ = fn x => x := !x + 1

    fun numNodes e =
      case e
       of Int _ => ++ nodes
        | Id _ => ++ nodes
        | Add (e1, e2) => (numNodes e1; ++ nodes; numNodes e2)
        | Times (e1, e2) => … (* similar *)

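One possible answer to the "why?" on the first numNodes slide (an assumption, since the slides leave the question open): the recursion depth grows with the height of the tree, and the accumulator version still makes a non-tail call on e1. A minimal sketch in C, the language of the later slides, that replaces the recursion with an explicit worklist. The names leaf, bin and numNodesIter are illustrative and do not come from the slides.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct exp *exp;
enum expKind {INT, ID, ADD, TIMES};
struct exp {
  enum expKind kind;
  union {
    int i;
    char *id;
    struct {exp e1; exp e2;} add;
    struct {exp e1; exp e2;} times;
  } u;
};

/* Hypothetical constructors, only here so that trees can be built. */
exp leaf(int i) {
  exp e = malloc(sizeof(*e));
  assert(e != NULL);
  e->kind = INT;
  e->u.i = i;
  return e;
}

exp bin(enum expKind kind, exp e1, exp e2) {
  exp e = malloc(sizeof(*e));
  assert(e != NULL);
  e->kind = kind;
  if (kind == ADD) { e->u.add.e1 = e1; e->u.add.e2 = e2; }
  else { e->u.times.e1 = e1; e->u.times.e2 = e2; }
  return e;
}

/* Count nodes without recursion: pending subtrees live on a heap-allocated
   stack, so a very deep tree cannot overflow the C call stack. */
int numNodesIter(exp root) {
  size_t cap = 16, top = 0;
  exp *stack = malloc(cap * sizeof(*stack));
  assert(stack != NULL);
  int n = 0;
  stack[top++] = root;
  while (top > 0) {
    exp e = stack[--top];
    n++;
    if (e->kind == ADD || e->kind == TIMES) {
      if (top + 2 > cap) {            /* make room for both children */
        cap *= 2;
        stack = realloc(stack, cap * sizeof(*stack));
        assert(stack != NULL);
      }
      stack[top++] = e->kind == ADD ? e->u.add.e1 : e->u.times.e1;
      stack[top++] = e->kind == ADD ? e->u.add.e2 : e->u.times.e2;
    }
  }
  free(stack);
  return n;
}
```

The trade-off is the usual one: the worklist lives on the heap, so its size is limited by available memory rather than by the call stack.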
AST in C
Grammar: E ::= id | num | E + E | E * E | ( E )

    /* data structures */
    typedef struct exp *exp;
    enum expKind {INT, ID, ADD, TIMES};
    struct exp {
      enum expKind kind;
      union {
        int i;
        char *id;
        struct {exp e1; exp e2;} add;
        struct {exp e1; exp e2;} times;
      } u;
    };

AST in C
Grammar: E ::= id | num | E + E | E * E | ( E )

    /* sample program "2+3*x" */
    exp e1 = malloc (sizeof (*e1));
    e1->kind = INT;
    e1->u.i = 3;
    exp e2 = malloc (sizeof (*e2));
    e2->kind = ID;
    e2->u.id = "x";
    exp e3 = malloc (sizeof (*e3));
    e3->kind = TIMES;
    e3->u.times.e1 = e1;
    e3->u.times.e2 = e2;
    …
    /* really boring and error-prone :-( */

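One common way to tame the boilerplate is to wrap allocation and field assignment in constructor functions, so that building a tree reads almost like the SML version. A minimal sketch; newExp, mkInt, mkId, mkAdd, mkTimes and sample are hypothetical helper names, not part of the slides.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct exp *exp;
enum expKind {INT, ID, ADD, TIMES};
struct exp {
  enum expKind kind;
  union {
    int i;
    char *id;
    struct {exp e1; exp e2;} add;
    struct {exp e1; exp e2;} times;
  } u;
};

/* Hypothetical helper: allocate a node and set its kind. */
static exp newExp(enum expKind kind) {
  exp e = malloc(sizeof(*e));
  assert(e != NULL);
  e->kind = kind;
  return e;
}

/* One constructor per alternative of the abstract grammar. */
exp mkInt(int i) {
  exp e = newExp(INT);
  e->u.i = i;
  return e;
}

exp mkId(char *id) {
  exp e = newExp(ID);
  e->u.id = id;
  return e;
}

exp mkAdd(exp e1, exp e2) {
  exp e = newExp(ADD);
  e->u.add.e1 = e1;
  e->u.add.e2 = e2;
  return e;
}

exp mkTimes(exp e1, exp e2) {
  exp e = newExp(TIMES);
  e->u.times.e1 = e1;
  e->u.times.e2 = e2;
  return e;
}

/* "2+3*x", built in one expression. */
exp sample(void) {
  return mkAdd(mkInt(2), mkTimes(mkInt(3), mkId("x")));
}
```

Each field is now assigned in exactly one place, which removes the chance of writing to the wrong node when the tree is built by hand.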
AST in C

    /* number of nodes again */
    int numNodes (exp e)
    {
      switch (e->kind) {
      case INT: return 1;
      case ID: return 1;
      case ADD:
      case TIMES:
        /* add and times have the same layout, so one case serves both */
        return 1 + numNodes (e->u.add.e1)
                 + numNodes (e->u.add.e2);
      default: error ("impossible");
      }
    }

Aha, the C compiler is stupid!

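The earlier claim that parentheses need not be stored in the tree can be checked by printing an AST back to concrete syntax: the tree shape plus operator precedence is enough to decide where parentheses are required. A rough sketch under assumptions: the constructors, prec, emit and show are illustrative names, and the output format (spaces around operators) is a choice made here, not something the slides specify.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct exp *exp;
enum expKind {INT, ID, ADD, TIMES};
struct exp {
  enum expKind kind;
  union {
    int i;
    char *id;
    struct {exp e1; exp e2;} add;
    struct {exp e1; exp e2;} times;
  } u;
};

/* Hypothetical constructors. */
static exp newExp(enum expKind kind) {
  exp e = malloc(sizeof(*e));
  assert(e != NULL);
  e->kind = kind;
  return e;
}
exp mkInt(int i) { exp e = newExp(INT); e->u.i = i; return e; }
exp mkId(char *id) { exp e = newExp(ID); e->u.id = id; return e; }
exp mkAdd(exp a, exp b) {
  exp e = newExp(ADD); e->u.add.e1 = a; e->u.add.e2 = b; return e;
}
exp mkTimes(exp a, exp b) {
  exp e = newExp(TIMES); e->u.times.e1 = a; e->u.times.e2 = b; return e;
}

/* Binding strength: + is weakest, * is stronger, leaves are strongest. */
static int prec(exp e) {
  switch (e->kind) {
  case ADD: return 1;
  case TIMES: return 2;
  default: return 3;
  }
}

/* Append s to the NUL-terminated string in buf, never exceeding cap bytes. */
static void emit(char *buf, size_t cap, const char *s) {
  strncat(buf, s, cap - strlen(buf) - 1);
}

/* Print e into buf; ctx is the weakest precedence allowed without
   parentheses at this position. buf must start as an empty string. */
void show(exp e, int ctx, char *buf, size_t cap) {
  char num[32];
  int needParens = prec(e) < ctx;
  if (needParens) emit(buf, cap, "(");
  switch (e->kind) {
  case INT:
    snprintf(num, sizeof num, "%d", e->u.i);
    emit(buf, cap, num);
    break;
  case ID:
    emit(buf, cap, e->u.id);
    break;
  case ADD:   /* left-associative: the right operand must bind tighter */
    show(e->u.add.e1, 1, buf, cap);
    emit(buf, cap, " + ");
    show(e->u.add.e2, 2, buf, cap);
    break;
  case TIMES:
    show(e->u.times.e1, 2, buf, cap);
    emit(buf, cap, " * ");
    show(e->u.times.e2, 3, buf, cap);
    break;
  }
  if (needParens) emit(buf, cap, ")");
}
```

Calling show(e, 0, buf, sizeof buf) on Times (Int 15, Add (Int 3, Int 4)) re-inserts the parentheses around the sum, while Add (Int 2, Times (Int 3, Id "x")) is printed without any.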
AST in OO
Grammar: E ::= id | num | E + E | E * E | ( E )

    /* data structures */
    abstract class Exp {}
    class Int extends Exp {…}
    class Id extends Exp {…}
    class Add extends Exp {…}
    class Times extends Exp {…}

    /* to encode "2+3*x" */
    Exp prog = new Add (new Int (2),
                        new Times (new Int (3), new Id ("x")));

    /* Not as ugly as C, but still boring */

AST in OO

    /* number of nodes again */
    int numNodes (Exp e)
    {
      if (e instanceof Int)
        return 1;
      else if (e instanceof Id)
        return 1;
      else if (e instanceof Add) {
        Add f = (Add)e;
        return 1 + numNodes (f.e1) + numNodes (f.e2);
      }
      …
    }

AST Generation
• ML-Yacc uses an attribute-grammar scheme
  • each nonterminal may have a semantic value associated with it
• When the parser reduces with (X ::= s1…sn)
  • a semantic action is executed
  • the action uses the semantic values of the symbols si
• When parsing completes successfully
  • the parser returns the semantic value associated with the start symbol
  • usually an abstract syntax tree

Attribute Grammars
[Figure: step-by-step parse of 2 + 3 * 4, showing the parser stack, the remaining input, and the partial tree attached to each symbol at every step]
Each nonterminal is associated with a tree.

Attribute Grammars

    datatype exp
      = Id of string
      | Num of int
      | Add of exp * exp
      | Times of exp * exp

    %%
    %%

    e -> e PLUS e   (Add (e1, e2))
       | e TIMES e  (Times (e1, e2))
       | ID         (Id ID)
       | NUM        (Num NUM)

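The same idea, building a tree node every time a rule is recognized, can be sketched without a parser generator. The code below is a hand-written recursive-descent parser in C, which is a different technique from the ML-Yacc bottom-up parser on the slides; it follows the E/T/F grammar from the earlier slide, with left recursion turned into loops. All function names are illustrative assumptions, and error handling is reduced to returning NULL (nodes built before an error are not freed).

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

typedef struct exp *exp;
enum expKind {INT, ID, ADD, TIMES};
struct exp {
  enum expKind kind;
  union {
    int i;
    char *id;
    struct {exp e1; exp e2;} add;
    struct {exp e1; exp e2;} times;
  } u;
};

static exp newExp(enum expKind kind) {
  exp e = malloc(sizeof(*e));
  assert(e != NULL);
  e->kind = kind;
  return e;
}
static exp mkInt(int i) { exp e = newExp(INT); e->u.i = i; return e; }
static exp mkId(char *id) { exp e = newExp(ID); e->u.id = id; return e; }
static exp mkBin(enum expKind kind, exp e1, exp e2) {
  exp e = newExp(kind);
  if (kind == ADD) { e->u.add.e1 = e1; e->u.add.e2 = e2; }
  else { e->u.times.e1 = e1; e->u.times.e2 = e2; }
  return e;
}

static const char *p;  /* cursor into the input string */
static void skip(void) { while (isspace((unsigned char)*p)) p++; }
static exp parseE(void);

/* F ::= id | num | ( E ) */
static exp parseF(void) {
  skip();
  if (isdigit((unsigned char)*p)) {
    int v = 0;
    while (isdigit((unsigned char)*p)) v = v * 10 + (*p++ - '0');
    return mkInt(v);                    /* action: Int v */
  }
  if (isalpha((unsigned char)*p)) {
    const char *start = p;
    while (isalnum((unsigned char)*p)) p++;
    size_t len = (size_t)(p - start);
    char *name = malloc(len + 1);
    assert(name != NULL);
    memcpy(name, start, len);
    name[len] = '\0';
    return mkId(name);                  /* action: Id name */
  }
  if (*p == '(') {
    p++;
    exp e = parseE();
    skip();
    if (e == NULL || *p != ')') return NULL;
    p++;
    return e;        /* parentheses leave no node in the tree */
  }
  return NULL;
}

/* T ::= T * F | F, written as a loop */
static exp parseT(void) {
  exp left = parseF();
  while (left != NULL) {
    skip();
    if (*p != '*') break;
    p++;
    exp right = parseF();
    if (right == NULL) return NULL;
    left = mkBin(TIMES, left, right);   /* action: Times (e1, e2) */
  }
  return left;
}

/* E ::= E + T | T, written as a loop */
static exp parseE(void) {
  exp left = parseT();
  while (left != NULL) {
    skip();
    if (*p != '+') break;
    p++;
    exp right = parseT();
    if (right == NULL) return NULL;
    left = mkBin(ADD, left, right);     /* action: Add (e1, e2) */
  }
  return left;
}

/* Returns the AST for src, or NULL on a syntax error. */
exp parse(const char *src) {
  p = src;
  exp e = parseE();
  skip();
  return (e != NULL && *p == '\0') ? e : NULL;
}
```

For the input "2 + 3 * x", parse returns the tree Add (Int 2, Times (Int 3, Id "x")); the value returned for the start symbol plays the role of the semantic value that ML-Yacc hands back.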
Source Position
• In a one-pass compiler, error messages are precise
  • errors are reported while the parser is still at the offending position, so early compilers never had to worry about this
• But in a multi-pass compiler, source positions must be stored in the AST itself

    (* Example *)
    type pos = …
    datatype exp
      = Int of int * pos
      | Id of string * pos
      | Add of exp * exp * pos
      | Times of exp * exp * pos

Source Position

    datatype exp
      = Id of string * pos
      | Num of int * pos
      | Add of exp * exp * pos
      | Times of exp * exp * pos

    %%
    %%

    e -> e PLUS e   (Add (e1, e2, PLUSleft))
       | e TIMES e  (Times (e1, e2, TIMESleft))
       | ID         (Id (ID, IDleft))
       | NUM        (Num (NUM, NUMleft))

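To see why a later pass needs the stored position, here is a small sketch in C. It assumes that pos is a character offset into the source (the slides leave the type open), and it uses a made-up check, "every identifier must be the single known name", as a stand-in for a real semantic analysis. The field name at and all function names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int pos;   /* assumption: character offset into the source text */

typedef struct exp *exp;
enum expKind {INT, ID, ADD, TIMES};
struct exp {
  enum expKind kind;
  pos at;          /* where this construct starts in the source */
  union {
    int i;
    char *id;
    struct {exp e1; exp e2;} add;
    struct {exp e1; exp e2;} times;
  } u;
};

/* Hypothetical constructors that record the position. */
static exp newExp(enum expKind kind, pos at) {
  exp e = malloc(sizeof(*e));
  assert(e != NULL);
  e->kind = kind;
  e->at = at;
  return e;
}
exp mkInt(int i, pos at) { exp e = newExp(INT, at); e->u.i = i; return e; }
exp mkId(char *id, pos at) { exp e = newExp(ID, at); e->u.id = id; return e; }
exp mkBin(enum expKind kind, exp e1, exp e2, pos at) {
  exp e = newExp(kind, at);
  if (kind == ADD) { e->u.add.e1 = e1; e->u.add.e2 = e2; }
  else { e->u.times.e1 = e1; e->u.times.e2 = e2; }
  return e;
}

/* A stand-in for a later pass: return the position of the first identifier
   that differs from `known`, or -1 if there is none. The parser is long
   gone by now, so the position can only come from the tree. */
pos firstUnknownId(exp e, const char *known) {
  pos left;
  switch (e->kind) {
  case INT:
    return -1;
  case ID:
    return strcmp(e->u.id, known) == 0 ? -1 : e->at;
  case ADD:
    left = firstUnknownId(e->u.add.e1, known);
    return left >= 0 ? left : firstUnknownId(e->u.add.e2, known);
  case TIMES:
    left = firstUnknownId(e->u.times.e1, known);
    return left >= 0 ? left : firstUnknownId(e->u.times.e2, known);
  }
  return -1;
}
```

For "2 + 3 * y", with y at offset 8, firstUnknownId (tree, "x") yields 8, which an error message can turn into a line and column.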
Labs
• For lab #4, your job is to produce abstract syntax trees from source programs
  • we provide a code skeleton; first familiarize yourself with it
  • understand the “layout” function, etc.
  • then hook up the parser by adding semantic actions
• Test your compiler carefully to make sure it parses the source programs correctly

Summary
• Abstract syntax trees are the compiler’s internal representation of source programs
  • they are the interface between the front end and the later parts of the compiler
• Abstract syntax tree design is language-dependent, and more art than science