CS 3304 Comparative Languages

CS 3304 Comparative Languages • Lecture 6: Semantic Analysis • 2 February 2012

Introduction • A context-free grammar can become very complex if it tries to do too much. • A rule that requires the compiler to compare things that are separated by long distances, or to count things that are not properly nested, ends up being a matter of semantics. • Semantic rules: • Static: enforced by the compiler at compile time. • Dynamic: enforced by compiler-generated code at run time. • Following parsing, the next two phases of the “typical” compiler are: • Semantic analysis. • (Intermediate) code generation. • Both are described in terms of annotations (attributes) of a parse tree. • Attribute grammars provide a formal framework.

Semantic Analyzer • The principal job of the semantic analyzer is to enforce static semantic rules: • Constructs a syntax tree (usually first). • Information gathered is needed by the code generator. • This interface is a boundary between the front end and the back end. • There is considerable variety in the extent to which parsing, semantic analysis, and intermediate code generation are interleaved. • Fully separated phases: a full parse tree, a syntax tree, and a semantic check. • Fully interleaved phases: no need to build both parse and syntax trees. • A common approach interleaves construction of a syntax tree with parsing (no explicit parse tree), then follows with separate, sequential phases for semantic analysis and code generation.

Dynamic Checks • Many compilers have an option to enable/disable code generation for dynamic checks: a questionable practice. • The consequences of an undetected error in production use are significantly worse than during testing. • Sometimes dynamic checks are cheap: they execute in instruction slots that would otherwise go unused. • Sometimes dynamic checks are expensive: pointer arithmetic in C.

Assertions • An assertion is a statement that a specified condition is expected to be true when execution reaches a certain point in the code (Java): assert denominator != 0; • Many languages support assertions via a standard library (C): assert(denominator != 0); • Other constructs include invariants, preconditions, and postconditions (Euclid, Eiffel): they can cover a potentially large number of places where an assertion would be required. • Semantic checking incurs significant run-time overhead, but recent hardware advances make it much more feasible.
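The Java assertion on this slide can be placed in context with a minimal sketch. The class and method names (AssertDemo, divide, checkedDivide) are hypothetical and not part of the lecture. The sketch contrasts an assert, which the JVM checks only when assertions are enabled with -ea, with an explicit precondition check that is always executed.

```java
// Hypothetical helper class; names are illustrative only.
final class AssertDemo {
    // Uses a Java assertion: checked only when the JVM runs with -ea
    // (assertions are disabled by default).
    static int divide(int numerator, int denominator) {
        assert denominator != 0 : "denominator must be non-zero";
        return numerator / denominator;
    }

    // Explicit precondition check: enforced regardless of -ea.
    static int checkedDivide(int numerator, int denominator) {
        if (denominator == 0) {
            throw new IllegalArgumentException("denominator must be non-zero");
        }
        return numerator / denominator;
    }

    public static void main(String[] args) {
        System.out.println(divide(6, 3));
        System.out.println(checkedDivide(7, 2));
    }
}
```

Running with java -ea AssertDemo enables the assert; without -ea only the explicit check in checkedDivide remains active, which is the enable/disable trade-off discussed under Dynamic Checks.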

Static Analysis • Compile-time algorithms that predict run-time behavior. • It is precise if it allows the compiler to determine whether a given program will always follow the rules: type checking. • Also useful when not precise: a combination of compile-time checks and generated code for run-time checks. • Static analysis is also used for code improvement: • Alias analysis: when values can be safely cached in registers. • Escape analysis: all references to a value are confined to a given context. • Subtype analysis: an OO variable is of a certain subtype. • Unsafe and speculative optimization. • Conservative and optimistic compilers. • Some languages have tighter semantic rules to avoid dynamic checking.

Attribute Grammars • Both semantic analysis and (intermediate) code generation can be described in terms of annotation, or “decoration” of a parse or syntax tree. • Attribute grammars provide a formal framework for decorating such a tree. • We'll start with decoration of parse trees, then consider syntax trees.

Example Grammar • Consider the following LR (bottom-up) grammar for arithmetic expressions made of constants, with precedence and associativity. By itself, the grammar says nothing about what the program means. • E → E + T • E → E - T • E → T • T → T * F • T → T / F • T → F • F → - F • F → ( E ) • F → const

Example Attribute Grammar SLR(1) • We can turn this into an attribute grammar as follows (similar to Figure 4.1); subscripts distinguish occurrences of the same symbol within a production: • E1 → E2 + T: E1.val = E2.val + T.val • E1 → E2 - T: E1.val = E2.val - T.val • E → T: E.val = T.val • T1 → T2 * F: T1.val = T2.val * F.val • T1 → T2 / F: T1.val = T2.val / F.val • T → F: T.val = F.val • F1 → - F2: F1.val = - F2.val • F → ( E ): F.val = E.val • F → const: F.val = C.val (C is the const token)

Attribute Rules • The attribute grammar serves to define the semantics of the input program. • Attribute rules are best thought of as definitions, not assignments. • They are not necessarily meant to be evaluated at any particular time, or in any particular order, though they do define their left-hand side in terms of the right-hand side.

Annotation • The process of evaluating attributes is called annotation, or decoration, of the parse tree (see Figure 4.2 for (1+3)*2): • When a parse tree under this grammar is fully decorated, the value of the expression will be in the val attribute of the root. • The code fragments for the rules are called semantic functions: • Strictly speaking, they should be cast as functions, e.g., E1.val = sum(E2.val, T.val). • This is a very simple attribute grammar: • Each symbol has at most one attribute: the punctuation marks have no attributes. • These attributes are all so-called synthesized attributes: • They are calculated only from the attributes of things below them in the parse tree.
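Decoration with synthesized attributes can be sketched as a post-order walk over an explicit parse tree. This is only an illustration under stated assumptions: the classes Decorator and PNode and the helper example() are hypothetical, the tree for (1+3)*2 is built by hand rather than by a parser, and division is integer division.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; class and method names are not from the lecture.
final class Decorator {
    // Parse-tree node: 'symbol' is a nonterminal name or token text,
    // 'val' is the single synthesized attribute (unused for punctuation).
    static final class PNode {
        final String symbol;
        final List<PNode> kids;
        int val;

        PNode(String symbol, PNode... kids) {
            this.symbol = symbol;
            this.kids = Arrays.asList(kids);
        }
    }

    // A const token: its val is set up front, as the scanner would do.
    static PNode constant(int value) {
        PNode n = new PNode("const");
        n.val = value;
        return n;
    }

    // Semantic functions, cast as functions rather than inline operators.
    static int sum(int a, int b) { return a + b; }
    static int difference(int a, int b) { return a - b; }
    static int product(int a, int b) { return a * b; }
    static int quotient(int a, int b) { return a / b; }

    // Post-order walk: children are decorated before their parent.
    static int decorate(PNode n) {
        if (n.kids.isEmpty()) return n.val;           // token: nothing to compute
        for (PNode k : n.kids) decorate(k);
        PNode first = n.kids.get(0);
        if (n.kids.size() == 1) {                     // E -> T, T -> F, F -> const
            n.val = first.val;
        } else if (n.kids.size() == 2) {              // F1 -> - F2
            n.val = -n.kids.get(1).val;
        } else if (first.symbol.equals("(")) {        // F -> ( E )
            n.val = n.kids.get(1).val;
        } else {                                      // binary productions
            int left = first.val;
            int right = n.kids.get(2).val;
            switch (n.kids.get(1).symbol) {
                case "+": n.val = sum(left, right); break;
                case "-": n.val = difference(left, right); break;
                case "*": n.val = product(left, right); break;
                case "/": n.val = quotient(left, right); break;
                default: throw new IllegalStateException("unknown operator");
            }
        }
        return n.val;
    }

    // Hand-built parse tree for (1+3)*2 under the grammar above.
    static PNode example() {
        PNode one = new PNode("E", new PNode("T", new PNode("F", constant(1))));
        PNode three = new PNode("T", new PNode("F", constant(3)));
        PNode sumE = new PNode("E", one, new PNode("+"), three);
        PNode paren = new PNode("T",
            new PNode("F", new PNode("("), sumE, new PNode(")")));
        PNode two = new PNode("F", constant(2));
        return new PNode("E", new PNode("T", paren, new PNode("*"), two));
    }

    public static void main(String[] args) {
        System.out.println(decorate(example()));
    }
}
```

After decorate returns, the val field of the root holds the value of the whole expression (8 for this tree), and every interior node holds the value of its subexpression.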

Decoration of a Parse Tree

Evaluating Attributes • In general, we are allowed both synthesized and inherited attributes. • Inherited attributes may depend on things above or to the side of them in the parse tree. • Tokens have only synthesized attributes, initialized by the scanner (name of an identifier, value of a constant, etc.). • Inherited attributes of the start symbol constitute run-time parameters of the compiler.

Synthesized Attributes • An S-attributed grammar uses only synthesized attributes. • Its attribute flow (attribute dependence graph) is purely bottom-up. • The arguments to semantic functions in an S-attributed grammar are always attributes of symbols on the right-hand side of the current production. • The return value is always placed into an attribute of the left-hand side of the production. • The intrinsic properties of tokens are synthesized attributes initialized by the scanner.

Inherited Attributes • Inherited attributes: values are calculated when their symbol is on the right-hand side of the current production. • Contextual information flows into a symbol from above or from the side, providing different context. • Symbol table information is commonly passed by means of inherited attributes. • Inherited attributes of the root of the parse tree can be used to represent the external environment. • Example: left-to-right associativity may create a situation where an S-attributed grammar would be cumbersome to use. By passing attribute values left-to-right in the tree, things are much simpler.

Example Attribute Grammar LL(1) • E → T TT: E.v = TT.v; TT.st = T.v • TT1 → + T TT2: TT1.v = TT2.v; TT2.st = TT1.st + T.v • TT1 → - T TT2: TT1.v = TT2.v; TT2.st = TT1.st - T.v • TT → ε: TT.v = TT.st • T → F FT: T.v = FT.v; FT.st = F.v • FT1 → * F FT2: FT1.v = FT2.v; FT2.st = FT1.st * F.v • FT1 → / F FT2: FT1.v = FT2.v; FT2.st = FT1.st / F.v • FT → ε: FT.v = FT.st • F1 → - F2: F1.v = - F2.v • F → ( E ): F.v = E.v • F → const: F.v = C.v (C is the const token)
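One way to read the LL(1) attribute grammar is as a recursive-descent evaluator in which each inherited attribute st becomes a method parameter and each synthesized attribute v becomes a return value. The sketch below assumes a hypothetical class LLEval, a hand-written character-level scanner, non-negative integer constants, and integer division; none of these come from the lecture.

```java
// Hypothetical sketch: one method per nonterminal of the LL(1) grammar.
final class LLEval {
    private final String src;
    private int pos = 0;

    private LLEval(String src) { this.src = src.replace(" ", ""); }

    // Evaluates a complete expression such as "(1+3)*2".
    static int eval(String text) {
        LLEval p = new LLEval(text);
        int v = p.parseE();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return v;
    }

    // Current character, or '$' at end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '$'; }

    // E -> T TT:  TT.st = T.v;  E.v = TT.v
    private int parseE() { return parseTT(parseT()); }

    // TT1 -> + T TT2 | - T TT2 | epsilon
    private int parseTT(int st) {
        if (peek() == '+') { pos++; return parseTT(st + parseT()); }
        if (peek() == '-') { pos++; return parseTT(st - parseT()); }
        return st;                                    // TT -> epsilon: TT.v = TT.st
    }

    // T -> F FT:  FT.st = F.v;  T.v = FT.v
    private int parseT() { return parseFT(parseF()); }

    // FT1 -> * F FT2 | / F FT2 | epsilon
    private int parseFT(int st) {
        if (peek() == '*') { pos++; return parseFT(st * parseF()); }
        if (peek() == '/') { pos++; return parseFT(st / parseF()); }
        return st;                                    // FT -> epsilon: FT.v = FT.st
    }

    // F1 -> - F2 | ( E ) | const
    private int parseF() {
        if (peek() == '-') { pos++; return -parseF(); }
        if (peek() == '(') {
            pos++;
            int v = parseE();
            if (peek() != ')') throw new IllegalArgumentException("expected ) at " + pos);
            pos++;
            return v;
        }
        int start = pos;
        while (Character.isDigit(peek())) pos++;
        if (start == pos) throw new IllegalArgumentException("expected constant at " + pos);
        return Integer.parseInt(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(eval("(1+3)*2"));
        System.out.println(eval("10-4-3"));
    }
}
```

Because the running subtotal st is passed down to the right, an input such as 10-4-3 is evaluated as (10-4)-3 even though the grammar itself is right-recursive; this is the left-to-right passing of attribute values mentioned under Inherited Attributes.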

Parse Tree for (1+3)*2

Attribute Flow • Well-defined attribute grammar: its rules determine a unique set of values for the attributes of every possible parse tree. • Noncircular attribute grammar: it never leads to a parse tree in which there are cycles in the attribute flow graph. • Translation scheme: an algorithm that decorates parse trees by invoking the rules of an attribute grammar in an order consistent with the tree’s attribute flow. • The LL(1) attribute grammar is a good bit messier than the SLR(1) one, but it is still L-attributed (the attributes can be evaluated in a single left-to-right pass over the input): • L-attributed grammars are the most general class of attribute grammars that can be evaluated during an LL parse. • S-attributed grammars are the most general class of attribute grammars that can be evaluated during an LR parse.
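The idea of an evaluation order consistent with the attribute flow can be illustrated with a topological sort of the attribute dependence graph. This is a generic sketch, not the algorithm used by any particular compiler: the class AttributeFlow, the method evaluationOrder, and the string names for attribute instances are all assumptions made for the example. A cycle in the graph is reported as an error, which corresponds to a circular attribute flow.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: orders attribute instances so that each one is
// evaluated only after everything it depends on.
final class AttributeFlow {
    // deps maps an attribute instance to the instances it is computed from.
    static List<String> evaluationOrder(Map<String, List<String>> deps) {
        Map<String, Integer> unmet = new HashMap<>();      // unmet dependency counts
        Map<String, List<String>> users = new HashMap<>(); // reverse edges
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            unmet.putIfAbsent(e.getKey(), 0);
            for (String d : e.getValue()) {
                unmet.putIfAbsent(d, 0);
                unmet.merge(e.getKey(), 1, Integer::sum);
                users.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : unmet.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String attr = ready.remove();
            order.add(attr);
            for (String u : users.getOrDefault(attr, Collections.emptyList())) {
                if (unmet.merge(u, -1, Integer::sum) == 0) ready.add(u);
            }
        }
        if (order.size() != unmet.size()) {
            throw new IllegalStateException("circular attribute flow");
        }
        return order;
    }

    public static void main(String[] args) {
        // Flow for E -> T TT with TT -> epsilon: T.v, then TT.st, TT.v, E.v.
        Map<String, List<String>> deps = new HashMap<>();
        deps.put("E.v", Collections.singletonList("TT.v"));
        deps.put("TT.v", Collections.singletonList("TT.st"));
        deps.put("TT.st", Collections.singletonList("T.v"));
        System.out.println(evaluationOrder(deps));
    }
}
```

For an L-attributed grammar no such general sort is needed, since a single left-to-right, depth-first traversal already visits attributes in a valid order.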

Parsers and Attribute Grammars • Each synthesized attribute of an LHS symbol (by definition of synthesized) depends only on attributes of its RHS symbols. • A bottom-up parser is in general paired with an S-attributed grammar. • Each inherited attribute of an RHS symbol (by definition of L-attributed) depends only on: • Inherited attributes of the LHS symbol, or • Synthesized or inherited attributes of symbols to its left in the RHS. • A top-down parser is in general paired with an L-attributed grammar. • There are certain tasks, such as generation of code for short-circuit Boolean expression evaluation, that are easiest to express with non-L-attributed attribute grammars. • Because of the potential cost of complex traversal schemes, however, most real-world compilers insist that the grammar be L-attributed.

Building a Syntax Tree • If intermediate code generation is interleaved with parsing, there is no need to build a syntax tree. • One-pass compiler: a compiler that interleaves semantic analysis and code generation with parsing. • Semantic analysis is easier to perform during a separate traversal of a syntax tree: • Add attribute rules to the context-free grammar. • The attributes point to nodes of a syntax tree. • Two examples: • Bottom-up attribute grammar. • Top-down attribute grammar.

Bottom-Up Attribute Grammar

Two Leaves: Constants 1 and 3

First Internal Node: 1+3

Third Leaf: Constant 2

Second Internal Node: (1+3)*2
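The bottom-up construction steps titled above (two leaves, first internal node, third leaf, second internal node) can be sketched with attribute rules whose values are references to syntax-tree nodes. The class SyntaxTreeDemo, the node types, and the helpers makeLeaf and makeBinOp are hypothetical names chosen for this example, and the reductions are written out by hand in the order a bottom-up parser would perform them.

```java
// Hypothetical sketch: attributes hold references to syntax-tree nodes.
final class SyntaxTreeDemo {
    abstract static class Ast {
        abstract int eval();
    }

    // Leaf node for a constant.
    static final class Num extends Ast {
        final int value;
        Num(int value) { this.value = value; }
        @Override int eval() { return value; }
        @Override public String toString() { return Integer.toString(value); }
    }

    // Internal node for a binary operator.
    static final class BinOp extends Ast {
        final char op;
        final Ast left, right;
        BinOp(char op, Ast left, Ast right) {
            this.op = op; this.left = left; this.right = right;
        }
        @Override int eval() {
            switch (op) {
                case '+': return left.eval() + right.eval();
                case '-': return left.eval() - right.eval();
                case '*': return left.eval() * right.eval();
                case '/': return left.eval() / right.eval();
                default: throw new IllegalStateException("unknown operator " + op);
            }
        }
        @Override public String toString() {
            return "(" + left + " " + op + " " + right + ")";
        }
    }

    // Semantic functions invoked by the attribute rules.
    static Ast makeLeaf(int value) { return new Num(value); }
    static Ast makeBinOp(char op, Ast left, Ast right) { return new BinOp(op, left, right); }

    // Reductions for (1+3)*2 in bottom-up order.
    static Ast build() {
        Ast one = makeLeaf(1);                  // F -> const: first leaf
        Ast three = makeLeaf(3);                // F -> const: second leaf
        Ast sum = makeBinOp('+', one, three);   // E1 -> E2 + T: first internal node
        Ast two = makeLeaf(2);                  // F -> const: third leaf
        return makeBinOp('*', sum, two);        // T1 -> T2 * F: second internal node
    }

    public static void main(String[] args) {
        Ast root = build();
        System.out.println(root + " = " + root.eval());
    }
}
```

Productions such as E → T, T → F, and F → ( E ) simply copy the reference upward, so the parentheses and the chains of unit productions in the parse tree leave no trace in the syntax tree.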

Top-Down Attribute Grammar

First Leaf: Constant 1

Second Leaf: Constant 3

Third Leaf: Constant 2

ANTLR Grammar: Program.g
grammar Program;
program: statement+ ;
statement
    : expression NEWLINE
    | ID '=' expression NEWLINE
    | NEWLINE
    ;
expression: multiplicationExpression (('+'|'-') multiplicationExpression)* ;
multiplicationExpression: atom ('*' atom)* ;
atom
    : INT
    | ID
    | '(' expression ')'
    ;
ID: ('a'..'z'|'A'..'Z')+ ;
INT: '0'..'9'+ ;
NEWLINE: '\r'? '\n' ;
WS: (' '|'\t')+ {skip();} ;

Using ANTLR • From the Generate menu, select the Generate Code menu item. • Three files are generated in the gencode subdirectory: • Program.tokens: the list of token-name, token-type assignments. • ProgramLexer.java: the lexer (scanner) generated from Program.g. • ProgramParser.java: the parser generated from Program.g. • Create a tester class (with main), e.g. RunProgram.java. • Compile and run: javac RunProgram.java ProgramParser.java ProgramLexer.java, then java RunProgram. • Make sure that the ANTLR jar file is in your class path or included in your Java installation. • ProgramEvaluation.g adds evaluation statements (in Java) to Program.g, turning it into an attribute grammar.

Main Class: RunProgram.java
import org.antlr.runtime.*;
public class RunProgram {
    public static void main(String[] args) throws Exception {
        // Chain: standard input -> lexer -> token stream -> parser.
        ProgramParser parser = new ProgramParser(
            new CommonTokenStream(
                new ProgramLexer(
                    new ANTLRInputStream(System.in))));
        parser.program();  // invoke the start rule
    }
}

Evaluate: ProgramEvaluation.g I
grammar ProgramEvaluation;
@header {
import java.util.HashMap;
}
@members {
HashMap symbolTable = new HashMap();
}
program: statement+ ;
statement
    : expression NEWLINE {System.out.println($expression.value);}
    | ID '=' expression NEWLINE {symbolTable.put($ID.text, new Integer($expression.value));}
    | NEWLINE
    ;

Evaluate: ProgramEvaluation.g II
expression returns [int value]
    : e=multiplicationExpression {$value = $e.value;}
      ( '+' e=multiplicationExpression {$value += $e.value;}
      | '-' e=multiplicationExpression {$value -= $e.value;}
      )*
    ;
multiplicationExpression returns [int value]
    : e=atom {$value = $e.value;}
      ( '*' e=atom {$value *= $e.value;} )*
    ;

Evaluate: ProgramEvaluation.g III
atom returns [int value]
    : INT {$value = Integer.parseInt($INT.text);}
    | ID {
        Integer v = (Integer)symbolTable.get($ID.text);
        if ( v!=null ) $value = v.intValue();
        else System.err.println("undefined variable "+$ID.text);
      }
    | '(' expression ')' {$value = $expression.value;}
    ;
ID: ('a'..'z'|'A'..'Z')+ ;
INT: '0'..'9'+ ;
NEWLINE: '\r'? '\n' ;
WS: (' '|'\t'|'\n'|'\r')+ {skip();} ;

Summary • The enforcement of static semantic rules and the generation of intermediate code can be cast in terms of annotation of a parse tree. • An attribute grammar associates attributes with each symbol in a context-free grammar and attribute rules with each production. • Attributes are either synthesized (bottom-up) or inherited (top-down). • The syntax tree is a tree representation of the abstract syntactic structure of source code.
