ANTLR v3 Overview (for ANTLR v2 users)

Terence Parr, University of San Francisco

Topics
• Information flow
• v3 grammars
• Error recovery
• Attributes
• Tree construction
• Tree grammars
• Code generation
• Internationalization
• Runtime support

Block Info Flow Diagram

Grammar Syntax
    header {…}
    /** doc comment */
    kind grammar name;
    options {…}
    tokens {…}
    scopes…
    action
    rules…
    /** doc comment */
    rule[String s, int z] returns [int x, int y] throws E
    options {…}
    scopes
    init {…}
        :  |  ;
        exceptions
Trees: ^(root child1 … childN)
Note: No inheritance

Grammar improvements
• Single element EBNF like ID*
• Combined parser/lexer
• Allows 'c' and "literal" literals
• Multiple parameters, return values
• Labels do not have to be unique: (x=ID|x=INT) {…$x…}
• For combined grammars, warns when tokens are not defined

Example Grammar
    grammar SimpleParser;
    program : variable* method+ ;
    variable: "int" ID ('=' expr)? ';' ;
    method : "method" ID '(' ')' '{' variable* statement+ '}' ;
    statement
        : ID '=' expr ';'
        | "return" expr ';'
        ;
    expr : ID | INT ;
    ID : ('a'..'z'|'A'..'Z')+ ;
    INT : '0'..'9'+ ;
    WS : (' '|'\t'|'\n')+ {channel=99;} ;

Using the parser
    import org.antlr.runtime.*;
    …
    CharStream in = new ANTLRFileStream("inputfile");
    SimpleParserLexer lexer = new SimpleParserLexer(in);
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    SimpleParser p = new SimpleParser(tokens);
    p.program(); // invoke start rule

Improved grammar warnings
• They happen less often ;)
• Internationalized (templates again!)
• Gives (smallest) sample input sequence
• Better recursion warnings

Recursion Warnings
    a : a A | B ;
    t.g:2:5: Alternative 1 discovers infinite left-recursion to a from a
    // with -Im 0 (secret internal parameter)
    a : b | B ;
    b : c ;
    c : B b ;
    t.g:2:5: Alternative 1: after matching input such as B decision cannot predict what comes next due to recursion overflow to c from b

Nondeterminisms
    a : (A B|A B) C ;
    t.g:2:5: Decision can match input such as "A B" using multiple alternatives: 1, 2
    As a result, alternative(s) 2 were disabled for that input
    t.g:2:5: The following alternatives are unreachable: 2
    a : (A+ B|A+ B) C ;
    t.g:2:5: Decision can match input such as "A B" using multiple alternatives: 1, 2

Runtime Objects of Interest
• Lexer passes all tokens to the parser, but the parser listens to only a single "channel"; channel 99, for example, where I place WS tokens, is ignored
• Tokens have start/stop indexes into a single text input buffer
• Token is an abstract class
• TokenSource: anything answering nextToken()
• TokenStream: stream pulling from a TokenSource; LT(i), …
• CharStream: source of characters for a lexer; LT(i), …
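The channel idea and the start/stop indexes can be illustrated without the ANTLR runtime. The following is a minimal plain-Java sketch, not the v3 API: SketchToken, ChannelFilterSketch, the token type names, and channel number 0 for the parser's channel are all assumptions made for illustration; only channel 99 for whitespace comes from the slide.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical token: a type, a channel, and start/stop indexes into one shared input buffer. */
final class SketchToken {
    static final int DEFAULT_CHANNEL = 0;  // assumed number for the channel the parser reads
    static final int HIDDEN_CHANNEL = 99;  // the slide's example channel for WS tokens

    final String type;
    final int channel;
    final int start; // index of the first character in the buffer
    final int stop;  // index of the last character in the buffer (inclusive)

    SketchToken(String type, int channel, int start, int stop) {
        this.type = type;
        this.channel = channel;
        this.start = start;
        this.stop = stop;
    }

    /** The token stores no text of its own; it slices the buffer on demand. */
    String text(String buffer) {
        return buffer.substring(start, stop + 1);
    }
}

final class ChannelFilterSketch {
    /** Naive lexer: emits every token, whitespace included, tagging WS with the hidden channel. */
    static List<SketchToken> lex(String input) {
        List<SketchToken> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int start = i;
            char c = input.charAt(i);
            String type;
            int channel = SketchToken.DEFAULT_CHANNEL;
            if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetter(input.charAt(i))) i++;
                type = "ID";
            } else if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                type = "INT";
            } else if (Character.isWhitespace(c)) {
                while (i < input.length() && Character.isWhitespace(input.charAt(i))) i++;
                type = "WS";
                channel = SketchToken.HIDDEN_CHANNEL;
            } else {
                i++; // any other single character is its own token
                type = "CHAR";
            }
            tokens.add(new SketchToken(type, channel, start, i - 1));
        }
        return tokens;
    }

    /** What the parser would see: only the tokens on the one channel it listens to. */
    static List<SketchToken> onChannel(List<SketchToken> all, int channel) {
        List<SketchToken> visible = new ArrayList<>();
        for (SketchToken t : all) {
            if (t.channel == channel) visible.add(t);
        }
        return visible;
    }

    public static void main(String[] args) {
        String input = "x = 42";
        List<SketchToken> all = lex(input);
        for (SketchToken t : onChannel(all, SketchToken.DEFAULT_CHANNEL)) {
            System.out.println(t.type + " '" + t.text(input) + "'");
        }
    }
}
```

Because the whitespace tokens are kept rather than thrown away, a tool that needs the original layout can still walk the full token list.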

Error Recovery
• ANTLR v3 does what Josef Grosch does in Cocktail
• Does single token insertion or deletion if necessary to keep going
• Computes context-sensitive FOLLOW to do insert/delete
    • Proper context is passed to each rule invocation
    • Knows precisely what can follow this particular reference to r rather than what could follow any reference to r (per Wirth circa 1970)
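The insert/delete decision can be sketched in a few lines of plain Java. This is an illustration under stated assumptions, not ANTLR's generated code: RecoverySketch, match, and methodHeader are hypothetical names, tokens are reduced to type strings, and the follow sets are written by hand for one context instead of being computed from a grammar.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class RecoverySketch {
    private final List<String> tokens; // token types, already lexed
    private int p = 0;                 // index of the next token to consume
    final List<String> errors = new ArrayList<>();

    RecoverySketch(String... tokens) {
        this.tokens = Arrays.asList(tokens);
    }

    /** Lookahead: la(1) is the next token type, la(2) the one after it. */
    private String la(int i) {
        int index = p + i - 1;
        return index < tokens.size() ? tokens.get(index) : "<EOF>";
    }

    private static Set<String> setOf(String... types) {
        return new HashSet<>(Arrays.asList(types));
    }

    /**
     * Match one token. follow holds what may legally come next in this exact
     * context; it is what makes the insertion decision context-sensitive.
     */
    boolean match(String expected, Set<String> follow) {
        if (la(1).equals(expected)) {
            p++;
            return true;
        }
        errors.add("mismatched token '" + la(1) + "'; expecting '" + expected + "'");
        if (la(2).equals(expected)) {
            p += 2; // single token deletion: skip the extra token, then consume the expected one
            return true;
        }
        if (follow.contains(la(1))) {
            return true; // single token insertion: act as if the expected token had been there
        }
        return false; // neither repair applies; a real parser would resynchronize here
    }

    /** Hand-coded version of: "method" ID '(' ')' '{' from the example grammar. */
    boolean methodHeader() {
        return match("method", setOf("ID"))
            && match("ID", setOf("("))
            && match("(", setOf(")"))
            && match(")", setOf("{"))
            && match("{", setOf("int", "ID", "return"));
    }

    public static void main(String[] args) {
        RecoverySketch missing = new RecoverySketch("method", "ID", "(", "{");
        System.out.println(missing.methodHeader() + " " + missing.errors);

        RecoverySketch extra = new RecoverySketch("method", "ID", "(", ")", ")", "{");
        System.out.println(extra.methodHeader() + " " + extra.errors);
    }
}
```

In both runs the header is accepted with exactly one reported error, so parsing can continue with the method body.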

Example Error Recovery
One token insertion:
    int i = 0;
    method foo( {
      int j = i;
      i = 4
    }
    [program, method]: line 2:12 mismatched token: [@14,23:23='{',<14>,2:12]; expecting type ')'
    [program, method, statement]: line 5:0 mismatched token: [@31,46:46='}',<15>,5:0]; expecting type ';'
One token deletion:
    int i = 0;
    method foo() ) {
      int j = i;
      i = = 4;
    }
    [program, method]: line 2:13 mismatched token: [@15,24:24=')',<13>,2:13]; expecting type '{'
    [program, method, statement, expr]: line 4:6 mismatched token: [@32,47:47='=',<6>,4:6]; expecting set null
Note: I put in two errors each so you'll see it continues properly

Attributes
• New label syntax and multiple return values
• Unified token, rule, parameter, return value, tree reference syntax in actions
• Dynamically scoped attributes!
    a[String s] returns [float y]
        : id=ID f=field (ids+=ID)+ {$s, $y, $id, $id.text, $f.z; $ids.size();}
        ;
    field returns [int x, int z]
        : …
        ;

Label properties
• Token label reference properties
    • text, type, line, pos, channel, index, tree
• Rule label reference properties
    • start, stop; indices of token boundaries
    • tree
    • text; text matched for whole rule

Rule Scope Attributes
• A rule may define a scope of attributes visible to any invoked rule; operates like a stacked global variable
• Avoids having to pass a value down
    method
    scope { String name; }
        : "method" ID '(' ')' {$name=$ID.text;} body
        ;
    body: '{' stat* '}' ;
    …
    atom
    init {… $method.name …}
        : ID | INT
        ;
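One way to picture "a stacked global variable" is a stack of scope objects that the defining rule pushes on entry and pops on exit, with deeper rules reading the top. The plain-Java sketch below shows that idea only; DynamicScopeSketch and its methods are hypothetical, hand-written stand-ins for the rules on the slide, and the code ANTLR actually generates may be organized differently.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class DynamicScopeSketch {
    /** One frame of the slide's rule scope: scope { String name; } */
    static final class MethodScope {
        String name;
    }

    private final Deque<MethodScope> methodStack = new ArrayDeque<>();
    final StringBuilder log = new StringBuilder();

    /** Stand-in for rule "method": owns the scope for the duration of the call. */
    void method(String id, String... atoms) {
        methodStack.push(new MethodScope()); // entering the rule pushes a fresh frame
        try {
            methodStack.peek().name = id;    // plays the role of {$name=$ID.text;}
            body(atoms);
        } finally {
            methodStack.pop();               // leaving the rule discards the frame
        }
    }

    /** Stand-in for rule "body": note that it never receives the method name. */
    void body(String... atoms) {
        for (String a : atoms) {
            atom(a);
        }
    }

    /** Stand-in for rule "atom": reads the value the way $method.name would. */
    void atom(String text) {
        log.append(text).append(" in ").append(methodStack.peek().name).append('\n');
    }

    public static void main(String[] args) {
        DynamicScopeSketch parser = new DynamicScopeSketch();
        parser.method("foo", "x", "1");
        parser.method("bar", "y");
        System.out.print(parser.log);
    }
}
```

The name reaches atom without appearing in the signature of body, which is the point of the second bullet above.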

Global Scope Attributes
• Named scopes; rules must explicitly request access
    scope Symbols {
      List names;
    }
    {int level=0;}
    globals
    scope Symbols;
    init {
      level++;
      $Symbols.names = new ArrayList();
    }
        : decl* {level--;}
        ;
    block
    scope Symbols;
    init {
      level++;
      $Symbols.names = new ArrayList();
    }
        : '{' decl* stat* '}' {level--;}
        ;
    decl : "int" ID ';' {$Symbols.names.add($ID);} ;
*What if we want to keep the symbol tables around after parsing?

Tree Support
• TreeAdaptor: how to create and navigate trees (like ASTFactory from v2); ANTLR assumes tree nodes are of type Object
• Tree: used by support code
• BaseTree: list of children, w/o payload (no more child-sibling trees)
• CommonTree: node wrapping a Token as payload
• ParseTree: used by interpreter to build trees

Tree Construction
• Automatic mechanism is the same as v2 except ^ is now ^^
    expr : atom ( '+'^^ atom )* ;
• ^ implies root of tree for enclosing subrule
    a : ( ID^ INT )* ;
  builds (a 1) (b 2) …
• Token labels are $label not #label, and rule invocation tree results are $ruleLabel.tree
• Turn on with
    options {output=AST;}
  (one can imagine output=text for templates)
• Option: ASTLabelType=CommonTree;
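To make the effect of the rule-level root operator concrete, here is a hand-written plain-Java sketch of what expr : atom ( '+'^^ atom )* ; would be expected to build: each operator becomes the new root and the tree built so far becomes its first child. AutoRootSketch and Node are hypothetical names, the input is a pre-split token list, and this is an assumption-laden illustration of the described behavior, not output from ANTLR.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AutoRootSketch {
    /** Minimal tree node with a list of children. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) {
            this.text = text;
        }

        /** Prints leaves as their text and subtrees as (root child1 ... childN). */
        @Override
        public String toString() {
            if (children.isEmpty()) return text;
            StringBuilder sb = new StringBuilder("(").append(text);
            for (Node c : children) sb.append(' ').append(c);
            return sb.append(')').toString();
        }
    }

    /** Expects tokens of the form: atom (operator atom)*. */
    static Node expr(List<String> tokens) {
        Node root = new Node(tokens.get(0));             // first atom starts the tree
        for (int i = 1; i + 1 < tokens.size(); i += 2) {
            Node op = new Node(tokens.get(i));           // operator marked as rule-level root
            op.children.add(root);                       // tree so far becomes its first child
            op.children.add(new Node(tokens.get(i + 1))); // next atom is added as a child
            root = op;
        }
        return root;
    }

    public static void main(String[] args) {
        System.out.println(expr(Arrays.asList("1", "+", "2", "+", "3"))); // prints (+ (+ 1 2) 3)
    }
}
```

The left-leaning result mirrors left-to-right evaluation of the operators.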

Tree Rewrite Rules
• Maps an input grammar fragment to an output tree grammar fragment
    variable : type declarator ';' -> ^(VAR_DEF type declarator) ;
    functionHeader
        : type ID '(' ( formalParameter ( ',' formalParameter )* )? ')'
          -> ^(FUNC_HDR type ID formalParameter+)
        ;
    atom
        : …
        | '(' expr ')' -> expr
        ;

Mixed Rewrite/Auto Trees
• Alternatives w/o -> rewrite use automatic mechanism
    b : ID INT -> INT ID
      | INT // implies -> INT
      ;

Rewrites and labels
• Labels disambiguate element references or are used to construct imaginary nodes
• Concatenation += labels useful too:
    forStat
        : "for" '(' start=assignStat ';' expr ';' next=assignStat ')' block
          -> ^("for" $start expr $next block)
        ;
    block
        : lc='{' variable* stat* '}' -> ^(BLOCK[$lc] variable* stat*)
        ;
    /** match string representation of tree and build tree in memory */
    tree : '^' '(' root=atom (children+=tree)+ ')' -> ^($root $children)
         | atom
         ;

Loops in Rewrites
• Repeated element
    ID ID -> ^(VARS ID+)
  yields ^(VARS a b)
• Repeated tree
    ID ID -> ^(VARS ID)+
  yields ^(VARS a) ^(VARS b)
• Multiple elements in a loop need the same size
    ID INT ID INT -> ^( R ID ^( S INT) )+
  yields (R a (S 1)) (R b (S 2))
• Checks cardinality of + and * loops
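The difference between looping inside a tree and looping around it, plus the two cardinality checks, can be mimicked with plain string building. The Java sketch below is only an analogy: RewriteLoopSketch and its methods are hypothetical, trees are rendered as strings, and the exceptions stand in for whatever ANTLR reports when a + loop is empty or loop elements differ in size.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RewriteLoopSketch {
    /** Analogy for ^(VARS ID+): one tree, the loop runs inside it. */
    static String loopInside(List<String> ids) {
        requireAtLeastOne(ids);
        return "(VARS " + String.join(" ", ids) + ")";
    }

    /** Analogy for ^(VARS ID)+: the loop runs around the tree, one tree per ID. */
    static List<String> loopAround(List<String> ids) {
        requireAtLeastOne(ids);
        List<String> trees = new ArrayList<>();
        for (String id : ids) {
            trees.add("(VARS " + id + ")");
        }
        return trees;
    }

    /** Analogy for ^( R ID ^( S INT) )+: both lists advance in lock step. */
    static List<String> paired(List<String> ids, List<String> ints) {
        requireAtLeastOne(ids);
        if (ids.size() != ints.size()) {
            throw new IllegalStateException("loop elements differ in size: "
                + ids.size() + " vs " + ints.size());
        }
        List<String> trees = new ArrayList<>();
        for (int i = 0; i < ids.size(); i++) {
            trees.add("(R " + ids.get(i) + " (S " + ints.get(i) + "))");
        }
        return trees;
    }

    /** A + loop needs at least one element; a * loop would skip this check. */
    private static void requireAtLeastOne(List<String> elements) {
        if (elements.isEmpty()) {
            throw new IllegalStateException("+ loop needs at least one element");
        }
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b");
        System.out.println(loopInside(ids));                       // (VARS a b)
        System.out.println(loopAround(ids));                       // [(VARS a), (VARS b)]
        System.out.println(paired(ids, Arrays.asList("1", "2")));  // [(R a (S 1)), (R b (S 2))]
    }
}
```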

Preventing cyclic structures
• Repeated elements get duplicated
    a : INT -> INT INT ; // dups INT!
    a : INT INT -> INT+ INT+ ; // 4 INTs!
• Repeated rule references get duplicated
    a : atom -> ^(atom atom) ; // no cycle!
• Duplicates whole tree for all but first ref to an element; here 2nd ref to atom results in a duplicated atom tree
• *Useful example: "int x,y" -> "^(int x) ^(int y)"
    decl : type ID (',' ID)* -> ^(type ID)+ ;
*Just noticed a bug in this one ;)
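The "all but first ref" rule can be pictured as a small holder that hands out the original node once and deep copies afterwards. The plain-Java sketch below illustrates that policy for a : atom -> ^(atom atom) ; the names DupOnReuseSketch, ElementRef, and deepCopy are hypothetical and this is not how the ANTLR runtime is necessarily structured.

```java
import java.util.ArrayList;
import java.util.List;

final class DupOnReuseSketch {
    /** Minimal tree node with a list of children. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text) {
            this.text = text;
        }

        /** Copies this node and everything below it. */
        Node deepCopy() {
            Node copy = new Node(text);
            for (Node c : children) {
                copy.children.add(c.deepCopy());
            }
            return copy;
        }
    }

    /** Hands out an element for a rewrite: the original first, duplicates after that. */
    static final class ElementRef {
        private final Node original;
        private boolean used = false;

        ElementRef(Node original) {
            this.original = original;
        }

        Node next() {
            if (!used) {
                used = true;
                return original;
            }
            return original.deepCopy();
        }
    }

    /** Sketch of: a : atom -> ^(atom atom) ; */
    static Node rewrite(Node atom) {
        ElementRef ref = new ElementRef(atom);
        Node root = ref.next();   // first reference: the original node becomes the root
        Node child = ref.next();  // second reference: a duplicate, taken before root gets children
        root.children.add(child);
        return root;
    }

    public static void main(String[] args) {
        Node root = rewrite(new Node("x"));
        Node child = root.children.get(0);
        System.out.println(root.text + " " + child.text + " " + (root == child)); // prints x x false
    }
}
```

Without the copy, the node would be added as its own child and any tree walk would loop forever.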

Predicated rewrites
• Use semantic predicate to indicate which rewrite to choose from
    a : ID INT
        -> {p1}? ID
        -> {p2}? INT
        ->
      ;

Misc Rewrite Elements
• Arbitrary actions
    a : atom -> ^({adaptor.createToken(INT,"9")} atom) ;
• Rewrite always sets the rule's AST, not the subrule's
• Reference to previous value (useful?)
    b : "int"
        ( ID -> ^(TYPE "int" ID)
        | ID '=' INT -> ^(TYPE "int" ID INT)
        )
      ;
    a : (atom -> atom) (op='+' r=atom -> ^($op $a $r) )* ;

Tree Grammars
• Syntax same as parser grammars; add ^(root children…) tree element
• Uses LL(*) also; even derives from same superclass! Tree is serialized to include DOWN, UP imaginary tokens to encode 2D structure for serial parser
    variable
        : ^(VAR_DEF type ID)
        | ^(VAR_DEF type ID ^(INIT expr))
        ;
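The serialization with DOWN and UP can be shown with a short preorder walk. The plain-Java sketch below is a minimal illustration with hypothetical names (TreeStreamSketch, Node, serialize), using strings in place of real tokens; it flattens the second alternative of the variable rule above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TreeStreamSketch {
    /** Minimal tree node with a list of children. */
    static final class Node {
        final String text;
        final List<Node> children = new ArrayList<>();

        Node(String text, Node... kids) {
            this.text = text;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Flattens a tree into a one-dimensional stream a serial parser can read. */
    static List<String> serialize(Node tree) {
        List<String> out = new ArrayList<>();
        walk(tree, out);
        return out;
    }

    private static void walk(Node t, List<String> out) {
        out.add(t.text);
        if (!t.children.isEmpty()) {
            out.add("DOWN");      // imaginary token: the child list starts here
            for (Node c : t.children) {
                walk(c, out);
            }
            out.add("UP");        // imaginary token: back up to the parent's level
        }
    }

    public static void main(String[] args) {
        // ^(VAR_DEF type ID ^(INIT expr))
        Node tree = new Node("VAR_DEF",
            new Node("type"),
            new Node("ID"),
            new Node("INIT", new Node("expr")));
        // prints VAR_DEF DOWN type ID INIT DOWN expr UP UP
        System.out.println(String.join(" ", serialize(tree)));
    }
}
```

Because DOWN and UP bracket each child list, the flat stream is unambiguous, which is what lets a tree grammar reuse ordinary parsing machinery.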

Code Generation
• Uses StringTemplate to specify how each abstract ANTLR concept maps to code; wildly successful!
• Separates code gen logic from output; not a single character of output in the Java code
• Java.stg: 140 templates, 1300 lines

Sample code gen templates
    /** Dump the elements one per line and stick in debugging
     *  location() trigger in front.
     */
    element() ::= <<
    <if(debug)>
    dbg.location(<it.line>,<it.pos>);<\n>
    <endif>
    <it.el><\n>
    >>
    /** match a token optionally with a label in front */
    tokenRef(token,label,elementIndex) ::= <<
    <if(label)>
    <label>=input.LT(1);<\n>
    <endif>
    match(input,<token>,FOLLOW_<token>_in_<ruleName><elementIndex>);
    >>

Internationalization
• ANTLR v3 uses StringTemplate to display all errors
• Senses locale to load messages; en.stg: 76 templates
• ErrorManager error number constants map to a template name; e.g.,
    RULE_REDEFINITION(file,line,col,arg) ::=
        "<loc()>rule <arg> redefinition"
    /* This factors out file location formatting; file,line,col inherited from
     * enclosing template; don't manually pass stuff in.
     */
    loc() ::= "<file>:<line>:<col>: "
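The lookup chain (error number, then template name, then locale-specific text) can be sketched in plain Java. This is a rough stand-in and does not use StringTemplate: MessageLookupSketch, the constant value 1, the maps, and the naive placeholder substitution are all assumptions for illustration, and loc() is simply inlined into the message text.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

final class MessageLookupSketch {
    /** Hypothetical error number; the real constant values are not given in the slides. */
    static final int MSG_RULE_REDEFINITION = 1;

    /** Error number -> template name. */
    static final Map<Integer, String> TEMPLATE_NAMES = new HashMap<>();

    /** Locale language -> (template name -> message text with <attr> holes). */
    static final Map<String, Map<String, String>> GROUPS = new HashMap<>();

    static {
        TEMPLATE_NAMES.put(MSG_RULE_REDEFINITION, "RULE_REDEFINITION");
        Map<String, String> en = new HashMap<>();
        // loc() from the slide is inlined here as "<file>:<line>:<col>: "
        en.put("RULE_REDEFINITION", "<file>:<line>:<col>: rule <arg> redefinition");
        GROUPS.put("en", en);
    }

    /** Picks the message group for the language (falling back to "en") and fills the holes. */
    static String render(String language, int errorNumber, Map<String, String> attrs) {
        Map<String, String> group = GROUPS.getOrDefault(language, GROUPS.get("en"));
        String text = group.get(TEMPLATE_NAMES.get(errorNumber));
        for (Map.Entry<String, String> e : attrs.entrySet()) {
            text = text.replace("<" + e.getKey() + ">", e.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put("file", "t.g");
        attrs.put("line", "2");
        attrs.put("col", "5");
        attrs.put("arg", "a");
        String language = Locale.getDefault().getLanguage();
        // prints t.g:2:5: rule a redefinition
        System.out.println(render(language, MSG_RULE_REDEFINITION, attrs));
    }
}
```

Supporting another language then means adding one more message group; the code that raises the error only ever mentions the error number.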

Runtime Support
• Better organized, separated:
    org.antlr.runtime
    org.antlr.runtime.tree
    org.antlr.runtime.debug
• Clean; Parser has input ptr only (except error recovery FOLLOW stack); Lexer also only has input ptr
• 4500 lines of Java code minus BSD header

Summary
• v3 kicks ass
• it sort of works!
• http://www.antlr.org/download/…
• ANTLRWorks progressing in parallel
