Compiler Automation Tools
Why Automation?
• N programming languages
• M machines
• We need N*M compilers
• 1970s
  • automatic lexical analyzer (scanner) generators
  • automatic syntax analyzer (parser) generators
• Code generation module generators
  • some generators were developed only very recently
  • formalizing code generation is difficult
Compiler Generator
• Sometimes called a “Compiler-Compiler”
• Generation: language description + machine description → Compiler-Compiler → Compiler
• Use: source program → Compiler → object program
Scanner Generator
• Generation: regular expression + action code → Scanner Generator → Scanner
• Use: source program → Scanner → series of tokens
Implementation of Scanner Generator
• Recognition of regular expressions
  • RE → NFA (Non-deterministic Finite Automaton)
  • NFA → DFA (Deterministic Finite Automaton)
  • DFA optimization (by reducing the number of states)
• Scanner program generation
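The end product of these steps can be pictured as a transition table plus a small driver loop. The following is a minimal hand-written sketch, assuming a two-state DFA for the pattern [a-zA-Z]([a-zA-Z]|[0-9])*; the names char_class, transition, and dfa_accepts are hypothetical, and the code a real scanner generator emits is organized differently.

```c
#include <ctype.h>
#include <stddef.h>

/* Character classes used as table columns (hypothetical names). */
enum { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };

/* States: S_START = initial, S_IDENT = inside an identifier (accepting),
   S_DEAD = no transition exists. */
enum { S_DEAD = -1, S_START = 0, S_IDENT = 1, NUM_STATES = 2 };

/* transition[state][class] gives the next state. */
static const int transition[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IDENT, S_DEAD,  S_DEAD },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_DEAD },
};

static int char_class(char c)
{
    unsigned char u = (unsigned char)c;
    if (isalpha(u)) return C_LETTER;
    if (isdigit(u)) return C_DIGIT;
    return C_OTHER;
}

/* Returns 1 if the whole string s is accepted by the DFA, 0 otherwise. */
int dfa_accepts(const char *s)
{
    int state = S_START;
    for (size_t i = 0; s[i] != '\0'; i++) {
        state = transition[state][char_class(s[i])];
        if (state == S_DEAD) return 0;
    }
    return state == S_IDENT;
}
```

With this table, dfa_accepts("num") and dfa_accepts("x1") return 1, while dfa_accepts("1x") and dfa_accepts("") return 0. Keeping the automaton in a table is what makes it practical to generate: only the table contents change from one set of regular expressions to another.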
Lex
• Scanner Generator by M. E. Lesk, 1975
• Generation: Lex input (by user) → Lex → lex.yy.c
• Build: lex.yy.c + Lex library → C Compiler → Scanner
• Use: source program → Scanner → series of tokens
Lex Input Structure
<definitions>
%%
<rules>
%%
<user sub-programs>
Definition Part
%{
/* This part is copied verbatim into the generated program */
/* data structures, variables, constants used in action codes */
%}
name1 substitution1
name2 substitution2
…
%%
rules…
%%
user subprogram…
Definition Example
• Definitions
L [a-zA-Z]
D [0-9]
%%
• Rules
{L}({L}|{D})* return xxx;
• After substitution, the rule's pattern is [a-zA-Z]([a-zA-Z]|[0-9])*
Rules
%%
R1 A1
R2 A2
…
• R: regular expressions
• A: actions
Rule Examples
• In each rule, the regular expression comes first and the action follows it.
• int printf("found keyword int\n");
  • If the string “int” is matched in the input stream, output the message “found keyword int”.
• [0-9]+ { nc++; printf("found an integer constant\n"); }
  • If any numeric string is matched in the input stream, increment nc and output the message “found an integer constant”.
User Subprograms
• Copied into lex.yy.c as-is, without any processing by Lex.
Lex Regular Expressions - 1
• Lex RE := text characters + operator characters
• Operators
  • Escaping
    • a"*"b matches the literal string a*b // i.e., * is not an operator here
    • but a*b matches {b, ab, aab, aaab, …}
    • a\+b matches a+b // \ escapes only a single character
  • [ ] : character class (matches one character from a set)
  • - : range operator (inside [ ])
    • [a-z]: Any single character in the set {a, b, …, z}
    • [-+0-9]: Any single character that is a minus sign, a plus sign, or a digit from 0 to 9
    • [A-Za-z0-9]: Any single character in the set {A, B, …, Z, a, b, …, z, 0, 1, …, 9}
  • ^ : complement (as the first character inside [ ])
    • [^*]: Any single character except *
    • [^a-zA-Z]: Any single character except a letter of the Latin alphabet
Lex Regular Expressions - 2
• Operators
  • \ : Escape character (C-style escape sequences)
    • [ \t\n] : One of the space, tab, or newline characters
    • [\40-\176]: One character in the range from octal 40 (space) to octal 176 (~) in ASCII
  • * : zero or more repetitions
    • a*: one of the empty string, a, aa, aaa, …
    • [a-z][A-Z]* : e.g.) aA, a, bBBF, …
  • + : one or more repetitions
    • a+: one of a, aa, aaa, …
  • | : or
  • ^ : matches only when the string appears at the beginning of a line
    • ex) ^abc: matched when the string “abc” appears at the beginning of a line
  • $ : matches only at the end of a line
Lex Regular Expressions - 3
• Operators
  • / : trailing context
    • ex) ab/cd : “ab” is accepted only when “cd” follows “ab”
  • . : any single character except newline
  • ? : optional (zero or one occurrence)
    • ex) ab?c matches “abc” or “ac”
Lex Regular Expressions - 4
• Exercise: Represent the following tokens with Lex regular expressions
  • Identifiers in C language
    • answer) [a-zA-Z_][a-zA-Z0-9_]*
  • Real numbers in C language
    • answer) [0-9]+"."[0-9]+(e[+-]?[0-9]+)?
  • String constants in C language
    • answer) \"([^\042\134]|\\.)*\" cf) \042 = ", \134 = \
  • Comments in C language
    • answer) "/*"([^*]|"*"+[^*/])*"*"+"/"
Lex Action Code - 1
• Null statement
  • ex) [ \t\n] ; // for blank, tab, and newline, do nothing
• yytext
  • global variable in Lex
  • holds the text of the matched token
  • ex) print out the matched string
    • [a-z]+ printf("%s", yytext);
    • [a-z]+ ECHO;
Lex Action Code - 2
• Global variables in Lex
  • yytext
  • yyleng: the length of the matched string
• Global functions in Lex
  • yymore(): the next matched string is appended to the tail of the current yytext instead of replacing it
  • yyless(n): keeps the first n characters in yytext and returns the rest to the input stream for further processing
  • yywrap(): called automatically by Lex when it reaches the end of the input stream. It returns 1 in the normal case (no more input).
  • yylex(): the scanner function. It reads characters from the input stream, matches them against the rules, and runs the action of the matched rule. When that action executes a return statement, yylex() returns that value to its caller.
Lex Action Code - 3
• I/O Functions in Lex
  • input(): gets the next character from the input stream.
  • output(c): writes the character c to the output stream.
  • unput(c): pushes the already-read character c back onto the input stream, so that the next input() reads it again.
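The relationship between reading a character and pushing it back can be illustrated with a small pushback stack. This is a minimal sketch, not Lex's own implementation: the names set_input, next_char, and push_back are hypothetical stand-ins for the idea behind input() and unput(c), the input is assumed to be an in-memory string, and returning 0 at end of input is a choice made for this sketch.

```c
#include <stddef.h>

#define PUSHBACK_MAX 16

static const char *src = "";          /* remaining unread input (in-memory string) */
static char pushback[PUSHBACK_MAX];   /* characters returned to the input */
static size_t pushback_len = 0;

/* Start reading from the string s and discard any pushed-back characters. */
void set_input(const char *s)
{
    src = s;
    pushback_len = 0;
}

/* Returns the next character, preferring pushed-back ones; 0 at end of input. */
int next_char(void)
{
    if (pushback_len > 0)
        return (unsigned char)pushback[--pushback_len];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Pushes c back so that the next call to next_char() returns it.
   Returns 0 on success, -1 if the pushback buffer is full. */
int push_back(int c)
{
    if (pushback_len == PUSHBACK_MAX)
        return -1;
    pushback[pushback_len++] = (char)c;
    return 0;
}
```

After set_input("ab"), reading 'a', pushing it back, and reading again yields 'a' a second time before 'b'. Pushed-back characters come out in last-in, first-out order, which is why a scanner that looked too far ahead returns the extra characters in reverse.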
Example
• Convert “int” into “integer”, “{” into “begin”, and “}” into “end” in the input stream
%{
#include <stdio.h>
%}
%%
int printf("integer");
"{" printf("begin");
"}" printf("end");
Which Rules?
• When two or more rules match at the current input position:
  • rule 1: Lex takes the rule that matches the longest token.
  • rule 2: If the rules match tokens of the same length, Lex takes the rule defined first.
• ex)
integer printf("Keyword integer\n");
[a-z]+ printf("Identifier => %s\n", yytext);
  • when the input stream is “… integers …”, Lex takes the second rule (longer match)
  • when the input stream is “… integer …”, Lex takes the first rule (same length, defined first)
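The two disambiguation rules amount to a single comparison: a later rule replaces the current choice only if its match is strictly longer. The following is a minimal sketch of that selection logic for the two example rules; the matcher functions and choose_rule are hypothetical names, and a generated scanner reaches the same decision through its DFA rather than by trying rules one at a time.

```c
#include <stddef.h>
#include <string.h>

/* A matcher returns the length of the longest prefix of s it matches (0 = no match). */
typedef size_t (*matcher_fn)(const char *s);

/* Rule 0: the keyword "integer". */
static size_t match_keyword_integer(const char *s)
{
    const char *kw = "integer";
    size_t n = strlen(kw);
    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Rule 1: [a-z]+ */
static size_t match_lowercase_word(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Rules in the order they are defined. */
static const matcher_fn rules[] = { match_keyword_integer, match_lowercase_word };

/* Returns the index of the chosen rule, or -1 if no rule matches.
   If len is not NULL, the matched length is stored in *len. */
int choose_rule(const char *s, size_t *len)
{
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](s);
        if (n > best_len) {   /* strictly longer wins; a tie keeps the earlier rule */
            best = (int)i;
            best_len = n;
        }
    }
    if (len != NULL)
        *len = best_len;
    return best;
}
```

For the input "integers " the second rule wins with a match of length 8, and for "integer " both rules match 7 characters, so the first rule is kept.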
Lex Scanner Example (1)
• Lex File: “test.l”
%{
#include <stdio.h>
#include <stdlib.h>
enum tnumber {TEOF, TIDEN, TNUM, TASSIGN, TADD, TSEMI, TDOT, TBEGIN, TEND, TERROR};
%}
letter [a-zA-Z]
digit [0-9]
%%
begin return(TBEGIN); /* return value of yylex() */
end return(TEND);
{letter}({letter}|{digit})* return(TIDEN);
…
Lex Scanner Example (2)
• Lex File: “test.l” (continued)
%%
int main(void)
{
    enum tnumber tn; /* token number */
    printf("Start of Lex\n");
    while ((tn = yylex()) != TEOF) {
        switch (tn) {
        case TBEGIN: printf("Begin\n"); break;
        case TEND: printf("End\n"); break;
        …
        }
    }
    return 0;
}
Lex Scanner Example (3)
• Lex File: “test.l” (continued)
int yywrap()
{
    printf(" End of Lex\n");
    return 1;
}
Lex Scanner Example (4)
• Data File: “test.dat”
begin
num := 0;
num := num + 526;
end.
• Making the scanner
  • lex test.l generates lex.yy.c
  • compiling lex.yy.c and linking it with the Lex library produces scanner.exe
  • scanner < test.dat produces the output
Lex Scanner Example (5)
• Scan Result
Start of Lex
Begin
Identifier: num
Assignment_op
Number: 0
Semicolon
Identifier: num
…
End of Lex
Parser Generator
• PGS (Parser Generating System)
• PGS input
  • BNF or EBNF context-free grammar
• Generation: BNF/EBNF grammar → PGS → parsing table
• Use: token stream → Parser (driven by the parsing table) → program structure
YACC
• YACC (Yet Another Compiler Compiler)
  • Stephen C. Johnson, Bell Labs, 1975
  • LALR(1) parser generator
• Generation: YACC spec (*.y) → YACC → y.tab.c
• Build: y.tab.c + Yacc library → C Compiler → Parser
• Use: token stream → Parser → parser output
YACC Specification File
<definition part>
%%
<production rules>
%%
<user program part>
Production Rules
• A : BODY ;
• Example
  • BNF: <expression> ::= <expression> + <term>
  • YACC: expression : expression '+' term ;
• Example
exp : exp '+' term
    | exp '-' term
    ;
A : ; /* empty (ε) rule for A */
Token Names
• Tokens (terminals) should be predefined in <definition part>. (They are also the values passed from the scanner.)
%token name1 name2 …
• Example
%token TVAR
%%
var_dcl : TVAR var_def ';' ;
Start Symbol
• The start symbol can be explicitly defined in <definition part>: %start symbol_name
• If no start symbol is explicitly defined, the left-hand side of the first production rule is the start symbol.
Semantic Action
• A semantic action is executed when the parser reduces by the corresponding production rule.
• Example
exp : exp '+' term { printf("addition exp detected\n"); } ;
exp : term { printf("simple exp detected\n"); } ;
Pseudo Variables
• $1: the value of the first symbol on the right-hand side
• $2: the value of the second symbol on the right-hand side
• …
• $$: the value of the symbol on the left-hand side
• Example
factor : '-' factor { $$ = -$2; }
       | '(' exp ')' { $$ = $2; }
       | NUMBER { $$ = $1; }
       ;
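One way to read the pseudo variables is as positions on a value stack: a reduction looks at the topmost values (one per right-hand-side symbol), computes a result, and replaces them with it. The following is a minimal sketch under that reading, assuming integer semantic values; push_value, rhs_value, finish_reduce, and the reduce_* functions are hypothetical names, bounds checks are kept to a minimum, and the code YACC actually generates is more involved.

```c
#include <stddef.h>

#define STACK_MAX 64

static int value_stack[STACK_MAX];
static size_t top = 0;   /* number of values currently on the stack */

/* Called when a symbol is shifted: its value goes on top of the stack. */
void push_value(int v)
{
    if (top < STACK_MAX)
        value_stack[top++] = v;
}

/* Value of the k-th right-hand-side symbol ($k) of a rule with rhs_len symbols.
   Assumes at least rhs_len values are on the stack. */
static int rhs_value(size_t rhs_len, size_t k)
{
    return value_stack[top - rhs_len + (k - 1)];
}

/* Replaces the rhs_len topmost values by the single result ($$). */
static void finish_reduce(size_t rhs_len, int result)
{
    top -= rhs_len;
    value_stack[top++] = result;
}

/* factor : '-' factor    { $$ = -$2; } */
void reduce_negate(void)
{
    finish_reduce(2, -rhs_value(2, 2));
}

/* factor : '(' exp ')'   { $$ = $2; } */
void reduce_paren(void)
{
    finish_reduce(3, rhs_value(3, 2));
}

/* Value on top of the stack (the most recent $$). Assumes a non-empty stack. */
int top_value(void)
{
    return value_stack[top - 1];
}
```

For example, pushing a placeholder for '-' and then the value 7, followed by reduce_negate(), leaves -7 on top of the stack. Tokens without a meaningful value still occupy a slot, which is why $2 rather than $1 names the operand in both rules.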
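Intermediate Action Codes
X : Y { f(); } Z ;
…
• An action may appear in the middle of a rule: here f() runs after Y has been recognized and before Z is parsed.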
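Ambiguous Grammar
• For an ambiguous grammar such as the one below, YACC by default resolves the shift/reduce conflict by shifting, which groups operators to the right (“right-precedence” rule)
term : term '*' term ;
• term*term*term = term * (term * term)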
Implementations of Lex and Yacc
• AT&T Lex & Yacc
  • UNIX
  • AT&T, 1975
• Berkeley Lex & Yacc
  • UNIX (BSD version)
• GNU Bison & Flex
  • Bison: GNU Yacc
  • Flex: GNU Lex
  • Free source code
• …