Foundations of Software Design
Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs
Marti Hearst, Fall 2002

How Do Computers Work (Revisited)?
[Diagram of course topics: Programming Languages, Compiler, Assembly Language, CPU, Address Space, Circuits, Code vs. Data, Gates, Orders of Magnitude, Boolean Logic, Number Systems, Machine Instructions, Bits & Bytes, Binary Numbers]

The Compiler
• What is a compiler?
  • A recognizer (of some source language L).
  • A translator (of programs written in L into programs written in some object or target language L').
• A compiler is itself a program, written in some host language.
• It operates in phases.
[Diagram labels: Programming Languages, Compiler, Assembly Language, Machine Instructions]
Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

Converting Java to Byte Code
• When you compile a Java program, javac produces byte codes (stored in the class file).
• javac does not convert the byte codes to machine code.
• Instead, the Java Virtual Machine (VM) interprets them when you run the program with the java command.

[Diagram: C and Java compilation paths]
• C code -> translated by the C compiler (gcc or cc) -> assembly language -> machine code
• Java code -> translated by the Java compiler (javac or jit) -> byte code (class file) -> individual program is loaded and run in the JVM
• Other labels: Java Virtual Machine; "Creates the JVM once"

Compiler Compilers
• Which came first: the compiler or the program?
• The very first one has to be written in assembly language!
• This is why most programming languages today start with the C code generator.
• After you have created the first compiler for a given language, say Java, then you …
• Use that compiler to compile itself!

Compiling Your Compiler
• Write the first Java compiler using C; compile it using gcc -> javac in C
• Write the second Java compiler using Java; compile it using javac -> javac in Java
• Write other Java programs; compile them using javac

Compiler in more detail
• Lexical analyzer (scanner)
• Syntax analyzer (parser)
• Semantic analyzer
• Intermediate code generator
• Optimizer
• Code generator

The Scanner
• Task:
  • Translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes).
• How it’s done:
  • Specify lexemes using regular expressions.
  • Convert these regular expressions into finite automata.

Lexemes and Tokens
Here are some Java lexemes and the corresponding tokens:

  Lexeme   Token
  ;        SEMI-COLON
  =        ASSIGN
  index    IDENT
  tmp      IDENT
  37       INT-LIT
  102      INT-LIT

Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers).
Given the source code:
  position = initial + rate * 60 ;
a Java scanner would return the following sequence of tokens:
  IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON

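The token sequence above can be reproduced with a small regular-expression scanner. This is a minimal sketch, not how javac is implemented: the class name TinyScanner, its scan method, and the restriction to the six token kinds shown on the slide are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the token kinds from the slide are supported.
final class TinyScanner {
    // One named group per token kind. Group names cannot contain '-',
    // so KINDS maps each group name to the token name used on the slide.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<WS>\\s+)|(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)|(?<INTLIT>[0-9]+)"
        + "|(?<ASSIGN>=)|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<SEMI>;)");

    private static final String[][] KINDS = {
        {"IDENT", "IDENT"}, {"INTLIT", "INT-LIT"}, {"ASSIGN", "ASSIGN"},
        {"PLUS", "PLUS"}, {"TIMES", "TIMES"}, {"SEMI", "SEMI-COLON"},
    };

    /** Groups characters into lexemes and returns the token name of each one. */
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            m.region(pos, source.length());
            if (!m.lookingAt()) {
                // No token pattern matches here: a lexical error.
                throw new IllegalArgumentException("lexical error at offset " + pos);
            }
            for (String[] kind : KINDS) {
                if (m.group(kind[0]) != null) {
                    tokens.add(kind[1]);
                    break;
                }
            }
            pos = m.end(); // whitespace matches WS and is simply skipped
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("position = initial + rate * 60 ;"));
    }
}
```

A real scanner would also return the lexeme text with each token (the "additional information" passed to the parser); this sketch returns only the token names.
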
The Scanner
• Also called the lexer.
• How it works:
  • Reads characters from the source program.
  • Groups the characters into lexemes (sequences of characters that "go together").
  • Each lexeme corresponds to a token; the scanner returns the next token (plus maybe some additional information) to the parser.
  • The scanner may also discover lexical errors (e.g., erroneous characters).
• The definitions of what is a lexeme, token, or bad character all depend on the source language.

Two kinds of Automata
Deterministic (DFA):
• No state has more than one outgoing edge with the same label.
Non-deterministic (NFA):
• States may have more than one outgoing edge with the same label.
• Edges may be labeled with ε (epsilon), the empty string.
• The automaton can take an epsilon transition without looking at the current input character.

Regular Expressions to Finite Automata
• Generating a scanner:
  Lexical specification -> regular expressions -> NFA -> DFA -> table-driven implementation of DFA

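The last stage, a table-driven DFA, can be sketched in a few lines. This is a hand-written illustration rather than the output of a scanner generator: the class TableDfa, its three character classes, and its three states are assumptions, and it recognizes only simple identifiers and integer literals.

```java
// Hypothetical demo class: a hand-built transition table for two token kinds.
final class TableDfa {
    // Character classes (the columns of the transition table).
    private static final int LETTER = 0, DIGIT = 1, OTHER = 2;
    // DEAD means no outgoing edge exists for this input.
    private static final int DEAD = -1;

    // Rows are states: 0 = start, 1 = inside identifier, 2 = inside integer literal.
    private static final int[][] NEXT = {
        /* start      */ {1,    2, DEAD},
        /* identifier */ {1,    1, DEAD},
        /* integer    */ {DEAD, 2, DEAD},
    };

    // Token recognized when the input ends in each state (null = not accepting).
    private static final String[] ACCEPTS = {null, "IDENT", "INT-LIT"};

    private static int classify(char c) {
        if (Character.isLetter(c) || c == '_') return LETTER;
        if (c >= '0' && c <= '9') return DIGIT;
        return OTHER;
    }

    /** Returns the token name if the whole lexeme is accepted, otherwise null. */
    static String accept(String lexeme) {
        int state = 0;
        for (int i = 0; i < lexeme.length() && state != DEAD; i++) {
            state = NEXT[state][classify(lexeme.charAt(i))];
        }
        return state == DEAD ? null : ACCEPTS[state];
    }

    public static void main(String[] args) {
        System.out.println(accept("index")); // identifier
        System.out.println(accept("102"));   // integer literal
        System.out.println(accept("3x"));    // rejected
    }
}
```

Because every state has at most one edge per character class, the automaton is deterministic: the loop never has to choose between transitions.
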
BNF
• Backus-Naur form (also called Backus normal form)
• A set of rules (or productions), each of which expresses the ways symbols of the language can be grouped together
• Non-terminals are written in upper case
• Terminals are written in lower case
• The start symbol is the left-hand side of the first production
• The rules for a context-free grammar (CFG) are often referred to as its BNF

Java Identifier Definition
Described in the Java specification:
• http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591
• “An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.
• An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).”

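The two sentences of this definition translate almost directly into a check. The sketch below is an illustration under assumptions: the class IdentifierCheck and method isIdentifier are made-up names, the loop works on char values rather than full Unicode code points, and the keyword test relies on the standard library method SourceVersion.isKeyword, which covers keywords, boolean literals, and the null literal for the latest source version rather than for the second edition of the specification.

```java
import javax.lang.model.SourceVersion;

// Hypothetical demo class following the quoted definition.
final class IdentifierCheck {
    static boolean isIdentifier(String s) {
        // "...the first of which must be a Java letter."
        if (s.isEmpty() || !Character.isJavaIdentifierStart(s.charAt(0))) {
            return false;
        }
        // "...a sequence of Java letters and Java digits..."
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) {
                return false;
            }
        }
        // "...cannot have the same spelling as a keyword, Boolean literal,
        // or the null literal."
        return !SourceVersion.isKeyword(s);
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("tmp"));    // true
        System.out.println(isIdentifier("2index")); // false
        System.out.println(isIdentifier("while"));  // false
    }
}
```
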
Java Integer Literals
• An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8).
• Examples: 0  2  0372  0xDadaCafe  1996  0x00FF00FF
(opt means optional)

Defining Java Decimal Numerals
A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer.

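This rule can be written as a single regular expression, 0|[1-9][0-9]*. The sketch below assumes that pattern and uses a made-up class name, DecimalNumeral; it covers only the decimal numeral as stated above, not type suffixes or the other bases.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for the decimal-numeral rule.
final class DecimalNumeral {
    // Either the single digit 0, or a digit 1-9 followed by any digits 0-9.
    private static final Pattern DECIMAL = Pattern.compile("0|[1-9][0-9]*");

    static boolean matches(String s) {
        return DECIMAL.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("1996")); // true
        System.out.println(matches("0372")); // false: a leading 0 makes this octal
    }
}
```

The leading-zero case is the reason for the two alternatives: 0372 is a legal integer literal, but it is an octal numeral rather than a decimal one.
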
Defining Floating-Point Literals
A floating-point literal has the following parts: a whole-number part, a decimal point (represented by an ASCII period character), a fractional part, an exponent, and a type suffix. The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer.

The Functionality of the Parser
• Input: sequence of tokens from lexical analysis
• Output: parse tree of the program
  • a parse tree is generated if the input is a legal program
  • if the input is an illegal program, syntax errors are issued
• Note: instead of a parse tree, some parsers directly produce:
  • an abstract syntax tree (AST) + symbol table, or
  • intermediate code, or
  • object code

Parser vs. Scanner
The Parser
• Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
• Finds syntax errors.
  • Example: position = * 5 ;
  • corresponds to the sequence of tokens: IDENT ASSIGN TIMES INT-LIT SEMI-COLON
  • All are legal tokens, but that sequence of tokens is erroneous.
• Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
• Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

What must the parser do?
• Recognizer: not all strings of tokens are programs
  • must distinguish between valid and invalid strings of tokens
• Translator: must expose program structure
  • e.g., associativity and precedence
  • must return the parse tree
We need:
• A language for describing valid strings of tokens
  • context-free grammars
  • (analogous to regular expressions in the scanner)
• A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
  • the parser
  • (analogous to the state machine in the scanner)

Parser Example
  position = initial + rate * 60 ;

  =
  ├─ position
  └─ +
     ├─ initial
     └─ *
        ├─ rate
        └─ 60

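In the tree for this example, * sits below +, so the multiplication is grouped first. A recursive-descent parser can expose that precedence by giving each level its own method. The following is a minimal sketch under assumptions: the class TinyParser, the four-rule grammar in its comments, the whitespace-separated input, and the parenthesized prefix output format are all made up for illustration.

```java
// Hypothetical demo class. Assumed grammar:
//   statement -> IDENT "=" expr ";"
//   expr      -> term ("+" term)*
//   term      -> factor ("*" factor)*
//   factor    -> IDENT | INT-LIT
final class TinyParser {
    private final String[] toks;
    private int pos = 0;

    private TinyParser(String source) {
        toks = source.trim().split("\\s+"); // assumes tokens are space-separated
    }

    /** Returns the tree in prefix form, or throws on a syntax error. */
    static String parse(String source) {
        TinyParser p = new TinyParser(source);
        String tree = p.statement();
        if (p.pos != p.toks.length) {
            throw new IllegalArgumentException("unexpected token " + p.peek());
        }
        return tree;
    }

    private String peek() {
        return pos < toks.length ? toks[pos] : "<eof>";
    }

    private void expect(String t) {
        if (!peek().equals(t)) {
            throw new IllegalArgumentException("expected " + t + " but found " + peek());
        }
        pos++;
    }

    private String operand(String pattern) {
        String t = peek();
        if (!t.matches(pattern)) {
            throw new IllegalArgumentException("unexpected token " + t);
        }
        pos++;
        return t;
    }

    private String statement() {
        String target = operand("[A-Za-z_][A-Za-z_0-9]*");
        expect("=");
        String value = expr();
        expect(";");
        return "(= " + target + " " + value + ")";
    }

    private String expr() {
        String left = term();
        while (peek().equals("+")) {
            pos++;
            left = "(+ " + left + " " + term() + ")";
        }
        return left;
    }

    private String term() { // called from expr, so "*" binds tighter than "+"
        String left = factor();
        while (peek().equals("*")) {
            pos++;
            left = "(* " + left + " " + factor() + ")";
        }
        return left;
    }

    private String factor() {
        return operand("[A-Za-z_][A-Za-z_0-9]*|[0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(parse("position = initial + rate * 60 ;"));
    }
}
```

Given the erroneous input position = * 5 ; the sketch throws an exception at the * token, since no factor can start there.
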
The Semantic Analyzer
• The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors.
• Annotates and/or changes the abstract syntax tree
  • (e.g., it might annotate each node that represents an expression with its type).
• Example with before and after:

  Before:
  =
  ├─ position
  └─ +
     ├─ initial
     └─ *
        ├─ rate
        └─ 60

  After:
  = (float)
  ├─ position (float)
  └─ + (float)
     ├─ initial (float)
     └─ * (float)
        ├─ rate (float)
        └─ int-to-float() (float)
           └─ 60 (int)

Intermediate Code Generator
The intermediate code generator translates from abstract-syntax tree to intermediate code.
• One possibility is 3-address code.
• Here's an example of 3-address code for the abstract-syntax tree shown above:

  temp1 = int-to-float(60)
  temp2 = rate * temp1
  temp3 = initial + temp2
  position = temp3

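One way to obtain such code is a post-order walk of the tree that allocates a fresh temporary for each interior node. The sketch below is an illustration, not the lecture's generator: the class ThreeAddress, its Node type, and the methods assign and sample are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: emits 3-address code from a tiny expression tree.
final class ThreeAddress {
    /** A leaf has no children; a unary node has only left; a binary node has both. */
    static final class Node {
        final String text;
        final Node left, right;

        Node(String text, Node left, Node right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }

        static Node leaf(String name) {
            return new Node(name, null, null);
        }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    /** Emits code for the subtree and returns the name holding its value. */
    private String gen(Node n) {
        if (n.left == null) {
            return n.text; // variable or literal: no code needed
        }
        String a = gen(n.left);
        String rhs = (n.right == null)
            ? n.text + "(" + a + ")"              // unary conversion call
            : a + " " + n.text + " " + gen(n.right); // binary operator
        String t = "temp" + (++temps);
        code.add(t + " = " + rhs);
        return t;
    }

    /** Generates code that evaluates value and stores it in target. */
    static List<String> assign(String target, Node value) {
        ThreeAddress g = new ThreeAddress();
        String result = g.gen(value);
        g.code.add(target + " = " + result);
        return g.code;
    }

    /** The annotated tree for: position = initial + rate * 60 */
    static List<String> sample() {
        Node sixty = new Node("int-to-float", Node.leaf("60"), null);
        Node product = new Node("*", Node.leaf("rate"), sixty);
        Node sum = new Node("+", Node.leaf("initial"), product);
        return assign("position", sum);
    }

    public static void main(String[] args) {
        sample().forEach(System.out::println);
    }
}
```

Each emitted line has at most one operator and at most three names, which is what makes it 3-address code.
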
The Optimizer
• Examines the program and rewrites it in ways that preserve the meaning but are more efficient.
• Optimizers are incredibly complex programs and algorithms.
• Example:

  int count = 0;
  for (int j = 0; j < 2*5; j++) {
      int temp = j + 1;
      count += 3;
  }

  • Move the declaration of temp outside the loop so it isn’t re-declared every time the loop is executed.
  • Change 2*5 to 10, since it is a constant (no need to do an expensive multiply at run time).
  • If we removed the line with temp, the program might even skip the loop altogether.
  • You can see in advance that count ends up = 30.

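The rewrites on this slide can be checked by applying them by hand and comparing results. In the sketch below the class LoopOptimization and its three method names are assumptions; the transformations are done manually for illustration and do not claim to show what any particular compiler emits.

```java
// Hypothetical demo class: the slide's loop before and after hand-applied rewrites.
final class LoopOptimization {
    /** The loop as written on the slide. */
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1; // computed but never read
            count += 3;
        }
        return count;
    }

    /** After folding 2*5 into 10 and dropping the unused temp. */
    static int simplified() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    /** After evaluating the whole loop in advance: 10 iterations of +3. */
    static int precomputed() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() + " " + simplified() + " " + precomputed());
    }
}
```

All three methods return the same value, which is the sense in which an optimization preserves the meaning of the program.
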
The Code Generator
• The code generator generates object code from (optimized) intermediate code.

  LOADF  rate,R1
  MULF   #60.0,R1
  LOADF  initial,R2
  ADDF   R2,R1
  STOREF R1,position

Tools
• Scanner generator
  • Used to create a scanner automatically
  • Input: a regular expression for each token to be recognized
  • Output: a finite state machine
  • Examples: lex or flex (produce C code), or JLex (produces Java)
• Compiler compilers
  • yacc (produces C) or JavaCC (produces Java, also has a scanner generator)

What is a Graph? Slide adapted from Goodrich & Tamassia
Next Time
• Graph traversal
• Directed graphs (digraphs)
• DAGs
• Weighted graphs