CS510 Compiler Lecture 2
Phase Ordering of Front-Ends • Lexical analysis (lexer) • Break input string into “words” called tokens • Syntactic analysis (parser) • Recover structure from the text and put it in a parse tree • Semantic Analysis • Discover “meaning” (e.g., type-checking) • Prepare for code generation • Works with a symbol table
What is a Token? • A syntactic category • In English: • Noun, verb, adjective, … • In a programming language: • Identifier, Integer, Keyword, White-space, … • A token corresponds to a set of strings
Terms • Token • Syntactic “atoms” that are “terminal” symbols in the grammar from the source language • A data structure (or pointer to it) returned by lexer • Pattern • A “rule” that defines strings corresponding to a token • Lexeme • A string in the source code that matches a pattern
Terms • Token classes (patterns) correspond to sets of strings. • Identifier: • strings of letters or digits, starting with a letter • Integer: • a non-empty string of digits • Keyword: • “else” or “if” or “begin” or … • Whitespace: • a non-empty sequence of blanks, newlines, and tabs
An implementation must do two things: 1. Recognize substrings corresponding to tokens (the lexemes) 2. Identify the token class of each lexeme
An Example of these Terms • int foo = 100; • The lexeme matched by the pattern for the token represents a string of characters in the source program that can be treated as a lexical unit
Example • Input: x = x*c^2 • Each token is a pair <token class or pattern, lexeme> • <id,x> <assign-op,=> <id,x> <mult-op,*> <id,c> <exp-op,^> <num,2>
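The token list in this example can be reproduced with a small sketch. The class names follow the slide; the patterns themselves, and the choice to drop whitespace instead of returning it, are assumptions made only for illustration.

```python
import re

# Hypothetical patterns for the token classes used in the slide's example.
TOKEN_SPEC = [
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("num", re.compile(r"[0-9]+")),
    ("assign-op", re.compile(r"=")),
    ("mult-op", re.compile(r"\*")),
    ("exp-op", re.compile(r"\^")),
    ("ws", re.compile(r"[ \t\n]+")),
]

def tokenize(text):
    """Scan left to right and return a list of (token class, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPEC:
            m = pattern.match(text, pos)
            if m:
                if name != "ws":  # whitespace is recognized but not reported
                    tokens.append((name, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no token matches at position {pos}")
    return tokens

print(tokenize("x = x*c^2"))
```

Because none of these patterns overlap, trying them in order is enough here; overlapping patterns need an explicit rule for choosing between matches.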
Lexical Analysis Problem • Partition input string of characters into disjoint substrings which are tokens • Useful tokens here: identifier, keyword, relop, integer, white space, (, ), =, ;
Designing Lexical Analyzer • First, define a set of tokens • Tokens should describe all items of interest • Choice of tokens depends on the language and the design of the parser • Then, describe which strings belong to each token by providing a pattern for it
Main points 1. The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time • Partition the input string into lexemes • Identify the token of each lexeme 2. “Lookahead” may be required to decide where one token ends and the next token begins
Review • Left-to-right scan, sometimes requiring lookahead • We still need • A way to describe the lexemes of each token: pattern • A way to resolve ambiguities • Is “==” two equal signs “=” “=” or a single relational op?
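One common way to resolve such ambiguities is the "longest match" rule: try every pattern at the current position and keep the longest lexeme, breaking ties in favor of the pattern listed first. The slides do not say which rule this course adopts, so the sketch below is an assumption, and the pattern set is hypothetical.

```python
import re

# Hypothetical patterns; earlier entries win when two matches have equal length.
PATTERNS = [
    ("keyword", re.compile(r"if|then|else")),
    ("id", re.compile(r"[A-Za-z][A-Za-z0-9]*")),
    ("relop", re.compile(r"==|<=|>=|<|>")),
    ("assign", re.compile(r"=")),
]

def longest_match(text, pos=0):
    """Return (token class, lexeme) for the longest match at pos, or None."""
    best = None
    for name, pattern in PATTERNS:
        m = pattern.match(text, pos)
        # Strict '>' keeps the earlier pattern on a tie (keyword beats id).
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (name, m.group())
    return best

print(longest_match("== y"))  # "==" is read as one relational operator
print(longest_match("= y"))   # a lone "=" is an assignment
print(longest_match("ifx"))   # the longer identifier wins over keyword "if"
```

Deciding between "=" and "==" requires looking one character past the first "=", which is the lookahead mentioned in the slide.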
The Scanning Process • Input: source code as a stream of characters • Output: meaningful units called Tokens • Keywords: such as IF and THEN which represent the strings of characters “if” and “then” • Symbols: such as PLUS and MINUS which represent the characters “+” and “-” • Variable tokens: such as ID and NUM which represent identifiers and numbers
The Scanning Process • Although the task of the scanner is to convert the entire source program into a sequence of tokens, the scanner will rarely do this all at once • Instead, the scanner will operate under the control of the parser, returning the single next token from the input on demand via a function that will have the C declaration: TokenType getToken();
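The on-demand interface can be sketched as follows. This is a Python analogue of the C declaration, not the course's scanner: the class name Scanner, the method get_token, the three token kinds, and the ("EOF", "") end marker are all assumptions for illustration.

```python
import re

# Hypothetical token patterns; leading whitespace is skipped before each token.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUM>[0-9]+)|(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<SYM>[-+=;()]))"
)

class Scanner:
    def __init__(self, source):
        self.source = source
        self.pos = 0  # index of the first unread character

    def get_token(self):
        """Return only the next (kind, lexeme) pair; ('EOF', '') at end of input."""
        m = TOKEN_RE.match(self.source, self.pos)
        if m is None:
            if self.source[self.pos:].strip() == "":
                return ("EOF", "")
            raise ValueError(f"unexpected character near position {self.pos}")
        self.pos = m.end()
        return (m.lastgroup, m.group(m.lastgroup))

# A parser would pull tokens one at a time, as this loop does.
scanner = Scanner("foo = 100;")
token = scanner.get_token()
while token[0] != "EOF":
    print(token)
    token = scanner.get_token()
```

The scanner keeps only its current position as state, so the parser can stop asking for tokens at any point without the rest of the input ever being scanned.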
The Scanning Process • The scanning process is a special case of pattern matching, so we need to study methods of: • Pattern specification We will use regular expressions for pattern specification • Pattern recognition We will use finite automata for recognizing patterns • We turn now to the study of regular expressions and finite automata
Topics • Regular Expressions • Finite Automata • Implementation of DFA
Regular Expression • Regular Expression • Regular Language • Regular Definition
Regular Expressions • There are several formalisms for specifying tokens but the most popular one is “regular languages” • A regular expression is an expression that matches sets of strings (the “language” of the regular expression)
Formal Language Theory • Alphabet ∑: a finite set of symbols (characters) • Ex: {a,b}, an ASCII character set • String: a finite sequence of symbols over ∑ • Ex: abab, aabb, a over {a,b}; “hello” over ASCII • Empty string є: zero-length string • є ≠ Ø ≠ {є} • Language: a set of strings over ∑ • Ex: {a, b, abab} over {a,b} • Ex: a set of all valid C programs over ASCII
Regular Expressions • In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations: • choice (|), • concatenation (no operator), and • repetition (*)
Operations on Strings • Concatenation (·): • a · b = ab, “hello” · “there” = “hellothere” • Denoted by α · β = αβ • Exponentiation: • hello^3 = hello · hello · hello = hellohellohello, hello^0 = є
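Both operations map directly onto ordinary string handling. The sketch below spells exponentiation out as repeated concatenation; the function name power is hypothetical, and the empty Python string stands in for є.

```python
def power(s, n):
    """Return s concatenated with itself n times; the empty string when n is 0."""
    result = ""  # the empty string plays the role of the empty string є
    for _ in range(n):
        result = result + s  # one concatenation per step
    return result

print("hello" + "there")  # concatenation
print(power("hello", 3))  # exponentiation
print(repr(power("hello", 0)))
```

Python's built-in `"hello" * 3` computes the same thing; the loop is written out only to mirror the definition.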
Regular Expressions (Continued) • Examples of Regular Expressions • a(b|c)* : strings beginning with a, followed by any number of b’s and c’s • ab|c* : the string ab, or the strings ε, c, cc, … • (a|c)*b(a|c)* : strings of a’s, b’s, and c’s that contain exactly one b • (b|ε)(a|ab)* : strings of a’s and b’s with no two consecutive b’s
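The informal descriptions of the last two examples can be checked by brute force: enumerate every short string over the alphabet and compare the regular expression against a direct test of the property. This sketch assumes Python's re syntax, in which an empty alternative such as `(b|)` plays the role of ε; the helper names are hypothetical.

```python
import re
from itertools import product

def strings(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees(pattern, alphabet, predicate, max_len):
    """True if whole-string matching and predicate agree on all short strings."""
    compiled = re.compile(pattern)
    return all(
        (compiled.fullmatch(s) is not None) == predicate(s)
        for s in strings(alphabet, max_len)
    )

# (b|ε)(a|ab)* versus "no two consecutive b's" over {a, b}
print(agrees(r"(b|)(a|ab)*", "ab", lambda s: "bb" not in s, 8))
# (a|c)*b(a|c)* versus "exactly one b" over {a, b, c}
print(agrees(r"(a|c)*b(a|c)*", "abc", lambda s: s.count("b") == 1, 6))
```

Agreement on all strings up to a fixed length is evidence, not a proof, that the description is right.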
Regular Languages over ∑ • Definition of regular languages over ∑ • Ø is regular • {a} is regular for each symbol a in ∑ • {є} is regular • R∪S is regular if R, S are regular • R·S is regular if R, S are regular • R* is regular if R is regular • Nothing else
Regular Expressions • Easy way to express a language that is accepted by an FSA • Rules: • ε is a regular expression • Any symbol in Σ is a regular expression • If r and s are any regular expressions then so is: • r|s denotes union, e.g. “r or s” • rs denotes r followed by s (concatenation) • (r)* denotes concatenation of r with itself zero or more times (Kleene closure) • () used for controlling order of operations
Regular Expressions (Continued) • Common Extensions of R.E. operations: • + : one or more repetitions (r+ is equivalent to rr*) • [] : range of characters ([abcd] = [a-d] = a|b|c|d) • . : any character (.*b.* : strings that contain at least one b) • ^ : negate a set ([^abc] = any character except a, b, or c) • ? : optional sub-expression ((+|-)?[0-9] = [0-9] | +[0-9] | -[0-9]) • \ : “escape” an operator or meta-symbol
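The equivalences in parentheses can be spot-checked by comparing the sets of short strings that two expressions match. The sketch assumes Python's re syntax, where a literal + must be escaped as `\+`; the helper name language and the small test alphabets are hypothetical.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """Set of strings over alphabet, up to max_len long, that pattern matches entirely."""
    compiled = re.compile(pattern)
    found = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if compiled.fullmatch(s):
                found.add(s)
    return found

# r+ is equivalent to rr*
print(language(r"a+", "ab", 4) == language(r"aa*", "ab", 4))
# [abcd] = [a-d] = a|b|c|d
print(language(r"[abcd]", "abcde", 2) == language(r"a|b|c|d", "abcde", 2))
# (+|-)?[0-9] = [0-9] | +[0-9] | -[0-9]
print(language(r"(\+|-)?[0-9]", "+-07", 3) == language(r"[0-9]|\+[0-9]|-[0-9]", "+-07", 3))
```

Each comparison is bounded by max_len, so it can expose a mistaken equivalence but cannot prove a correct one.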
Example • Regular expression : language • a|b : {a, b} • (a|b)(a|b) : {aa, ab, ba, bb} • a* : {ε, a, aa, aaa, …}
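These three languages can be computed directly by treating a language as a Python set of strings. The function names are hypothetical, and because a* is infinite the star operation is cut off at a maximum length, which is an assumption made only to keep the result finite.

```python
def union(r, s):
    """Language of r|s."""
    return r | s

def concat(r, s):
    """Language of rs: every string of r followed by every string of s."""
    return {x + y for x in r for y in s}

def star(r, max_len):
    """Strings of r* that are at most max_len long (the full language is infinite)."""
    result = {""}    # zero repetitions gives the empty string
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in r if len(x + y) <= max_len} - result
        result |= frontier
    return result

a, b = {"a"}, {"b"}
print(union(a, b))                       # a|b
print(concat(union(a, b), union(a, b)))  # (a|b)(a|b)
print(sorted(star(a, 3), key=len))       # a*, truncated at length 3
```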
You Try • What is the language described by each Regular Expression? • a* • (a|b)* • a|a*b • (a|b)(a|b) • aa|ab|ba|bb • (+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
Exercise Which of the following matches regexp /a(ab)*a/ • 1) abababa • 2) aaba • 3) aabbaa • 4) aba • 5) aabababa
Which of the following matches Reg-exp : /a.[bc]+/ • 1) abc • 2) abbbbbbbb • 3) azc • 4) abcbcbcbbc • 5) ac • 6) asccbbbbcbcccc
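Answers to both exercises can be checked mechanically. The sketch below assumes that "matches" means the whole candidate string must match, which is why it uses re.fullmatch and not re.search; the helper name matching is hypothetical.

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

first = ["abababa", "aaba", "aabbaa", "aba", "aabababa"]
second = ["abc", "abbbbbbbb", "azc", "abcbcbcbbc", "ac", "asccbbbbcbcccc"]

print(matching(r"a(ab)*a", first))
print(matching(r"a.[bc]+", second))
```

Under a substring interpretation (re.search) more candidates would qualify, so it is worth stating which reading an answer assumes.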
RE Shorthands • r? = r|ε (zero or one instance of r) • r+ = r·r* (positive closure) • Character class: [abc] = a|b|c, [a-z] = a|b|c|…|z • Ex: ([ab]c?)+ = {a, b, aa, ab, ac, ba, bb, bc, …}
Regular Definitions • If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form: • d1 → r1 • d2 → r2 • … • dn → rn • Each di is a new symbol not in Σ and not the same as any other of the d’s. • Each ri is a regular expression over Σ ∪ {d1, d2, …, di-1}
Regular Definitions Example • C identifiers: Σ = ASCII • letter_ → a|b|c|…|z|A|B|C|…|Z|_ • digit → 0|1|2|…|9 • id → letter_(letter_|digit)*
Regular Definitions Example • Unsigned numbers (integer or float): Σ = ASCII • digit → 0|1|2|…|9 • digits → digit digit* • optionalFraction → . digits | ε • optionalExponent → (E (+|-|ε) digits) | ε • number → digits optionalFraction optionalExponent
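A regular definition can be mirrored in code by building each pattern from the ones defined before it. The sketch below does this with string substitution in Python's re syntax; the variable names follow the slides, while the character-class spellings (`[0-9]`, `[A-Za-z_]`, `[+-]?`) are shorthand substitutions for the slide's alternations.

```python
import re

# Unsigned numbers: each definition uses only names defined earlier.
digit = r"[0-9]"
digits = f"{digit}{digit}*"
optional_fraction = rf"(\.{digits})?"       # . digits | ε
optional_exponent = rf"(E[+-]?{digits})?"   # (E (+|-|ε) digits) | ε
number = re.compile(f"{digits}{optional_fraction}{optional_exponent}")

# C identifiers, built the same way.
letter_ = r"[A-Za-z_]"
identifier = re.compile(f"{letter_}({letter_}|{digit})*")

for text in ["42", "3.14", "6.02E+23", "3.", ".5", "1E"]:
    print(text, bool(number.fullmatch(text)))
for text in ["_tmp1", "x", "9lives"]:
    print(text, bool(identifier.fullmatch(text)))
```

Because every name is expanded before use, the final pattern is an ordinary regular expression; the definitions add convenience, not expressive power.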
Patterns for tokens • digit → [0-9] • digits → digit+ • letter → [A-Za-z] • if → if • then → then • else → else • relop → < | <= | = | <> | > | >= • id → letter(letter|digit)* • num → digits (. digits)? (E (+|-)? digits)? • ws → (blank|tab|newline)+
Transition Diagram • A transition diagram is a stylized flowchart. • Transition diagram is used to keep track of information about characters that are seen as the forward pointer scans the input. • We do so by moving from position to position in the diagrams as characters are read.
Transition Diagram • Have a collection of circles called states. • Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns. • Start state • Accept state • Any state
Transition Diagram • Edges: represent transitions from one state to another as caused by the input
Finite automata: why? • Regular expressions = specification • Finite automata = implementation
Exercise • A finite automaton that accepts only “a” • The language of a finite automaton is the set of strings it accepts
Finite Automata • Two types – both describe what are called regular languages • Deterministic (DFA) – There is a fixed number of states and we can only be in one state at a time • Nondeterministic (NFA) –There is a fixed number of states but we can be in multiple states at one time • NFA and DFA stand for nondeterministic finite automaton and deterministic finite automaton, respectively.
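"Being in multiple states at one time" can be made concrete by tracking the set of states an NFA could currently be in. The automaton below is a hypothetical example, not one from the slides: it accepts strings over {a, b} that end in "ab", and the state names q0, q1, q2 are assumptions.

```python
# Hypothetical NFA: on 'a' in q0 the machine may stay in q0 or move to q1.
NFA = {
    ("q0", "a"): {"q0", "q1"},
    ("q0", "b"): {"q0"},
    ("q1", "b"): {"q2"},
}
START = "q0"
ACCEPTING = {"q2"}

def nfa_accepts(text):
    """Simulate the NFA by following every possible transition in parallel."""
    current = {START}
    for ch in text:
        # Union of the targets reachable from any current state on ch.
        current = set().union(*(NFA.get((state, ch), set()) for state in current))
    return bool(current & ACCEPTING)

for text in ["ab", "aab", "abab", "aba", ""]:
    print(repr(text), nfa_accepts(text))
```

A DFA is the special case in which the current set never holds more than one state.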
State Diagram • [Figure: state diagram with states q1, q2, q3 and transitions labeled 0 and 1]
Finite Automata • Transition • s1 --a--> s2 is read: • In state s1 on input a go to state s2 • If end of input and in accepting state => accept • Otherwise => reject
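The accept/reject rule can be sketched as a table-driven loop. The automaton here is one that accepts only the string "a"; the state names and the convention that a missing table entry means immediate rejection are assumptions for illustration.

```python
# Transition table: (state, input symbol) -> next state.
# Hypothetical DFA that accepts only the string "a".
DFA = {
    ("start", "a"): "seen_a",
}
START = "start"
ACCEPTING = {"seen_a"}

def dfa_accepts(text):
    """Follow one transition per input character, then check the final state."""
    state = START
    for ch in text:
        state = DFA.get((state, ch))
        if state is None:  # no transition defined: reject
            return False
    return state in ACCEPTING  # end of input: accept only in an accepting state

for text in ["a", "", "aa", "b"]:
    print(repr(text), dfa_accepts(text))
```

Treating a missing entry as rejection is equivalent to adding an explicit non-accepting "dead" state that every undefined transition leads to.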