1 / 67

CS510

CS510. Compiler Lecture 2. Phase Ordering of Front-Ends. Lexical analysis (lexer) Break input string into “words” called tokens Syntactic analysis (parser) Recover structure from the text and put it in a parse tree Semantic Analysis Discover “meaning” (e.g., type-checking)

monroel
Download Presentation

CS510

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CS510 Compiler Lecture 2

  2. Phase Ordering of Front-Ends • Lexical analysis (lexer) • Break input string into “words” called tokens • Syntactic analysis (parser) • Recover structure from the text and put it in a parse tree • Semantic Analysis • Discover “meaning” (e.g., type-checking) • Prepare for code generation • Works with a symbol table

  3. What is a Token? • A syntactic category • In English: • Noun, verb, adjective, … • In a programming language: • Identifier, Integer, Keyword, White-space, … • A token corresponds to a set of strings

  4. Terms • Token • Syntactic “atoms” that are “terminal” symbols in the grammar from the source language • A data structure (or pointer to it) returned by lexer • Pattern • A “rule” that defines strings corresponding to a token • Lexeme • A string in the source code that matches a pattern

  5. Terms Token classes or pattern correspond to sets of strings. • Identifier: • –strings of letters or digits, starting with a letter • Integer: • a non-empty string of digits • Keyword: • –“else” or “if” or “begin” or … • Whitespace: • –a non-empty sequence of blanks, newlines, and tabs

  6. An implementation must do two things: 1.Recognize substrings corresponding to tokens • The lexemes 2.Identify the token class of each lexeme

  7. An Example of these Terms • int foo = 100; • The lexeme matched by the pattern for the token represents a string of characters in the source program that can be treated as a lexical unit

  8. Example x = x*c^2 <token class or Pattern, lexeme> = token <id,x> <assign-op ,=> <id,x> <mult-op,*> <id,c> <exp-op,^> <num,2>

  9. Lexical Analysis Problem • Partition input string of characters into disjoint substrings which are tokens • Useful tokens here: identifier, keyword, relop, integer, white space, (, ), =, ;

  10. Designing Lexical Analyzer • First, define a set of tokens • Tokens should describe all items of interest • Choice of tokens depends on the language and the design of the parser • Then, describe what strings belongs to each token by providing a pattern for it

  11. Main points 1.The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time • Partition the input string into lexemes • Identify the token of each lexeme 2.“Lookahead” may be required to decide where one token ends and the next token begins

  12. Review • Left-to-right scan, sometimes requiring lookahead • We still need • A way to describe the lexemes of each token: pattern • A way to resolve ambiguities • Is “==“ two equal signs “=“ “=“ or a single relational op?

  13. The Scanning Process • Input: source code as a stream of characters • Output: meaningful units called Tokens • Keywords: such as IF and THEN which represent the strings of characters “if” and “then” • Symbols: such as PLUS and MINUS which represent the characters “+” and “-” • Variable tokens: such as ID and NUM which represent identifiers and numbers

  14. The Scanning Process • Although the task of the scanner is to convert the entire source program into a sequence of tokens, the scanner will rarely do this all at once • Instead, the scanner will operate under the control of the parser, returning the single next token from the input on demand via a function that will have the C declaration: TokenType getToken();

  15. The Scanning Process • The scanning process is a special case of pattern matching, so we need to study methods of: • Pattern specification We will use regular expressions for pattern specification • Pattern recognition We will use finite automata for recognizing patterns • We turn now to the study of regular expressions and finite automata

  16. Topics for scanning Process

  17. Topics • Regular Expressions • Finite Automata • Implementation of DFA

  18. Regular Expression • Regular Expression • Regular Language • Regular Definition

  19. Regular Expressions • There are several formalisms for specifying tokens but the most popular one is “regular languages” • A regular expression is an expression that matches sets of strings (the “language” of the regular expression)

  20. Formal Language Theory • Alphabet∑ : a finite set of symbols (characters) • Ex: {a,b}, an ASCII character set • String: a finite sequence of symbols over ∑ • Ex: abab, aabb, a over {a,b}; “hello” over ASCII • Empty string є: zero-length string • є≠Ø≠ {є} • Language: a set of strings over ∑ • Ex: {a, b, abab} over {a,b} • Ex: a set of all valid C programs over ASCII

  21. Regular Expressions • In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations: • choice (|), • concatenation (no operator), and • repetition (*)

  22. Operations on Strings • Concatenation (·): • a · b = ab, “hello” · ”there” = “hellothere” • Denoted by α· β = αβ • Exponentiation: • hello3 = hello · hello · hello = hellohellohello, hello0 = є

  23. Regular Expressions (Continued) • Examples of Regular Expressions • a(b|c)* : strings beginning with a, followed by any number of b’s and c’s • ab|c* : the string ab, or the strings , c, cc, …. • (a|c)*b(a|c)*: strings contain exactly one b • (b|)(a|ab)* : strings of a’s and b’s with no two consecutive b’s

  24. Regular Languages over ∑ • Definition of regular languages over ∑ • Ø is regular • {a} is regular • {є} is regular • R∪S is regular if R, S are regular • R·S is regular if R, S are regular • Nothing else

  25. Regular Expressions • Easy way to express a language that is accepted by FSA • Rules: • ε is a regular expression • Any symbol in Σ is a regular expression If r and s are any regular expressions then so is: • r|s denotes union e.g. “r or s” • rs denotes r followed by s (concatination) • (r)* denotes concatination of r with itself zero or more times (Kleene closer) • () used for controlling order of operations

  26. Regular Expressions (Continued) • Common Extensions of R.E. operations: • + : one or more repetitions of (r+ is equivalent to rr*) • [] : range of characters ([abcd] = [a-d] = a|b|c|d) • . : any character (.*b.*:strings that contain at least one b) • ^ : negate a set ([^abc]= any character excepta, b, orc) • ? : optional sub-expression ((+|-)?[0-9] = [0-9] | +[0-9] | -[0-9]) • \ : “escape” an operator or meta-symbol

  27. Example reg.exp: a|b  language : {a,b} (a|b)(a|b)  {aa,ab, ba, bb} a*  {€,a, aa, aaa, …..}

  28. You Try • What is the language described by each Regular Expression? a* (a|b)* a|a*b (a|b)(a|b) aa|ab|ba|bb (+|-|ε)(0|1|2|3|4|5|6|7|8|9)*

  29. Exercise Which of the following matches regexp /a(ab)*a/ • 1)      abababa • 2)      aaba • 3)      aabbaa • 4)      aba • 5)      aabababa

  30. 2 and 5

  31. Which of the following matches Reg-exp : /a.[bc]+/ • 1)      abc • 2)      abbbbbbbb • 3)      azc • 4)      abcbcbcbbc • 5)      ac • 6)      asccbbbbcbcccc

  32. 1

  33. RE Shorthands • r? = r|Є (zero or one instance of r) • r+ = r·r* (positive closure) • Charater class: [abc] = a|b|c, [a-z] = a|b|c|…|z • Ex: ([ab]c?)+ = {a, b, aa, ab, ac, ba, bb, bc,…}

  34. Regular Definitions If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form: D1→ R1 D2 → R2 … Dn→ Rn • Each di is a new symbol not in Σ andnot the same as any other of the d’s. • Each ri is a regular expression over ΣU(d1,d2,…,di-1)

  35. Regular Definitions Example Example C identifiers: Σ = ASCII letter_→ a|b|c|…|z|A|B|C|…|Z|_ digit→ 0|1|2|…|9 id→letter_(letter_|digit)*

  36. Regular Definitions Example Example Unsigned Numbers (integer or float): Σ = ASCII digit→ 0|1|2|…|9 digits→digit digit* optionalFraction→ . digits | ε optionalExponent → (E(+|-| ε)digits)| ε number→digits optionalFraction optionalExponent

  37. Finite automate

  38. Patterns for token • digit  [0-9] • digits  digit+ • Letter  [A –Z a-z] • If if • then then • else else • Relop <|<=|=|<>|>|>= • Idletter(letter|digit)* • Numdigits+ (.digits+)?(E(+|-)?digits+)? • ws (blank|tab|newline)+

  39. Transition Diagram • A transition diagram is a stylized flowchart. • Transition diagram is used to keep track of information about characters that are seen as the forward pointer scans the input. • We do so by moving from position to position in the diagrams as characters are read.

  40. Transition Diagram • Have a collection of circles called states . • Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns . • Start state • Accept state • Any state

  41. Transition Diagram • Edge : represent transitions from one state to another as caused by the input

  42. Finite automata: why? • Regular expressions = specification • Finite automata = implementation

  43. Definition : Finite Automaton

  44. Exercise • A finite automaton that accepts only “a” • Language of FS is the set of accepted states

  45. Finite Automata • Two types – both describe what are called regular languages • Deterministic (DFA) – There is a fixed number of states and we can only be in one state at a time • Nondeterministic (NFA) –There is a fixed number of states but we can be in multiple states at one time • NFA and DFA stand for nondeterministic finite automaton and deterministic finite automaton, respectively.

  46. Finite Automata

  47. State Diagram 1 0 0 q3 q1 q2 1 0, 1

  48. Finite Automata • Transition • s1 a s2 Is read • In state s1 on input a go to state s2 • •If end of input and in accepting state => accept • •Otherwise => reject

  49. Data Representation

More Related