CS510

CS510 Compiler Lecture 2

Phase Ordering of Front-Ends • Lexical analysis (lexer) • Break input string into “words” called tokens • Syntactic analysis (parser) • Recover structure from the text and put it in a parse tree • Semantic Analysis • Discover “meaning” (e.g., type-checking) • Prepare for code generation • Works with a symbol table

What is a Token? • A syntactic category • In English: • Noun, verb, adjective, … • In a programming language: • Identifier, Integer, Keyword, White-space, … • A token corresponds to a set of strings

Terms • Token • Syntactic “atoms” that are “terminal” symbols in the grammar from the source language • A data structure (or pointer to it) returned by lexer • Pattern • A “rule” that defines strings corresponding to a token • Lexeme • A string in the source code that matches a pattern

Terms Token classes or pattern correspond to sets of strings. • Identifier: • –strings of letters or digits, starting with a letter • Integer: • a non-empty string of digits • Keyword: • –“else” or “if” or “begin” or … • Whitespace: • –a non-empty sequence of blanks, newlines, and tabs

An implementation must do two things: 1.Recognize substrings corresponding to tokens • The lexemes 2.Identify the token class of each lexeme

An Example of these Terms • int foo = 100; • The lexeme matched by the pattern for the token represents a string of characters in the source program that can be treated as a lexical unit

Example x = x*c^2 <token class or Pattern, lexeme> = token <id,x> <assign-op ,=> <id,x> <mult-op,*> <id,c> <exp-op,^> <num,2>

Lexical Analysis Problem • Partition input string of characters into disjoint substrings which are tokens • Useful tokens here: identifier, keyword, relop, integer, white space, (, ), =, ;

Designing Lexical Analyzer • First, define a set of tokens • Tokens should describe all items of interest • Choice of tokens depends on the language and the design of the parser • Then, describe what strings belongs to each token by providing a pattern for it

Main points 1.The goal is to partition the string. This is implemented by reading left-to-right, recognizing one token at a time • Partition the input string into lexemes • Identify the token of each lexeme 2.“Lookahead” may be required to decide where one token ends and the next token begins

Review • Left-to-right scan, sometimes requiring lookahead • We still need • A way to describe the lexemes of each token: pattern • A way to resolve ambiguities • Is “==“ two equal signs “=“ “=“ or a single relational op?

The Scanning Process • Input: source code as a stream of characters • Output: meaningful units called Tokens • Keywords: such as IF and THEN which represent the strings of characters “if” and “then” • Symbols: such as PLUS and MINUS which represent the characters “+” and “-” • Variable tokens: such as ID and NUM which represent identifiers and numbers

The Scanning Process • Although the task of the scanner is to convert the entire source program into a sequence of tokens, the scanner will rarely do this all at once • Instead, the scanner will operate under the control of the parser, returning the single next token from the input on demand via a function that will have the C declaration: TokenType getToken();

The Scanning Process • The scanning process is a special case of pattern matching, so we need to study methods of: • Pattern specification We will use regular expressions for pattern specification • Pattern recognition We will use finite automata for recognizing patterns • We turn now to the study of regular expressions and finite automata

Topics for scanning Process

Topics • Regular Expressions • Finite Automata • Implementation of DFA

Regular Expression • Regular Expression • Regular Language • Regular Definition

Regular Expressions • There are several formalisms for specifying tokens but the most popular one is “regular languages” • A regular expression is an expression that matches sets of strings (the “language” of the regular expression)

Formal Language Theory • Alphabet∑ : a finite set of symbols (characters) • Ex: {a,b}, an ASCII character set • String: a finite sequence of symbols over ∑ • Ex: abab, aabb, a over {a,b}; “hello” over ASCII • Empty string є: zero-length string • є≠Ø≠ {є} • Language: a set of strings over ∑ • Ex: {a, b, abab} over {a,b} • Ex: a set of all valid C programs over ASCII

Regular Expressions • In its basic form, a regular expression is built up out of basic expressions (individual symbols) and the operations: • choice (|), • concatenation (no operator), and • repetition (*)

Operations on Strings • Concatenation (·): • a · b = ab, “hello” · ”there” = “hellothere” • Denoted by α· β = αβ • Exponentiation: • hello3 = hello · hello · hello = hellohellohello, hello0 = є

Regular Expressions (Continued) • Examples of Regular Expressions • a(b|c)* : strings beginning with a, followed by any number of b’s and c’s • ab|c* : the string ab, or the strings , c, cc, …. • (a|c)*b(a|c)*: strings contain exactly one b • (b|)(a|ab)* : strings of a’s and b’s with no two consecutive b’s

Regular Languages over ∑ • Definition of regular languages over ∑ • Ø is regular • {a} is regular • {є} is regular • R∪S is regular if R, S are regular • R·S is regular if R, S are regular • Nothing else

Regular Expressions • Easy way to express a language that is accepted by FSA • Rules: • ε is a regular expression • Any symbol in Σ is a regular expression If r and s are any regular expressions then so is: • r|s denotes union e.g. “r or s” • rs denotes r followed by s (concatination) • (r)* denotes concatination of r with itself zero or more times (Kleene closer) • () used for controlling order of operations

Regular Expressions (Continued) • Common Extensions of R.E. operations: • + : one or more repetitions of (r+ is equivalent to rr*) • [] : range of characters ([abcd] = [a-d] = a|b|c|d) • . : any character (.*b.*:strings that contain at least one b) • ^ : negate a set ([^abc]= any character excepta, b, orc) • ? : optional sub-expression ((+|-)?[0-9] = [0-9] | +[0-9] | -[0-9]) • \ : “escape” an operator or meta-symbol

Example reg.exp: a|b  language : {a,b} (a|b)(a|b)  {aa,ab, ba, bb} a*  {€,a, aa, aaa, …..}

You Try • What is the language described by each Regular Expression? a* (a|b)* a|a*b (a|b)(a|b) aa|ab|ba|bb (+|-|ε)(0|1|2|3|4|5|6|7|8|9)*

Exercise Which of the following matches regexp /a(ab)*a/ • 1) abababa • 2) aaba • 3) aabbaa • 4) aba • 5) aabababa

2 and 5

Which of the following matches Reg-exp : /a.[bc]+/ • 1) abc • 2) abbbbbbbb • 3) azc • 4) abcbcbcbbc • 5) ac • 6) asccbbbbcbcccc

RE Shorthands • r? = r|Є (zero or one instance of r) • r+ = r·r* (positive closure) • Charater class: [abc] = a|b|c, [a-z] = a|b|c|…|z • Ex: ([ab]c?)+ = {a, b, aa, ab, ac, ba, bb, bc,…}

Regular Definitions If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form: D1→ R1 D2 → R2 … Dn→ Rn • Each di is a new symbol not in Σ andnot the same as any other of the d’s. • Each ri is a regular expression over ΣU(d1,d2,…,di-1)

Regular Definitions Example Example C identifiers: Σ = ASCII letter_→ a|b|c|…|z|A|B|C|…|Z|_ digit→ 0|1|2|…|9 id→letter_(letter_|digit)*

Regular Definitions Example Example Unsigned Numbers (integer or float): Σ = ASCII digit→ 0|1|2|…|9 digits→digit digit* optionalFraction→ . digits | ε optionalExponent → (E(+|-| ε)digits)| ε number→digits optionalFraction optionalExponent

Finite automate

Patterns for token • digit  [0-9] • digits  digit+ • Letter  [A –Z a-z] • If if • then then • else else • Relop <|<=|=|<>|>|>= • Idletter(letter|digit)* • Numdigits+ (.digits+)?(E(+|-)?digits+)? • ws (blank|tab|newline)+

Transition Diagram • A transition diagram is a stylized flowchart. • Transition diagram is used to keep track of information about characters that are seen as the forward pointer scans the input. • We do so by moving from position to position in the diagrams as characters are read.

Transition Diagram • Have a collection of circles called states . • Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns . • Start state • Accept state • Any state

Transition Diagram • Edge : represent transitions from one state to another as caused by the input

Finite automata: why? • Regular expressions = specification • Finite automata = implementation

Definition : Finite Automaton

Exercise • A finite automaton that accepts only “a” • Language of FS is the set of accepted states

Finite Automata • Two types – both describe what are called regular languages • Deterministic (DFA) – There is a fixed number of states and we can only be in one state at a time • Nondeterministic (NFA) –There is a fixed number of states but we can be in multiple states at one time • NFA and DFA stand for nondeterministic finite automaton and deterministic finite automaton, respectively.

Finite Automata

State Diagram 1 0 0 q3 q1 q2 1 0, 1

Finite Automata • Transition • s1 a s2 Is read • In state s1 on input a go to state s2 • •If end of input and in accepting state => accept • •Otherwise => reject

Data Representation

CS510

CS510

Presentation Transcript

CS510 Concurrent Systems

CS510 Concurrent Systems

CS510 Concurrent Systems Class 13

CS510 Concurrent Systems

CS510 Concurrent Systems Jonathan Walpole

CS510 Concurrent Systems Jonathan Walpole

CS510 Concurrent Systems

CS510 Concurrent Systems Tyler Fetters

CS510 Concurrent Systems

CS510 Concurrent Systems Jonathan Walpole

CS510 AI and Games

CS510 Concurrent Systems Class 1

CS510 Concurrent Systems Jonathan Walpole

CS510 Concurrent Systems Jonathan Walpole

CS510 Concurrent Systems

CS510 Concurrent Systems Class 2

CS510 Concurrent Systems

CS510

CS510 Concurrent Systems Class 2a

CS510 Concurrent Systems

CS510 Concurrent Systems Jonathan Walpole

CS510 Concurrent Systems Jonathan Walpole