SCANNING I.E. LEXICAL ANALYSIS I.E. LINEAR ANALYSIS
Interaction of Parser and Scanner • The source program in a high-level language (HLL) is a stream of characters read from left to right. • The scanner reads characters from that stream and groups them into tokens, which are returned to the parser. • Whitespace (blanks, tabs, returns) and comments are eliminated.
Benefits of Modularity (Scanning is a separate module) • Simplicity: separate input from the rest of the program, which only deals with tokens and not characters.
Benefits of Modularity (Scanning is a separate module) • Simplicity • Efficiency: • Scanning is I/O intensive • Use buffering to improve efficiency
Benefits of Modularity (Scanning is a separate module) • Simplicity • Efficiency • Portability: (historical) different machines might have different keyboards, e.g. Pascal: @ and up-arrow
Benefits of Modularity (Scanning is a separate module) • Internationalisation, e.g. • ALGOL 68 in many languages: Russian, German, French, Bulgarian, Japanese • LOGO (educational): English, French, Italian • Mama (educational): English, Hebrew, Yiddish, Chinese • M4 macro processor: English, German • Scratch (educational): 40+ languages • Perl: Klingon (different grammar as well)
What are tokens? • The most basic meaningful objects in a computer program: • Sequences of characters which form the low-level constructs of the HLL (e.g. variable names, keywords, labels, operators) • Examples: A[12,index2] = getstuff(23); B[66,88] = getmorestuff(11);
Token Structure • Early on, the parser only needs to recognize the validity of the structure; the values of identifiers do not matter. • Tokens have 2 fields: type (compulsory) and value (optional) • Example • Vocabulary: • the type is often called “token” • the string being scanned (instance) is called a “lexeme”
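The two-field token structure can be sketched as follows for the first example statement. This is only an illustration: the type names (ID, NUM, LBRACKET, and so on) are assumptions, not names given in the slides, and the token list is written by hand rather than produced by a scanner.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    type: str                    # compulsory: the token class, e.g. "ID" or "NUM"
    value: Optional[str] = None  # optional: the lexeme, kept only when it matters


# Hand-written token sequence for:  A[12,index2] = getstuff(23);
# The type names are hypothetical; a real language defines its own set.
tokens = [
    Token("ID", "A"), Token("LBRACKET"), Token("NUM", "12"), Token("COMMA"),
    Token("ID", "index2"), Token("RBRACKET"), Token("ASSIGN"),
    Token("ID", "getstuff"), Token("LPAREN"), Token("NUM", "23"),
    Token("RPAREN"), Token("SEMI"),
]

# To check structure, the parser only needs the sequence of types.
types = [t.type for t in tokens]
```

Only identifiers and numbers carry a value here; for punctuation the type alone identifies the lexeme.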
How to write a scanner • Brute-force scanning (e.g. section that handles identifiers starting with ‘c’)
How to write a scanner • Brute-force scanning (e.g. section that handles identifiers starting with ‘c’) • Problems?
How to write a scanner • Hand-coded Finite State Automaton (FSA) or Transition Diagram • EXERCISE: Draw FSA for example above • Assume FSA recognizes « class », « case » and other identifiers which only contain letters.
Finite State Automata • A Finite State Automaton (FSA) A consists of 5 objects • A set I called the input alphabet, of input symbols • A set S of states the automaton can be in; • A designated state s0 called the initial state; • A designated set of states called the set of accepting states, or final states; • A next-state function N: S×I → S that associates a “next-state” to each ordered pair consisting of a “current state” and “current input”. For each state s in S and input symbol m in I, N(s,m) is called the state to which A goes if m is input to A when A is in state s.
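One possible way to write this definition down as a data structure is sketched below. The field names and the example automaton (binary strings ending in 1) are assumptions chosen for illustration; the next-state function N is stored as a dictionary keyed by (state, symbol).

```python
from dataclasses import dataclass


@dataclass
class FSA:
    alphabet: set      # I, the input alphabet
    states: set        # S, the set of states
    initial: str       # s0, the initial state
    accepting: set     # the accepting (final) states
    next_state: dict   # N: S x I -> S, keyed by (state, symbol)


def is_well_formed(a: FSA) -> bool:
    """Check that s0 and the accepting states lie in S and that N is total."""
    if a.initial not in a.states or not a.accepting <= a.states:
        return False
    return all(
        a.next_state.get((s, m)) in a.states
        for s in a.states
        for m in a.alphabet
    )


# Hypothetical example: state "B" means "the last symbol read was 1".
ends_in_one = FSA(
    alphabet={"0", "1"},
    states={"A", "B"},
    initial="A",
    accepting={"B"},
    next_state={("A", "0"): "A", ("A", "1"): "B",
                ("B", "0"): "A", ("B", "1"): "B"},
)
```

The check mirrors the requirement that N assigns a next state to every ordered pair of a current state and a current input.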
FSA: Transition Diagrams • The operation of an FSA is commonly described by a diagram called a (state-)transition diagram. • In a transition diagram, states are represented by circles, and accepting states by double circles. • There is one arrow that points to the initial state and other arrows between states as follows: There is an arrow from state s to state t labeled m (∈I) iff N(s,m)=t.
FSA: Next State and Eventual-State • The next-state table is a tabular representation of the next-state function. In the annotated next-state table, the initial state is indicated by an arrow and the accepting states by double circles. • The eventual-state function of A is the function N*: S×I* → S defined as: for any state s of S and any input string w in I*, N*(s,w) = the state to which A goes if the symbols of w are input into A in sequence starting when A is in state s.
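The eventual-state function N* can be sketched as repeated application of N, one input symbol at a time. The automaton used here (binary strings ending in 1, states "A" and "B") is an assumed example, with N stored as a dictionary that doubles as the next-state table.

```python
# Next-state table N for a hypothetical automaton over {0, 1}:
# state "B" means "the last symbol read was 1".
NEXT_STATE = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "A", ("B", "1"): "B",
}


def eventual_state(next_state, s, w):
    """N*(s, w): feed the symbols of w in sequence, starting in state s."""
    for m in w:
        s = next_state[(s, m)]
    return s


# N*(A, 0101) follows A -> A -> B -> A -> B.
final = eventual_state(NEXT_STATE, "A", "0101")
```

For the null string the loop body never runs, so N*(s, ε) = s.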
How to write a scanner • Hand-coded Finite State Automaton (FSA) or Transition Diagram • How to code an FSA? • EXERCISE: CODE FSA
How to write a scanner Need: • isFinal(state) function • NextState[state,c] table • TokenState[state] table
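A minimal sketch of a table-driven scanner built from these three pieces, for the language of the earlier exercise (« class », « case », and other identifiers made only of letters). Naming each state after the keyword prefix read so far, the token names CLASS, CASE and ID, and raising ValueError on a lexical error are all assumptions made for this illustration.

```python
import string

KEYWORDS = {"class": "CLASS", "case": "CASE"}
LETTERS = string.ascii_letters

# Each state is the keyword prefix read so far; "" is the initial state and
# "id" is the catch-all state for identifiers that are not keyword prefixes.
PREFIXES = {kw[:i] for kw in KEYWORDS for i in range(len(kw) + 1)}
STATES = PREFIXES | {"id"}

# NextState[state, c]: defined for every letter, undefined for anything else.
next_state = {}
for s in STATES:
    for c in LETTERS:
        if s != "id" and s + c in PREFIXES:
            next_state[s, c] = s + c
        else:
            next_state[s, c] = "id"

# TokenState[state]: token type returned when scanning stops in that state.
token_state = {s: KEYWORDS.get(s, "ID") for s in STATES if s != ""}


def is_final(state):
    """isFinal(state): every state except the initial one is accepting."""
    return state in token_state


def scan_token(text, pos=0):
    """Return (token type, lexeme, next position) for the token at pos."""
    state, i = "", pos
    while i < len(text) and (state, text[i]) in next_state:
        state = next_state[state, text[i]]
        i += 1
    if not is_final(state):
        raise ValueError(f"lexical error at position {pos}")
    return token_state[state], text[pos:i], i
```

For example, scan_token("classy") yields an ID rather than CLASS, because the automaton leaves the keyword path on the sixth letter.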
How to write a scanner • Question 1: What about lexical errors?
How to write a scanner • Question 2: What if tokens are not delimited by whitespace? • EXERCISE: add “<”, “<=”, and “=” tokens to language
Returning character to stream Need: • Buffering to read and put back characters
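A small sketch of such a buffer, used to tell “<” from “<=” by reading one character ahead and putting it back when it does not belong to the token. The class name PushbackReader, the method names, and the token names LT, LE and EQ are assumptions for illustration.

```python
class PushbackReader:
    """Character stream that allows characters to be put back."""

    def __init__(self, text):
        self._text = text
        self._pos = 0
        self._pushed = []   # characters returned to the stream, last in first out

    def read(self):
        """Return the next character, or "" at end of input."""
        if self._pushed:
            return self._pushed.pop()
        if self._pos < len(self._text):
            c = self._text[self._pos]
            self._pos += 1
            return c
        return ""

    def unread(self, c):
        """Put a character back; the end-of-input marker is ignored."""
        if c:
            self._pushed.append(c)


def scan_operator(reader):
    """Return "LE", "LT" or "EQ", or None if no operator starts here."""
    c = reader.read()
    if c == "<":
        d = reader.read()        # look one character ahead
        if d == "=":
            return "LE"
        reader.unread(d)         # not part of this token: put it back
        return "LT"
    if c == "=":
        return "EQ"
    reader.unread(c)
    return None
```

With the input "<x", the scanner returns LT and the "x" is still available to the next read.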
How to write a scanner • Question 3: How to indicate whether to consume last char?
How to write a scanner • Question 4: How to make NextState efficient?
How to write a scanner • Question 5: How to optimize for groups of characters? • E.g. groups of letters for identifiers, groups of numbers for numeric constants, groups of characters for string constants.
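One common answer, sketched here under assumptions, is to map each character to a character class first and index the next-state table by class instead of by character, which keeps the table small. The class names, state names and token names below are hypothetical, and only identifiers and numeric constants are handled.

```python
import string


def char_class(c):
    """Map a character to its class; the table is indexed by class."""
    if c in string.ascii_letters:
        return "letter"
    if c in string.digits:
        return "digit"
    return "other"


# One row per (state, class) instead of one per (state, character).
NEXT_STATE = {
    ("start", "letter"): "in_id",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
    ("start", "digit"): "in_num",
    ("in_num", "digit"): "in_num",
}
TOKEN_STATE = {"in_id": "ID", "in_num": "NUM"}


def scan(text):
    """Split text into (type, lexeme) pairs, skipping whitespace."""
    tokens, i = [], 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        state, start = "start", i
        while i < len(text) and (state, char_class(text[i])) in NEXT_STATE:
            state = NEXT_STATE[state, char_class(text[i])]
            i += 1
        if state not in TOKEN_STATE:
            raise ValueError(f"lexical error at position {start}")
        tokens.append((TOKEN_STATE[state], text[start:i]))
    return tokens
```

The table has 5 entries here, where a table indexed by individual letters and digits would need one entry per character in each state.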
How to write a scanner • Question 6: Who is going to create such FSAs and associated tables?
Kleene’s Theorem • A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language.
Formal Languages – Alphabets and Strings • An alphabet Σ is a finite set of characters (or symbols). • A word, or sequence, or string over Σ is any group of 0 or more consecutive characters of Σ. • The length of a word is the number of characters in the word. • The null string is the string of length 0. It is denoted ε or λ. • A string of length n is really an ordered n-tuple of characters written without parentheses or commas. • Given two strings x and y over Σ, the concatenation of x and y is the string xy obtained by putting all the characters of y right after x.
Formal Languages – Languages over alphabet Let Σ be an alphabet. A formal language over Σ is a set of strings over Σ. • ∅ is the empty language (over Σ) • Σ^n = {all strings over Σ that have length n} where n ∈ ℕ • Σ+ = the positive closure of Σ = {all strings over Σ that have length ≥ 1} • Σ* = the Kleene closure of Σ = {all strings over Σ}
Formal Languages – Operations on Languages Let Σ be an alphabet. Let L and L′ be two languages defined over Σ. The following operations define new languages over Σ: • The concatenation of L and L′, denoted LL′, is LL′ = {xy | x∈L ∧ y∈L′} • The union of L and L′, denoted L∪L′, is L∪L′ = {x | x∈L ∨ x∈L′} • The Kleene closure of L, denoted L*, is L* = {x | x is a concatenation of any finite number of strings in L}. Note that ε∈L*.
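For finite languages these operations can be tried out directly on Python sets of strings, as in the sketch below. Because L* is infinite whenever L contains a non-empty string, the closure is only enumerated up to a maximum length; the function names and that bound are assumptions of this illustration.

```python
def concat(l1, l2):
    """LL' = {xy | x in L and y in L'}."""
    return {x + y for x in l1 for y in l2}


def union(l1, l2):
    """L u L' = {x | x in L or x in L'}."""
    return l1 | l2


def kleene_up_to(l, max_len):
    """The strings of L* whose length is at most max_len."""
    result = {""}        # the null string is always in L*
    frontier = {""}
    while frontier:
        # Extend the newest strings by one more string of L.
        frontier = {
            x + y for x in frontier for y in l if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


L1 = {"a", "b"}
L2 = {"c"}
```

With these definitions, concat(L1, L2) is {"ac", "bc"} and kleene_up_to({"ab"}, 4) is {"", "ab", "abab"}.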
Regular Expressions - Definition Let Σ be an alphabet. The following are regular expressions (r.e.) over Σ: • I. BASE: ε and each individual symbol of Σ are regular expressions. • II. RECURSION: if r and s are regular expressions over Σ, then the following are also regular expressions over Σ: • (rs) the concatenation of r and s • (r | s) r or s • (r*) the Kleene closure of r • III.RESTRICTION: The only regular expressions over Σ are the ones defined by I and II above.
Regular Expressions – Operator Precedence The order of precedence of r.e. operators is, from highest to lowest: • Highest: () • * • concatenation • Lowest: |
REs – Languages defined by REs • Let Σ be an alphabet. Define a function L as follows: • L: {all r.e.'s over Σ} → {all languages over Σ} • L(r) = the language defined by r • I. BASE: L(ε) = {ε}, and ∀a∈Σ, L(a) = {a} • II. RECURSION: If L(r) and L(s) are the languages defined by the regular expressions r and s over Σ, then • L(rs) = L(r)L(s) • L(r|s) = L(r) ∪ L(s) • L(r*) = (L(r))*
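The recursive definition of L translates almost line for line into code. In the sketch below a regular expression is a nested tuple (an assumed encoding, not one from the slides), and the function returns only the strings of L(r) up to a length bound n, since L(r*) is usually infinite.

```python
def lang(r, n):
    """Return the strings of L(r) whose length is at most n."""
    kind = r[0]
    if kind == "eps":                      # L(eps) = {eps}
        return {""}
    if kind == "sym":                      # L(a) = {a}
        return {r[1]} if n >= 1 else set()
    if kind == "cat":                      # L(rs) = L(r)L(s)
        return {
            x + y
            for x in lang(r[1], n)
            for y in lang(r[2], n)
            if len(x + y) <= n
        }
    if kind == "alt":                      # L(r|s) = L(r) u L(s)
        return lang(r[1], n) | lang(r[2], n)
    if kind == "star":                     # L(r*) = (L(r))*
        base = lang(r[1], n)
        result, frontier = {""}, {""}
        while frontier:
            frontier = {
                x + y for x in frontier for y in base if len(x + y) <= n
            } - result
            result |= frontier
        return result
    raise ValueError(f"unknown node kind: {kind}")


# a(b|c)* written as a nested tuple.
A_THEN_BC_STAR = (
    "cat",
    ("sym", "a"),
    ("star", ("alt", ("sym", "b"), ("sym", "c"))),
)
```

For instance, lang(A_THEN_BC_STAR, 2) gives {"a", "ab", "ac"}.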
REs – Languages defined by REs Variations • Some definitions of regular expressions and regular languages define ∅ to be a r.e. with L(∅)=∅
Properties of REs Regular expressions can be simplified by applying the following properties: For any regular expressions r, s, t,
REs – Notational Shorthands Here are some frequent constructs which have their own notation: • (r)+ means one or more instances of r. L((r)+) = (L(r))+ • (r)? means 0 or 1 instances of r, i.e. (r)? = r|ε, so L((r)?) = L(r|ε) = L(r) ∪ L(ε) = L(r) ∪ {ε} • Character classes: • [abc] = a|b|c • [a-z] = a|b|…|z
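These equivalences can be spot-checked with Python's re module, whose pattern syntax is richer than the formal definition but includes all of these shorthands. The helper below is an assumed test harness: it compares two patterns on every string over a small alphabet up to a fixed length, which gives evidence of equivalence but is not a proof.

```python
import re
from itertools import product


def same_language(p1, p2, alphabet="abc", max_len=4):
    """Compare two patterns on all strings over alphabet up to max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            w = "".join(chars)
            m1 = re.fullmatch(p1, w) is not None
            m2 = re.fullmatch(p2, w) is not None
            if m1 != m2:
                return False
    return True


plus_ok = same_language("(ab)+", "(ab)(ab)*")     # r+ versus rr*
optional_ok = same_language("(ab)?", "(ab|)")     # r? versus r|eps
class_ok = same_language("[abc]", "a|b|c")        # [abc] versus a|b|c
range_ok = same_language("[a-c]", "a|b|c")        # [a-c] versus a|b|c
```

The empty alternative in "(ab|)" plays the role of ε, which Python patterns have no dedicated symbol for.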
REs – Regular Definitions Regular expressions can be broken down into regular definitions: sequences of expressions of the form • d1 → r1 • … • dn → rn where each di is a distinct name and ri is a regular expression over symbols in Σ ∪ {d1, d2, …, di-1}
REs – Examples The regular expression • identifier = [A-Za-z][A-Za-z0-9]* can be broken down into the regular definitions • letter → [A-Za-z] • digit → [0-9] • identifier → letter (letter | digit)*
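The same breakdown can be mirrored in Python by building the pattern for identifier out of named pieces, each defined only in terms of earlier ones. This is a sketch using Python's re module; the variable and function names are assumptions.

```python
import re

# Regular definitions: each name is built only from earlier names.
letter = "[A-Za-z]"
digit = "[0-9]"
identifier = f"{letter}({letter}|{digit})*"

IDENTIFIER_RE = re.compile(identifier)
DIRECT_RE = re.compile("[A-Za-z][A-Za-z0-9]*")   # the single-expression form


def is_identifier(s):
    """True if the whole string s is an identifier."""
    return IDENTIFIER_RE.fullmatch(s) is not None


samples = ["index2", "getstuff", "2index", "", "A", "get_stuff"]
results = {s: is_identifier(s) for s in samples}
```

Both forms accept "index2" and reject "2index", since an identifier must start with a letter.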
REs and Scanning Why regular expressions for scanning?
Regular Languages and FSA • Let A be an FSA with set of input symbols I. Let w be a string of I*. Then w is accepted by A iff N*(s0,w) is an accepting state. • The language accepted by A, denoted L(A), is the set of all strings that are accepted by A. L(A) = {w∈I* | N*(s0,w) is an accepting state of A} • Kleene’s Theorem: A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language. • Theorem 1: Some languages are not regular. • Theorem 2: The set of regular languages over an alphabet I is closed under the complement, union and intersection operators.
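Acceptance, and two of the closure properties of Theorem 2, can be sketched on automata directly. The complement swaps accepting and non-accepting states, and the intersection uses the standard product construction, in which both automata run in lockstep. The dictionary layout and the two example automata are assumptions for illustration, and both constructions rely on N being defined for every state and symbol.

```python
def accepts(fsa, w):
    """w is accepted iff N*(s0, w) is an accepting state."""
    state = fsa["initial"]
    for m in w:
        state = fsa["next_state"][state, m]
    return state in fsa["accepting"]


def complement(fsa):
    """Same automaton with accepting and non-accepting states swapped."""
    return {**fsa, "accepting": fsa["states"] - fsa["accepting"]}


def intersection(a, b):
    """Product automaton over a shared alphabet."""
    states = {(p, q) for p in a["states"] for q in b["states"]}
    return {
        "alphabet": a["alphabet"],
        "states": states,
        "initial": (a["initial"], b["initial"]),
        "accepting": {(p, q) for (p, q) in states
                      if p in a["accepting"] and q in b["accepting"]},
        "next_state": {((p, q), m): (a["next_state"][p, m],
                                     b["next_state"][q, m])
                       for (p, q) in states for m in a["alphabet"]},
    }


# Hypothetical examples over {0, 1}.
even_ones = {
    "alphabet": {"0", "1"}, "states": {"even", "odd"},
    "initial": "even", "accepting": {"even"},
    "next_state": {("even", "0"): "even", ("even", "1"): "odd",
                   ("odd", "0"): "odd", ("odd", "1"): "even"},
}
ends_in_zero = {
    "alphabet": {"0", "1"}, "states": {"no", "yes"},
    "initial": "no", "accepting": {"yes"},
    "next_state": {("no", "0"): "yes", ("no", "1"): "no",
                   ("yes", "0"): "yes", ("yes", "1"): "no"},
}
both = intersection(even_ones, ends_in_zero)
```

Union can be obtained the same way by accepting a pair when either component is accepting.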
How to write a scanner • Question 7: Practically speaking, how do we translate REs into FSAs?
RE → Transition Diagram • EXAMPLES