SCANNING

Scanning, i.e. lexical analysis, i.e. linear analysis. Interaction of parser and scanner: the source program in HLL is a stream of characters read from left to right. The scanner reads characters from that stream and groups them into tokens, which are returned to the parser.

cranec

Presentation Transcript


  1. SCANNING I.E. LEXICAL ANALYSIS I.E. LINEAR ANALYSIS

  2. Interaction of Parser and Scanner • The source program in HLL is a stream of characters read from left to right • Scanner reads characters from that stream and groups them into tokens returned to the parser. • Whitespace (blanks, tabs, returns) and comments are eliminated

  3. Benefits of Modularity (Scanning is a separate module) • Simplicity: separate input handling from the rest of the program, which then deals only with tokens and not with characters.

  4. Benefits of Modularity (Scanning is a separate module) • Simplicity • Efficiency: • Scanning is I/O intensive • Use buffering to improve efficiency

  5. Benefits of Modularity (Scanning is a separate module) • Simplicity • Efficiency • Portability: (historical) different machines might have different keyboards: e.g. Pascal: @ and up-arrow

  6. Benefits of Modularity (Scanning is a separate module) • Internationalisation: e.g. • ALGOL 68 in many languages: Russian, German, French, Bulgarian, Japanese • LOGO (educational): English, French, Italian • Mama (educational): English, Hebrew, Yiddish, Chinese • M4 macro processor: English, German • Scratch (educational): 40+ languages • Perl: Klingon (different grammar as well)

  7. What are tokens? • The most basic meaningful objects in a computer program: • Sequences of characters which form the low-level constructs of the HLL (e.g. variable names, keywords, labels, operators) • Examples: A[12,index2] = getstuff(23); B[66,88] = getmorestuff(11);

  8. Token Structure • Early on, the parser only needs to recognize the validity of the structure; the value of identifiers does not matter. • Tokens have 2 fields: type (compulsory) and value (optional) • Example • Vocabulary: • the type is often called “token” • the string being scanned (instance) is called a “lexeme”
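The two-field token described on this slide can be sketched as a small record type. The class name Token and the type names (ID, NUM, LBRACKET, ...) are assumptions made for illustration; the original example on the slide is not part of the transcript.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    type: str                    # compulsory: the token class, e.g. "ID" or "NUM"
    value: Optional[str] = None  # optional: the lexeme, kept only when it matters later


# Hypothetical token stream for the lexemes  B [ 66 , 88 ]
tokens = [
    Token("ID", "B"),
    Token("LBRACKET"),
    Token("NUM", "66"),
    Token("COMMA"),
    Token("NUM", "88"),
    Token("RBRACKET"),
]

# A parser checking structure only needs the sequence of types.
token_types = [t.type for t in tokens]
```

Punctuation tokens carry no value because their type already says everything the parser needs.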

  9. How to write a scanner • Brute-force scanning (e.g. section that handles identifiers starting with ‘c’)

  10. How to write a scanner • Brute-force scanning (e.g. section that handles identifiers starting with ‘c’) • Problems?
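The code shown on this slide is not in the transcript. As an assumed reconstruction, a brute-force section for words starting with 'c' might look like the sketch below, where the function name classify_c_word and the token names CASE, CLASS and ID are hypothetical.

```python
def classify_c_word(word: str) -> str:
    """Classify a word starting with 'c' by testing one character at a time."""
    if not word.startswith("c"):
        raise ValueError("this section only handles words starting with 'c'")
    if word[1:2] == "a":
        if word[2:3] == "s":
            if word[3:4] == "e":
                if len(word) == 4:
                    return "CASE"
    elif word[1:2] == "l":
        if word[2:3] == "a":
            if word[3:4] == "s":
                if word[4:5] == "s":
                    if len(word) == 5:
                        return "CLASS"
    # Anything else that starts with 'c' is an ordinary identifier.
    return "ID"
```

One visible drawback of this style is that each new keyword adds another hand-written chain of nested tests, and a similar section is needed for every other starting letter.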

  11. How to write a scanner • Hand-coded Finite State Automaton (FSA) or Transition Diagram • EXERCISE: Draw FSA for example above • Assume FSA recognizes « class », « case » and other identifiers which only contain letters.

  12. Finite State Automata • A Finite State Automaton (FSA) A consists of 5 objects • A set I called the input alphabet, of input symbols • A set S of states the automaton can be in; • A designated state s0 called the initial state; • A designated set of states called the set of accepting states, or final states; • A next-state function N: S×I → S that associates a “next-state” to each ordered pair consisting of a “current state” and “current input”. For each state s in S and input symbol m in I, N(s,m) is called the state to which A goes if m is input to A when A is in state s.

  13. FSA: Transition Diagrams • The operation of an FSA is commonly described by a diagram called a (state-)transition diagram. • In a transition diagram, states are represented by circles, and accepting states by double circles. • There is one arrow that points to the initial state and other arrows between states as follows: There is an arrow from state s to state t labeled m (∈I) iff N(s,m)=t.

  14. FSA: Next State and Eventual-State • The next-state table is a tabular representation of the next-state function. In the annotated next-state table, the initial state is indicated by an arrow and the accepting states by double circles. • The eventual-state function of A is the function N*: S×I* → S defined as: for any state s of S and any input string w in I*, N*(s,w) = the state to which A goes if the symbols of w are input into A in sequence starting when A is in state s.
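The next-state table and the eventual-state function N* can be written down directly. The sketch below uses a small automaton chosen for illustration (strings over {0, 1} with an even number of 1s); the automaton and the names NEXT_STATE, N, N_star and accepts are assumptions, not taken from the slides.

```python
from functools import reduce

# Hypothetical FSA over I = {"0", "1"}: accepts strings with an even number of 1s.
INITIAL = "even"
ACCEPTING = {"even"}

# Next-state table: one entry per (current state, input symbol) pair.
NEXT_STATE = {
    ("even", "0"): "even", ("even", "1"): "odd",
    ("odd", "0"): "odd",   ("odd", "1"): "even",
}


def N(s, m):
    """Next-state function N: S x I -> S, read off the table."""
    return NEXT_STATE[(s, m)]


def N_star(s, w):
    """Eventual-state function N*: feed the symbols of w in sequence, starting in s."""
    return reduce(N, w, s)


def accepts(w):
    """A string is accepted when N*(s0, w) is an accepting state."""
    return N_star(INITIAL, w) in ACCEPTING
```

For the null string, N_star returns the starting state unchanged, which matches the definition of N* on a string of length 0.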

  15. How to write a scanner • Hand-coded Finite State Automaton (FSA) or Transition Diagram • How to code an FSA? • EXERCISE: CODE FSA

  16. How to write a scanner Need: • isFinal(state) function • NextState[state,c] table • TokenState[state] table
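A minimal table-driven sketch using the three pieces listed on this slide follows. It assumes the language of the earlier exercise (the keywords class and case, plus identifiers made only of lowercase letters); the state numbering and the names NEXT_STATE, TOKEN_STATE, is_final and scan_word are illustrative choices.

```python
import string

START, ID_STATE = 0, 9

# NextState[state, c]: by default every letter leads to the generic identifier state.
NEXT_STATE = {
    (state, c): ID_STATE
    for state in range(10)
    for c in string.ascii_lowercase
}
# Keyword paths: c-a-s-e ends in state 4, c-l-a-s-s ends in state 8.
for src, c, dst in [
    (0, "c", 1), (1, "a", 2), (2, "s", 3), (3, "e", 4),
    (1, "l", 5), (5, "a", 6), (6, "s", 7), (7, "s", 8),
]:
    NEXT_STATE[(src, c)] = dst

# TokenState[state]: token type reported when scanning stops in that state.
TOKEN_STATE = {state: "ID" for state in range(1, 10)}
TOKEN_STATE[4] = "CASE"
TOKEN_STATE[8] = "CLASS"


def is_final(state):
    return state in TOKEN_STATE


def scan_word(word):
    """Run the FSA over one word and return its token type."""
    state = START
    for c in word:
        state = NEXT_STATE.get((state, c))
        if state is None:
            raise ValueError(f"lexical error: unexpected character {c!r}")
    if not is_final(state):
        raise ValueError("lexical error: empty word")
    return TOKEN_STATE[state]
```

Adding a keyword in this style means adding rows to the tables rather than writing new control flow.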

  17. How to write a scanner • Question 1: What about lexical errors?

  18. How to write a scanner • Question 2: What if tokens are not delimited by whitespace? • EXERCISE: add “<”, “<=”, and “=” tokens to language
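One possible approach to the exercise is to look one character ahead after "<" and prefer the longer token. The sketch below is hand-coded rather than table-driven, and the token names LT, LE, EQ and ID as well as the function name scan are assumptions.

```python
def scan(text):
    """Split text into (type, lexeme) pairs; whitespace is optional between tokens."""
    tokens = []
    i = 0
    while i < len(text):
        c = text[i]
        if c.isspace():
            i += 1                      # whitespace is skipped, never returned
        elif c == "<":
            if text[i + 1:i + 2] == "=":
                tokens.append(("LE", "<="))   # longest match wins
                i += 2
            else:
                tokens.append(("LT", "<"))    # next char belongs to a later token
                i += 1
        elif c == "=":
            tokens.append(("EQ", "="))
            i += 1
        elif c.isalpha():
            j = i
            while j < len(text) and text[j].isalpha():
                j += 1
            tokens.append(("ID", text[i:j]))
            i = j
        else:
            raise ValueError(f"lexical error at position {i}: {c!r}")
    return tokens
```

The identifier branch stops at the first non-letter without consuming it, which is what lets "a<=b" be split correctly with no whitespace at all.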

  19. Returning character to stream Need: • Buffering to read and put back characters
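The buffering needed to put a character back can be sketched as a thin wrapper around any text stream. The class name PushbackReader and its methods read_char and unread are hypothetical names chosen for this example.

```python
import io


class PushbackReader:
    """Reads one character at a time and lets the scanner return characters."""

    def __init__(self, stream):
        self._stream = stream
        self._pushed = []            # characters put back, most recent last

    def read_char(self):
        if self._pushed:
            return self._pushed.pop()
        return self._stream.read(1)  # returns "" at end of input

    def unread(self, c):
        if c:                        # putting back end-of-input is a no-op
            self._pushed.append(c)


# Usage: peek past "<" to decide between "<" and "<=".
reader = PushbackReader(io.StringIO("<x"))
first = reader.read_char()           # "<"
lookahead = reader.read_char()       # "x", not part of the "<" token
reader.unread(lookahead)             # return it to the stream
next_char = reader.read_char()       # "x" again, now read as the start of the next token
```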

  20. How to write a scanner • Question 3: How to indicate whether to consume last char?

  21. How to write a scanner • Question 4: How to make NextState efficient?

  22. How to write a scanner • Question 5: How to optimize for groups of characters? • E.g. groups of letters for identifiers, groups of numbers for numeric constants, groups of characters for string constants.
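One common way to handle groups of characters is to map each character to a class first and index the next-state table by class instead of by character. The sketch below assumes only two token kinds (identifiers and numeric constants); all names in it are illustrative.

```python
def char_class(c):
    """Collapse individual characters into a few classes."""
    if c.isalpha():
        return "letter"
    if c.isdigit():
        return "digit"
    return "other"


# The table has one column per class rather than one per character.
NEXT_STATE = {
    ("start", "letter"): "in_id",
    ("start", "digit"): "in_num",
    ("in_id", "letter"): "in_id",
    ("in_id", "digit"): "in_id",
    ("in_num", "digit"): "in_num",
}
TOKEN_STATE = {"in_id": "ID", "in_num": "NUM"}


def classify(lexeme):
    """Return the token type of a lexeme, or "ERROR" if no transition applies."""
    state = "start"
    for c in lexeme:
        state = NEXT_STATE.get((state, char_class(c)))
        if state is None:
            return "ERROR"
    return TOKEN_STATE.get(state, "ERROR")
```

With three classes and three states the table has at most nine entries, however many letters and digits the character set contains.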

  23. How to write a scanner • Question 6: Who is going to create such FSAs and associated tables?

  24. Kleene’s Theorem • A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language.

  25. Formal Languages – Alphabets and Strings • An alphabet Σ is a finite set of characters (or symbols). • A word, or sequence, or string over Σ is any group of 0 or more consecutive characters of Σ. • The length of a word is the number of characters in the word. • The null string is the string of length 0. It is denoted ε or λ. • A string of length n is really an ordered n-tuple of characters written without parentheses or commas. • Given two strings x and y over Σ, the concatenation of x and y is the string xy obtained by putting all the characters of y right after x.

  26. Formal Languages – Languages over alphabet Let Σ be an alphabet. A formal language over Σ is a set of strings over Σ. • ∅ is the empty language (over Σ) • Σ^n = {all strings over Σ that have length n} where n∈N • Σ+ = the positive closure of Σ = {all strings over Σ that have length ≥ 1} • Σ* = the Kleene closure of Σ = {all strings over Σ}

  27. Formal Languages – Operation on Languages Let Σ be an alphabet. Let L and L′ be two languages defined over Σ. The following operations define new languages over Σ: • The concatenation of L and L′, denoted LL′, is LL′ = {xy | x∈L ∧ y∈L′} • The union of L and L′, denoted L∪L′, is L∪L′ = {x | x∈L ∨ x∈L′} • The Kleene closure of L, denoted L*, is L* = {x | x is a concatenation of any finite number of strings in L}. Note that ε∈L*.
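For finite languages these operations can be tried out with Python sets of strings. Since L* is infinite whenever L contains a non-empty string, the sketch only computes a finite slice of it; the helper names concat and kleene_up_to are assumptions.

```python
def concat(L1, L2):
    """LL' = {xy | x in L and y in L'}."""
    return {x + y for x in L1 for y in L2}


def kleene_up_to(L, k):
    """Finite slice of L*: concatenations of at most k strings of L."""
    result = {""}        # zero strings concatenated gives the null string
    layer = {""}
    for _ in range(k):
        layer = concat(layer, L)
        result |= layer
    return result


L1 = {"a", "ab"}
L2 = {"b", ""}

concatenation = concat(L1, L2)   # {"a", "ab", "abb"}
union = L1 | L2                  # {"", "a", "ab", "b"}
closure_slice = kleene_up_to({"a"}, 3)   # {"", "a", "aa", "aaa"}
```

Note that "ab" arises twice in the concatenation ("a"+"b" and "ab"+""), but a language is a set, so it appears once.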

  28. Regular Expressions - Definition Let Σ be an alphabet. The following are regular expressions (r.e.) over Σ: • I. BASE: ε and each individual symbol of Σ are regular expressions. • II. RECURSION: if r and s are regular expressions over Σ, then the following are also regular expressions over Σ: • (rs) the concatenation of r and s • (r | s) r or s • (r*) the Kleene closure of r • III. RESTRICTION: The only regular expressions over Σ are the ones defined by I and II above.

  29. Regular Expressions – Operator Precedence The order of precedence of r.e. operators is, from highest to lowest: • Highest: () • * • concatenation • Lowest: |

  30. REs – Languages defined by REs • Let Σ be an alphabet. Define a function L as follows: • L: {all r.e.'s over Σ}→{all languages over Σ} • L(r) = the language defined by r • I. BASE: L(ε) = {ε} and, for every a∈Σ, L(a) = {a} • II. RECURSION: If L(r) and L(s) are the languages defined by the regular expressions r and s over Σ, then • L(rs) = L(r)L(s) • L(r|s) = L(r) ∪ L(s) • L(r*) = (L(r))*

  31. REs – Languages defined by REs Variations • Some definitions of regular expressions and regular languages define ∅ to be a r.e. with L(∅)=∅

  32. Properties of REs Regular expressions can be simplified by applying the following properties: For any regular expressions r, s, t,

  33. REs – Notational Shorthands Here are some frequent constructs which have their own notation: • (r)+ means one or more instances of r, i.e. (r)+ = rr*. L((r)+) = (L(r))+ • (r)? means 0 or 1 instances of r, i.e. (r)? = r|ε. L((r)?) = L(r|ε) = L(r) ∪ L(ε) = L(r) ∪ {ε} • Character classes: • [abc] = a|b|c • [a-z] = a|b|…|z
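These shorthands can be checked on small cases with Python's re module, whose pattern syntax is a superset of the regular expressions defined in these slides. The sketch compares the languages of two patterns restricted to strings up to a given length, so it is a spot check and not a proof; language_up_to is a hypothetical helper name.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """All strings over alphabet of length <= max_len that the pattern matches entirely."""
    compiled = re.compile(pattern)
    words = (
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
    )
    return {w for w in words if compiled.fullmatch(w)}


# (r)+ versus rr*, with r = ab
plus_same = language_up_to("(ab)+", "ab", 4) == language_up_to("ab(ab)*", "ab", 4)

# (r)? versus r|ε, where the empty alternative plays the role of ε
optional_same = language_up_to("(ab)?", "ab", 4) == language_up_to("(ab|)", "ab", 4)

# [abc] versus a|b|c
class_same = language_up_to("[abc]", "abc", 2) == language_up_to("a|b|c", "abc", 2)
```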

  34. REs – Regular Definitions Regular expressions can be broken down into regular definitions: sequences of expressions of the form • d1 → r1 • … • dn → rn where each di is a distinct name and ri is a regular expression over symbols in Σ ∪ {d1, d2, …, di-1}

  35. REs – Examples Regular expression • Identifier = [A-Za-z][A-Za-z0-9]* Can be broken down into the regular definitions • letter → [A-Za-z] • digit → [0-9] • identifier → letter (letter | digit)*
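The regular definitions above can be mirrored in Python by composing pattern strings, each definition using only the names defined before it. This is a sketch using Python's re module; the variable and function names are illustrative.

```python
import re

# Regular definitions, each built only from earlier ones.
letter = "[A-Za-z]"
digit = "[0-9]"
identifier = f"{letter}({letter}|{digit})*"

IDENTIFIER_RE = re.compile(identifier)


def is_identifier(s):
    """True when the whole string s is in the language of 'identifier'."""
    return IDENTIFIER_RE.fullmatch(s) is not None
```

fullmatch is used rather than match so that the entire string, not just a prefix, has to belong to the language.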

  36. REs and Scanning Why regular expressions for scanning?

  37. Regular Languages and FSA • Let A be a FSA with set of input symbols I. Let w be a string of I*. Then w is accepted by A iff N*(s0,w) is an accepting state. • The language accepted by A, denoted L(A), is the set of all strings that are accepted by A. L(A) = {w∈I* | N*(s0,w) is an accepting state of A} • Kleene’s Theorem: A language is accepted by an FSA iff it can be described by a regular expression. Such a language is called a regular language. • Theorem 1: Some languages are not regular. • Theorem 2: The set of regular languages over an alphabet I is closed under the complement, union and intersection operators.
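The closure properties in Theorem 2 have standard constructive arguments: swapping accepting and non-accepting states gives the complement, and running two automata in parallel (the product construction) gives the intersection. The sketch below illustrates both, assuming total next-state functions over a shared alphabet; the FSA record and the two example automata are assumptions made for illustration.

```python
from typing import NamedTuple


class FSA(NamedTuple):
    states: frozenset
    alphabet: frozenset
    start: object
    accepting: frozenset
    delta: dict          # total next-state table: (state, symbol) -> state


def accepts(a, w):
    s = a.start
    for m in w:
        s = a.delta[(s, m)]
    return s in a.accepting


def complement(a):
    """Same automaton with accepting and non-accepting states swapped."""
    return a._replace(accepting=a.states - a.accepting)


def intersection(a, b):
    """Product automaton: tracks the states of a and b side by side."""
    states = frozenset((p, q) for p in a.states for q in b.states)
    delta = {
        ((p, q), m): (a.delta[(p, m)], b.delta[(q, m)])
        for (p, q) in states
        for m in a.alphabet
    }
    accepting = frozenset(
        (p, q) for (p, q) in states if p in a.accepting and q in b.accepting
    )
    return FSA(states, a.alphabet, (a.start, b.start), accepting, delta)


BITS = frozenset("01")
even_ones = FSA(frozenset("eo"), BITS, "e", frozenset("e"),
                {("e", "0"): "e", ("e", "1"): "o", ("o", "0"): "o", ("o", "1"): "e"})
ends_with_0 = FSA(frozenset("ny"), BITS, "n", frozenset("y"),
                  {("n", "0"): "y", ("n", "1"): "n", ("y", "0"): "y", ("y", "1"): "n"})

both = intersection(even_ones, ends_with_0)
odd_ones = complement(even_ones)
```

Union then follows from the other two by De Morgan's law, or directly by accepting product states where either component accepts.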

  38. How to write a scanner • Question 7: Practically speaking: how to translate re’s into FSAs?

  39. RE → Transition Diagram • EXAMPLES

  40. START HERE

  41. STOP HERE
