
Lexical Analysis and Scanning



  1. Lexical Analysis and Scanning Compiler Construction Lecture 2 Spring 2001 Robert Dewar

  2. The Input • Read string input • Might be sequence of characters (Unix) • Might be sequence of lines (VMS) • Character set • ASCII • ISO Latin-1 • ISO 10646 (16-bit = Unicode) • Others (EBCDIC, JIS, etc)

  3. The Output • A series of tokens • Punctuation ( ) ; , [ ] • Operators + - ** := • Keywords begin end if • Identifiers Square_Root • String literals “hello this is a string” • Character literals ‘x’ • Numeric literals 123 4_5.23e+2 16#ac#

  4. Free form vs Fixed form • Free form languages • White space does not matter • Tabs, spaces, new lines, carriage returns • Only the ordering of tokens is important • Fixed format languages • Layout is critical • Fortran, label in cols 1-6 • COBOL, area A B • Lexical analyzer must worry about layout

  5. Punctuation • Typically individual special characters • Such as + - • Lexical analyzer does not know : from ; • Sometimes double characters • E.g. (* treated as a kind of bracket • Returned just as identity of token • And perhaps location • For error message and debugging purposes

  6. Operators • Like punctuation • No real difference for lexical analyzer • Typically single or double special chars • Operators + - • Operations := • Returned just as identity of token • And perhaps location

  7. Keywords • Reserved identifiers • E.g. BEGIN END in Pascal, if in C • Maybe distinguished from identifiers • E.g. bold mode (keyword) vs mode (identifier) in Algol-68 • Returned just as token identity • With possible location information • Unreserved keywords (e.g. PL/1) • Handled as identifiers (parser distinguishes)

  8. Identifiers • Rules differ • Length, allowed characters, separators • Need to build table • So that each occurrence of junk1 is recognized as the same identifier • Typical structure: hash table • Lexical analyzer returns token type • And key to table entry • Table entry includes location information

  9. More on Identifier Tables • Most common structure is hash table • With fixed number of headers • Chain according to hash code • Serial search on one chain • Hash code computed from characters • No hash code is perfect! • Avoid any arbitrary limits

  10. String Literals • Text must be stored • Actual characters are important • Not like identifiers • Character set issues • Table needed • Lexical analyzer returns key to table • May or may not be worth hashing

  11. Character Literals • Similar issues to string literals • Lexical Analyzer returns • Token type • Identity of character • Note, cannot assume character set of host machine, may be different

  12. Numeric Literals • Also need a table • Typically record value • E.g. 123 = 0123 = 01_23 (Ada) • But cannot use int for values • Because may have different characteristics • Float stuff much more complex • Denormals, correct rounding • Very delicate stuff
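The Ada equivalence the slide mentions, that 123 = 0123 = 01_23, falls out naturally if underlines are skipped while accumulating the value. A minimal sketch (using `long` only for illustration; as the slide says, a real compiler cannot use a fixed-size int, since the target's numeric characteristics may differ from the host's):

```c
/* Compute the value of a decimal literal, ignoring Ada-style
   underline separators, so 123, 0123 and 01_23 all denote the
   same number.  Illustrative only: real compilers record values
   in arbitrary precision, and floats need correct rounding,
   denormals, etc.                                              */
long literal_value(const char *s)
{
    long v = 0;
    for (; *s; s++) {
        if (*s == '_')
            continue;            /* separator carries no value */
        v = v * 10 + (*s - '0');
    }
    return v;
}
```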

  13. Handling Comments • Comments have no effect on program • Can therefore be eliminated by scanner • But may need to be retrieved by tools • Error detection issues • E.g. unclosed comments • Scanner does not return comments

  14. Case Equivalence • Some languages have case equivalence • Pascal, Ada • Some do not • C, Java • Lexical analyzer ignores case if needed • This_Routine = THIS_RouTine • Error analysis may need exact casing
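One common way to ignore case, sketched here as an assumption rather than any particular compiler's method, is to fold identifiers to a canonical case before hashing and comparing, while keeping the original spelling separately for exact-casing error analysis:

```c
#include <ctype.h>
#include <string.h>

/* Fold an identifier to lower case in place, so that in a
   case-equivalent language (Pascal, Ada) This_Routine and
   THIS_RouTine compare and hash as the same name.           */
void fold_case(char *s)
{
    for (; *s; s++)
        *s = (char)tolower((unsigned char)*s);
}
```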

  15. Issues to Address • Speed • Lexical analysis can take a lot of time • Minimize processing per character • I/O is also an issue (read large blocks) • We compile frequently • Compilation time is important • Especially during development

  16. General Approach • Define set of token codes • An enumeration type • A series of integer definitions • These are just codes (no semantics) • Some codes associated with data • E.g. key for identifier table • May be useful to build tree node • For identifiers, literals etc
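The token codes described above can be sketched as a C enumeration; the specific names (`TOK_IDENT`, `struct token`, etc.) are made up for illustration:

```c
/* Token codes are just codes with no semantics.  Some codes are
   associated with data, e.g. a key into the identifier table.   */
enum token_code {
    TOK_LPAREN, TOK_RPAREN, TOK_SEMI, TOK_COMMA,   /* punctuation */
    TOK_PLUS, TOK_MINUS, TOK_EXPON, TOK_ASSIGN,    /* operators   */
    TOK_BEGIN, TOK_END, TOK_IF,                    /* keywords    */
    TOK_IDENT,      /* carries key into identifier table */
    TOK_STRING,     /* carries key into string table     */
    TOK_NUMBER,     /* carries key into literal table    */
    TOK_EOF
};

struct token {
    enum token_code code;
    int key;          /* table key, if this code carries data   */
    int line, col;    /* location for errors and debugging      */
};
```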

  17. Interface to Lexical Analyzer • Convert entire file to a file of tokens • Lexical analyzer is separate phase • Parser calls lexical analyzer • Get next token • This approach avoids extra I/O • Parser builds tree as we go along

  18. Implementation of Scanner • Given the input text • Generate the required tokens • Or provide token by token on demand • Before we describe implementations • We take this short break • To describe relevant formalisms

  19. Relevant Formalisms • Type 3 (Regular) Grammars • Regular Expressions • Finite State Machines

  20. Regular Grammars • Regular grammars • Non-terminals (arbitrary names) • Terminals (characters) • Two forms of rules • Non-terminal ::= terminal • Non-terminal ::= terminal Non-terminal • One non-terminal is the start symbol • Regular (type 3) grammars cannot count • No concept of matching nested parens

  21. Regular Grammars • Regular grammars • E.g. grammar of reals with no exponent • REAL ::= 0 REAL1 (repeat for 1 .. 9) • REAL1 ::= 0 REAL1 (repeat for 1 .. 9) • REAL1 ::= . INTEGER • INTEGER ::= 0 INTEGER (repeat for 1 .. 9) • INTEGER ::= 0 (repeat for 1 .. 9) • Start symbol is REAL

  22. Regular Expressions • Regular expressions (RE) defined by • Any terminal character is an RE • Alternation RE | RE • Concatenation RE1 RE2 • Repetition RE* (zero or more RE’s) • Language of RE’s = type 3 grammars • Regular expressions are more convenient

  23. Specifying RE’s in Unix Tools • Single characters a b c d \x • Alternation [bcd] [b-z] ab|cd • Match any character . • Match sequence of characters x* y+ • Concatenation abc[d-q] • Optional [0-9]+(\.[0-9]*)?

  24. Finite State Machines • Languages and Automata • A language is a set of strings • An automaton is a machine • That determines if a given string is in the language or not. • FSM’s are automata that recognize regular languages (regular expressions)

  25. Definitions of FSM • A set of labeled states • Directed arcs labeled with character • A state may be marked as terminal • Transition from state S1 to S2 • If and only if arc from S1 to S2 • Labeled with next character (which is eaten) • Recognized if ends up in terminal state • One state is distinguished start state

  26. Building FSM from Grammar • One state for each non-terminal • A rule of the form • Nont1 ::= terminal • Generates transition from S1 to final state • A rule of the form • Nont1 ::= terminal Nont2 • Generates transition from S1 to S2

  27. Building FSM’s from RE’s • Every RE corresponds to a grammar • For all regular expressions • A natural translation to FSM exists • We will not give details of algorithm here

  28. Non-Deterministic FSM • A non-deterministic FSM • Has at least one state • With two arcs to two separate states • Labeled with the same character • Which way to go? • Implementation requires backtracking • Nasty 

  29. Deterministic FSM • For all states S • For all characters C • There is either ONE or NO arcs • From state S • Labeled with character C • Much easier to implement • No backtracking 

  30. Dealing with ND FSM • Construction naturally leads to ND FSM • For example, consider FSM for • [0-9]+ | [0-9]+\.[0-9]+ • (integer or real) • We will naturally get a start state • With two sets of 0-9 branches • And thus non-deterministic

  31. Converting to Deterministic • There is an algorithm for converting • From any ND FSM • To an equivalent deterministic FSM • Algorithm is in the text book • Example (given in terms of RE’s) • [0-9]+ | [0-9]+\.[0-9]+ • [0-9]+(\.[0-9]+)?

  32. Implementing the Scanner • Three methods • Completely informal, just write code • Define tokens using regular expressions • Convert RE’s to ND finite state machine • Convert ND FSM to deterministic FSM • Program the FSM • Use an automated program • To achieve above three steps

  33. Ad Hoc Code (forget FSM’s) • Write normal hand code • A procedure called Scan • Normal coding techniques • Basically scan over white space and comments till non-blank character found. • Base subsequent processing on character • E.g. colon may be : or := • / may be operator or start of comment • Return token found • Write aggressive efficient code

  34. Using FSM Formalisms • Start with regular grammar or RE • Typically found in the language standard • For example, for Ada: • Chapter 2. Lexical Elements • Digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • decimal-literal ::= integer [.integer][exponent] • integer ::= digit {[underline] digit} • exponent ::= E [+] integer | E - integer

  35. Using FSM formalisms, cont • Given RE’s or grammar • Convert to finite state machine • Convert ND FSM to deterministic FSM • Write a program to recognize • Using the deterministic FSM

  36. Implementing FSM (Method 1) • Each state is code of the form:

<<state1>> case Next_Character is
              when 'a' => goto state3;
              when 'b' => goto state1;
              when others => End_of_token_processing;
           end case;
<<state2>> …

  37. Implementing FSM (Method 2) • There is a variable called State

loop
   case State is
      when state1 =>
         case Next_Character is
            when 'a' => State := state3;
            when 'b' => State := state1;
            when others => End_token_processing;
         end case;
      when state2 =>
         …
   end case;
end loop;

  38. Implementing FSM (Method 3)

T : array (State, Character) of State;

while More_Input loop
   Curstate := T (Curstate, Next_Char);
   if Curstate = Error_State then … end if;
end loop;
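Method 3 can be fleshed out in C for the deterministic FSM of [0-9]+(\.[0-9]+)? from slide 31; state and function names here are illustrative. The inner loop is exactly one table lookup per character:

```c
/* Table-driven FSM for [0-9]+(\.[0-9]+)? : one row per state,
   one column per input character.                              */
enum state { S_START, S_INT, S_DOT, S_FRAC, S_ERR, N_STATES };

static enum state T[N_STATES][256];

static void build_table(void)
{
    int s, c;
    for (s = 0; s < N_STATES; s++)       /* default: error state */
        for (c = 0; c < 256; c++)
            T[s][c] = S_ERR;
    for (c = '0'; c <= '9'; c++) {
        T[S_START][c] = S_INT;           /* first digit          */
        T[S_INT][c]   = S_INT;           /* more integer digits  */
        T[S_DOT][c]   = S_FRAC;          /* first fraction digit */
        T[S_FRAC][c]  = S_FRAC;          /* more fraction digits */
    }
    T[S_INT]['.'] = S_DOT;               /* decimal point        */
}

/* Recognized iff we end up in a terminal state (S_INT, S_FRAC). */
int recognize(const char *s)
{
    enum state st = S_START;
    build_table();
    for (; *s; s++)
        st = T[st][(unsigned char)*s];
    return st == S_INT || st == S_FRAC;
}
```

A real scanner would test for `Error_State` inside the loop, as on the slide, so it can stop at the end of the token rather than consuming the whole input.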

  39. Automatic FSM Generation • Our example, FLEX • See home page for manual in HTML • FLEX is given • A set of regular expressions • Actions associated with each RE • It builds a scanner • Which matches RE’s and executes actions

  40. Flex General Format • Input to Flex is a set of rules: • Regexp actions (C statements) • Regexp actions (C statements) • … • Flex scans the longest matching Regexp • And executes the corresponding actions

  41. An Example of a Flex scanner

DIGIT [0-9]
ID    [a-z][a-z0-9]*
%%
{DIGIT}+ {
    printf ("an integer %s (%d)\n", yytext, atoi (yytext));
}
{DIGIT}+"."{DIGIT}* {
    printf ("a float %s (%g)\n", yytext, atof (yytext));
}
if|then|begin|end|procedure|function {
    printf ("a keyword: %s\n", yytext);
}

  42. Flex Example (continued)

{ID}              printf ("an identifier %s\n", yytext);
"+"|"-"|"*"|"/"   { printf ("an operator %s\n", yytext); }
"--".*\n          /* eat Ada style comment */
[ \t\n]+          /* eat white space */
.                 printf ("unrecognized character");
%%

  43. Assembling the flex program

%{
#include <math.h>   /* for atof */
%}
<<flex text we gave goes here>>
%%
main (argc, argv)
int argc;
char **argv;
{
    yyin = fopen (argv[1], "r");
    yylex ();
}

  44. Running flex • flex is a program that is executed • The input is as we have given • The output is a running C program • For Ada fans • Look at aflex (www.adapower.com) • For C++ fans • flex can run in C++ mode • Generates appropriate classes

  45. Choice Between Methods? • Hand written scanners • Typically much faster execution • And pretty easy to write • And easier to get good error recovery • Flex approach • Simple to use • Easy to modify token language

  46. The GNAT Scanner • Hand written (scn.adb/scn.ads) • Basically a call does • Super quick scan past blanks/comments etc • Big case statement • Process based on first character • Call special routines • Namet.Get_Name for identifier (hashing) • Keywords recognized by special hash • Strings (stringt.ads) • Integers (uintp.ads) • Reals (ureal.ads)

  47. More on the GNAT Scanner • Entire source read into memory • Single contiguous block • Source location is index into this block • Different index range for each source file • See sinput.adb/ads for source mgmt • See scans.ads for definitions of tokens

  48. More on GNAT Scanner • Read scn.adb code • Very easy reading, e.g.

  49. DTL (Dewar Trivial Language) • DTL Grammar • Program ::= DECLARE Decls BEGIN Stmts • Decls ::= {Decl}* • Stmts ::= {Stmt}+ • Type ::= INTEGER | REAL • Identifier ::= letter (_{digit}+)* • Decl ::= DECLARE identifier : Type

  50. DTL (Continued)

Integer_Literal ::= {digit}+
Real_Literal ::= {digit}+"."{digit}*
Stmt ::= Assignstmt | Ifstmt | Whilestmt
Assignstmt ::= Identifier := Expr
Expr ::= Literal | (Expr) Op (Expr)
Op ::= + | -
Literal ::= Integer_Literal | Real_Literal
Ifstmt ::= IF Expr Relop Expr THEN Stmts
Whilestmt ::= WHILE Expr Relop Expr DO Stmts
Relop ::= > | < | >= | <=
