Lexical Analysis

Lexical Analysis (C) Edmond Schonberg, New-York University

The Input
• Read string input
  • Might be a sequence of characters (Unix)
  • Might be a sequence of lines (VMS)
• Character set:
  • ASCII
  • ISO Latin-1
  • ISO 10646 (Unicode, originally a 16-bit encoding): Ada, Java
  • Others (EBCDIC, JIS, etc.)

The Output
• A series of tokens: kind, location, name (if any)
  • Punctuation: ( ) ; , [ ]
  • Operators: + - ** :=
  • Keywords: begin end if while try catch
  • Identifiers: Square_Root
  • String literals: "press Enter to continue"
  • Character literals: 'x'
  • Numeric literals
    • Integer: 123
    • Floating point: 4_5.23e+2
    • Based representation: 16#ac#

Free form vs Fixed form
• Free-form languages (most modern ones)
  • White space does not matter. Ignore these:
    • Tabs, spaces, new lines, carriage returns
  • Only the ordering of tokens is important
• Fixed-format languages (historical)
  • Layout is critical
    • Fortran: statement label in columns 1-5, continuation mark in column 6
    • COBOL: areas A and B
  • Lexical analyzer must know about layout to find tokens

Keywords
• Reserved identifiers
• E.g. BEGIN END in Pascal, if in C, catch in C++
• Returned as kind of token

Identifiers
• Rules differ
  • Length, allowed characters, separators
• Need to build a names table (symbol table)
  • Single entry for all occurrences of Var1
  • Language may be case insensitive: same entry for VAR1, vAr1, Var1
  • Typical structure: hash table
• Lexical analyzer returns token kind
  • And key (index) to table entry
  • Table entry includes location information
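
A minimal sketch of such a names table, assuming a Python dict as the hash table. The class name NamesTable, the method intern, and the entry layout are illustrative choices, not part of the original slides.

```python
class NamesTable:
    """Maps each distinct identifier to a small integer key (hypothetical design)."""

    def __init__(self, case_sensitive=True):
        self.case_sensitive = case_sensitive
        self._keys = {}      # normalized spelling -> key (a dict is a hash table)
        self._entries = []   # key -> entry with first spelling and locations

    def intern(self, spelling, line, column):
        """Return the key for spelling, creating an entry on first sight."""
        norm = spelling if self.case_sensitive else spelling.lower()
        key = self._keys.get(norm)
        if key is None:
            key = len(self._entries)
            self._keys[norm] = key
            self._entries.append({"spelling": spelling, "locations": []})
        # Every occurrence is recorded, so diagnostics can point at it later.
        self._entries[key]["locations"].append((line, column))
        return key

    def entry(self, key):
        return self._entries[key]


# Case-insensitive language: VAR1 and Var1 share one entry.
table = NamesTable(case_sensitive=False)
k1 = table.intern("Var1", 1, 1)
k2 = table.intern("VAR1", 2, 5)
k3 = table.intern("Other", 3, 1)
```

The entry keeps the first spelling seen, which is one way to let diagnostics follow the user's casing.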

String Literals
• Text must be stored
• Actual characters are important
  • Not like identifiers: must preserve casing
  • Character set issues: uniform internal representation
• Table needed
  • Lexical analyzer returns key into table
  • May or may not be worth hashing to avoid duplicates

Handling Comments
• Comments have no effect on program
• Can be eliminated by scanner
  • But may need to be retrieved by tools
• Error detection issues
  • E.g. unclosed comments
• Scanner skips over comments and returns next meaningful token

Case Equivalence
• Some languages are case-insensitive
  • Pascal, Ada
• Some are not
  • C, Java
• Lexical analyzer ignores case if needed
  • This_Routine = THIS_RouTine
• Error analysis may need exact casing
  • Friendly diagnostics follow user’s conventions

Performance Issues
• Speed
  • Lexical analysis can become bottleneck
  • Minimize processing per character
    • Skip blanks fast
  • I/O is also an issue (read large blocks)
• We compile frequently
  • Compilation time is important
    • Especially during development
• Communicate with parser through global variables

General approach to writing lexical analyzer
• Define set of token kinds:
  • An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign etc.)
  • Or a series of integer definitions in more primitive languages…
• Some tokens carry associated data
  • E.g. key for identifier table
• May be useful to build tree node
  • For identifiers, literals etc.
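
One possible rendering of the token kinds and their associated data, sketched in Python. The TokenKind members mirror the names on the slide; TOK_IDENT, the Token record, and its field names are assumptions added for illustration.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional


class TokenKind(Enum):
    TOK_INT = auto()
    TOK_IF = auto()
    TOK_PLUS = auto()
    TOK_LEFT_PAREN = auto()
    TOK_ASSIGN = auto()
    TOK_IDENT = auto()   # assumed extra kind, for identifiers


@dataclass(frozen=True)
class Token:
    kind: TokenKind
    line: int
    column: int
    key: Optional[int] = None   # index into the names or literal table, if any


# An operator carries no data; an identifier carries its table key.
tok_plus = Token(TokenKind.TOK_PLUS, line=1, column=7)
tok_id = Token(TokenKind.TOK_IDENT, line=1, column=1, key=0)
```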

Interface to Lexical Analyzer
• Either: Convert entire file to a file of tokens
  • Lexical analyzer is separate phase
• Or: Parser calls lexical analyzer to supply next token
  • This approach avoids extra I/O
  • Parser builds tree incrementally, using successive tokens as tree nodes

Relevant Formalisms
• Type 3 (Regular) Grammars
• Regular Expressions
• Finite State Machines
• Equivalent in expressive power
• Useful for program construction, even if hand-written

Regular Grammars
• Regular grammars
  • Non-terminals (arbitrary names)
  • Terminals (characters)
  • Productions limited to the following:
    • Non-terminal ::= terminal
    • Non-terminal ::= terminal Non-terminal
  • Treat character class (e.g. digit) as terminal
• Regular grammars cannot count: size limits on identifiers and literals can only be expressed clumsily (one non-terminal per length), so such limits are better checked in code
• Cannot express proper nesting (parentheses)

Grammars – an example
• Grammar for real literals with no exponent

    digit   ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    REAL    ::= digit REAL1
    REAL1   ::= digit REAL1      (arbitrary size)
    REAL1   ::= . INTEGER
    INTEGER ::= digit INTEGER    (arbitrary size)
    INTEGER ::= digit

• Start symbol is REAL

Regular Expressions
• Regular expressions (RE) defined by an alphabet (terminal symbols) and three operations:
  • Alternation: RE1 | RE2
  • Concatenation: RE1 RE2
  • Repetition: RE* (zero or more RE’s)
• The languages described by RE’s are exactly those generated by regular grammars
• Regular expressions are more convenient for some applications

Specifying RE’s in Unix Tools
• Single characters: a b c d \x
• Alternation: [bcd] [b-z] ab|cd
• Any character: . (period)
• Match sequence of characters: x* y+
• Concatenation: abc[d-q]
• Optional RE: [0-9]+(\.[0-9]*)?
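
The last pattern above can be tried directly with Python's re module, whose syntax agrees with the Unix notation for these operators. The helper name is_number is an illustrative assumption.

```python
import re

# Digits, optionally followed by a period and more (possibly zero) digits.
NUMBER = re.compile(r"[0-9]+(\.[0-9]*)?")


def is_number(text):
    """True if the whole of text matches the pattern."""
    return NUMBER.fullmatch(text) is not None
```

Note that the pattern accepts "12." (the fraction digits are optional) but rejects ".5" (at least one leading digit is required).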

Finite State Machines
• A language defined by a grammar is a (possibly infinite) set of strings
• An automaton is a computation that determines whether a given string belongs to a specified language
• A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions)
• Simplest automaton: memory is single number (state)

Specifying an FSM
• A set of labeled states
• Directed arcs between states labeled with character
• One or more states may be terminal (accepting)
• A distinguished state is start
• Automaton makes transition from state S1 to S2
  • If and only if arc from S1 to S2 is labeled with next character in input
• Token is legal if automaton stops on terminal state

Building FSM from Grammar
• One state for each non-terminal (S1 for Nt1, S2 for Nt2)
• A rule of the form
  • Nt1 ::= terminal
  • Generates transition from S1 to final state
• A rule of the form
  • Nt1 ::= terminal Nt2
  • Generates transition from S1 to S2 on an arc labeled by the terminal
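
The construction can be sketched as follows, applied to the real-literal grammar given earlier (REAL, REAL1, INTEGER). The rule encoding, the FINAL state name, and the function names are assumptions made for this example.

```python
DIGITS = "0123456789"
FINAL = "FINAL"   # hypothetical name for the single final state

# Each rule: (non-terminal, terminal characters, next non-terminal or None).
RULES = [
    ("REAL", DIGITS, "REAL1"),
    ("REAL1", DIGITS, "REAL1"),
    ("REAL1", ".", "INTEGER"),
    ("INTEGER", DIGITS, "INTEGER"),
    ("INTEGER", DIGITS, None),      # INTEGER ::= digit
]


def build_fsm(rules):
    """Return arcs as {(state, char): set of next states}."""
    arcs = {}
    for lhs, terminals, rhs in rules:
        target = FINAL if rhs is None else rhs
        for ch in terminals:
            arcs.setdefault((lhs, ch), set()).add(target)
    return arcs


def accepts(arcs, start, text):
    """Simulate the machine, tracking every state it could be in."""
    current = {start}
    for ch in text:
        current = set().union(*(arcs.get((s, ch), set()) for s in current))
        if not current:
            return False
    return FINAL in current


ARCS = build_fsm(RULES)
```

The two INTEGER rules produce two arcs on the same digit, so the machine built this way is non-deterministic; the simulation copes by following a set of states.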

Graphic representation
[Figure: state diagram with start state S and states Int and id; arcs labeled digit, letter, underscore]

Building FSM’s from RE’s
• Every RE corresponds to a grammar
• For all regular expressions
  • A natural translation to FSM exists
• Alternation often leads to non-deterministic machines

Non-Deterministic FSM
• A non-deterministic FSM
  • Has at least one state
  • With two arcs to two distinct states
  • Labeled with the same character
• Example: from start state, a digit can begin an integer literal or a real literal
• Implementation requires backtracking
  • Nasty

Deterministic FSM
• For all states S
  • For all characters C:
    • There is at most one arc from any state S that is labeled with C
• Much easier to implement
  • No backtracking

From NFSM to DFSM
• There is an algorithm for converting a non-deterministic machine to a deterministic one
  • Result may have exponentially more states
  • Intuitively: need new states to express uncertainty about token: int or real
• Algorithm is efficient in practice (e.g. grep)
• Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
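
The slide does not name the conversion algorithm; the sketch below uses the standard subset construction, in which each deterministic state is a set of non-deterministic states. The example machine (states S, INT, REAL, DOT, FRAC) is hypothetical, and the character "d" stands for any digit.

```python
def determinize(arcs, start, finals):
    """Subset construction. arcs: {(state, char): set of states}."""
    alphabet = {ch for (_, ch) in arcs}
    dfa_start = frozenset([start])
    dfa_arcs, dfa_finals = {}, set()
    pending, seen = [dfa_start], {dfa_start}
    while pending:
        subset = pending.pop()
        if subset & finals:               # accepting if any member accepts
            dfa_finals.add(subset)
        for ch in alphabet:
            target = frozenset(t for s in subset for t in arcs.get((s, ch), ()))
            if not target:
                continue
            dfa_arcs[(subset, ch)] = target
            if target not in seen:
                seen.add(target)
                pending.append(target)
    return dfa_arcs, dfa_start, dfa_finals


def run(dfa_arcs, dfa_start, dfa_finals, text):
    """Deterministic run: one state at a time, no backtracking."""
    state = dfa_start
    for ch in text:
        state = dfa_arcs.get((state, ch))
        if state is None:
            return False
    return state in dfa_finals


# From S, a digit may begin an integer (INT) or a real (REAL).
NFA = {
    ("S", "d"): {"INT", "REAL"},
    ("INT", "d"): {"INT"},
    ("REAL", "d"): {"REAL"},
    ("REAL", "."): {"DOT"},
    ("DOT", "d"): {"FRAC"},
    ("FRAC", "d"): {"FRAC"},
}
DFA_ARCS, DFA_START, DFA_FINALS = determinize(NFA, "S", {"INT", "FRAC"})
```

The combined state {INT, REAL} is the "uncertainty" state: the machine has seen digits and does not yet know which token it is reading.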

Implementing the Scanner
• Three methods
  • Hand-coded approach:
    • Draw DFSM, then implement with loop and case statement
  • Hybrid approach:
    • Define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM
    • Hand-code resulting DFSM
  • Automated approach:
    • Use regular expressions as input to a lexical scanner generator (e.g. LEX)

Hand-coding
• Normal coding techniques
• Scan over white space and comments till non-blank character found
• Branch depending on first character:
  • If digit, scan numeric literal
  • If letter, scan identifier or keyword
  • If operator, check next character (++, etc.)
• Need table to determine character type efficiently
• Return token found
• Write aggressive efficient code: goto’s, global variables
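
The branching structure above can be sketched as a small Python generator. It is a simplification under stated assumptions: the keyword set, the token kind names, and the operator set are illustrative, comments are not handled, and Python's str methods stand in for the character-type table.

```python
KEYWORDS = {"begin", "end", "if", "while"}   # assumed, case-insensitive


def scan(text):
    """Yield (kind, spelling) pairs for a toy token language."""
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in " \t\r\n":                  # skip white space fast
            i += 1
            continue
        start = i
        if ch.isdigit():                     # numeric literal
            while i < n and text[i].isdigit():
                i += 1
            yield ("int", text[start:i])
        elif ch.isalpha():                   # identifier or keyword
            while i < n and (text[i].isalnum() or text[i] == "_"):
                i += 1
            word = text[start:i]
            kind = "keyword" if word.lower() in KEYWORDS else "ident"
            yield (kind, word)
        elif text.startswith((":=", "**"), i):   # check next character
            i += 2
            yield ("op", text[start:i])
        elif ch in "+-*/":
            i += 1
            yield ("op", ch)
        elif ch in "();,:":
            i += 1
            yield ("punct", ch)
        else:
            raise ValueError(f"unrecognized character {ch!r} at offset {i}")
```

For example, list(scan("if x1 := 12 ** y;")) produces seven tokens, starting with ("keyword", "if").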

Using grammar and FSM
• Start with regular grammar or RE
• Typically found in the language reference
• Example (Ada):
  • Chapter 2. Lexical Elements

    digit           ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    decimal-literal ::= integer [.integer] [exponent]
    integer         ::= digit {[underline] digit}
    exponent        ::= E [+] integer | E - integer
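
The Ada productions above translate almost symbol for symbol into a regular expression. The sketch below uses Python's re module; the names INTEGER, EXPONENT, DECIMAL_LITERAL and is_decimal_literal are illustrative, and accepting a lowercase e is an assumption based on Ada's case insensitivity.

```python
import re

INTEGER = r"[0-9](_?[0-9])*"            # digit {[underline] digit}
EXPONENT = rf"[Ee][+-]?{INTEGER}"       # E [+] integer | E - integer
DECIMAL_LITERAL = re.compile(rf"{INTEGER}(\.{INTEGER})?({EXPONENT})?")


def is_decimal_literal(text):
    """True if the whole of text is a decimal literal per the grammar."""
    return DECIMAL_LITERAL.fullmatch(text) is not None
```

The optional underline before each later digit rules out leading, trailing, and doubled underlines without any extra checks.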

Using grammar and FSM
• Create one state for each non-terminal
• Label edges according to productions in grammar
• Each state becomes a label in the program
• Code for each state is a switch on next character, corresponding to edges out of current state
• If no possible transition on next character, then:
  • If state is accepting, return the corresponding token
  • If state is not accepting, report error

Hand-coded version:
• Each state is encoded as follows:

    <<state1>>
    case Next_Character is
       when 'a'    => goto state3;
       when 'b'    => goto state1;
       when others => End_of_token_processing;
    end case;
    <<state2>>
    …

• No explicit mention of state of automaton

Translating from FSM to code
• A variable holds the current state:

    loop
       case State is
          when state1 =>
             case Next_Character is
                when 'a'    => State := state3;
                when 'b'    => State := state1;
                when others => End_of_token_processing;
             end case;
          when state2 …
          …
       end case;
    end loop;

Automatic scanner construction
• LEX builds a transition table, indexed by state and by character.
• Code gets transition from table:

    Tab : array (State, Character) of State := …
    begin
       while More_Input loop
          Curstate := Tab (Curstate, Next_Char);
          if Curstate = Error_State then …
       end loop;
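
A table-driven loop of this shape can be sketched in Python. The table below is written by hand for a two-token machine (integers and identifiers) rather than generated by LEX, and it indexes by character class instead of by character to stay small. All names are assumptions for the example.

```python
START, IN_INT, IN_ID, ERROR = range(4)
ACCEPTING = {IN_INT: "int", IN_ID: "ident"}


def char_class(ch):
    """Collapse characters into the classes used to index the table."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "letter"
    if ch == "_":
        return "underscore"
    return "other"


# Missing entries mean "no transition" (the error state).
TAB = {
    (START, "digit"): IN_INT,
    (START, "letter"): IN_ID,
    (IN_INT, "digit"): IN_INT,
    (IN_ID, "letter"): IN_ID,
    (IN_ID, "digit"): IN_ID,
    (IN_ID, "underscore"): IN_ID,
}


def next_token(text, pos):
    """Return (kind, spelling, new_pos) for the longest token at pos."""
    state, i = START, pos
    while i < len(text):
        nxt = TAB.get((state, char_class(text[i])), ERROR)
        if nxt == ERROR:
            break                        # no transition: token ends here
        state, i = nxt, i + 1
    if state not in ACCEPTING:
        raise ValueError(f"no token at offset {pos}")
    return ACCEPTING[state], text[pos:i], i
```

The scanning loop never changes when the token language does; only TAB and ACCEPTING are replaced, which is what makes generated scanners easy to modify.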

Automatic FSM Generation
• Our example, FLEX
  • See home page for manual in HTML
• FLEX is given
  • A set of regular expressions
  • Actions associated with each RE
• It builds a scanner
  • Which matches RE’s and executes actions

An Example of a Flex scanner

    DIGIT    [0-9]
    ID       [a-z][a-z0-9]*
    %%
    {DIGIT}+    {
                printf ("an integer %s (%d)\n", yytext, atoi (yytext));
                }
    {DIGIT}+"."{DIGIT}*    {
                printf ("a float %s (%g)\n", yytext, atof (yytext));
                }
    if|then|begin|end|procedure|function    {
                printf ("a keyword: %s\n", yytext);
                }

Flex Example (continued)

    {ID}               printf ("an identifier %s\n", yytext);
    "+"|"-"|"*"|"/"    { printf ("an operator %s\n", yytext); }
    "--".*\n           /* eat Ada style comment */
    [ \t\n]+           /* eat white space */
    .                  printf ("unrecognized character");
    %%

Assembling the flex program

    %{
    #include <stdlib.h>   /* for atoi and atof */
    %}
    <<flex text we gave goes here; it already ends with the %% that opens the user code section>>
    int main (int argc, char **argv)
    {
        if (argc > 1)
            yyin = fopen (argv[1], "r");   /* otherwise the scanner reads stdin */
        yylex();
        return 0;
    }

Choice Between Methods?
• Hand written scanners
  • Typically much faster execution
  • Easy to write (standard structure)
  • Preferable for good error recovery
• Flex approach
  • Simple to use
  • Easy to modify token language
