Lexical Analysis

The Input • Read string input • Might be sequence of characters (Unix) • Might be sequence of lines (VMS) • Character set: • ASCII • ISO Latin-1 • ISO 10646 (16-bit = unicode) Ada, Java • Others (EBCDIC, JIS, etc)

The Output • A series of tokens: kind, location, name (if any) • Punctuation ( ) ; , [ ] • Operators + - ** := • Keywords begin end if while try catch • Identifiers Square_Root • String literals "press Enter to continue" • Character literals 'x' • Numeric literals • Integer: 123 • Floating_point: 4_5.23e+2 • Based representation: 16#ac#

Free form vs Fixed form • Free form languages (all modern ones) • White space does not matter. Ignore these: • Tabs, spaces, new lines, carriage returns • Only the ordering of tokens is important • Fixed format languages (historical) • Layout is critical • Fortran, label in cols 1-5, continuation mark in col 6 • COBOL, area A B • Lexical analyzer must know about layout to find tokens

Punctuation: Separators • Typically individual special characters such as ( { } : .. (two dots) • Sometimes double characters: lexical scanner looks for longest token: • (*, /* -- comment openers in various languages • Returned just as identity (kind) of token • And perhaps location for error messages and debugging purposes

Operators • Like punctuation • No real difference for lexical analyzer • Typically single or double special chars • Operators + - == <= • Operations := => • Returned as kind of token • And perhaps location
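
The "longest token" rule for double-character punctuation and operators can be sketched in C as follows. This is only an illustration: the operator set and the name match_operator are assumptions, not taken from any particular language definition.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical operator table, longest spellings first so that the
   first match found is also the longest one. */
static const char *const OPERATORS[] = {
    "**", ":=", "=>", "<=", "==", "..",   /* two-character tokens */
    "+", "-", "*", "<", "=", ":", ".",    /* one-character tokens */
};

/* Returns the length of the longest operator starting at s, or 0. */
size_t match_operator(const char *s)
{
    size_t count = sizeof OPERATORS / sizeof OPERATORS[0];
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(OPERATORS[i]);
        if (strncmp(s, OPERATORS[i], len) == 0)
            return len;
    }
    return 0;
}
```

With this table, the input `**2` yields the two-character token `**` rather than two `*` tokens. A production scanner would more likely switch on the first character than search a table.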

Keywords • Reserved identifiers • E.g. BEGIN END in Pascal, if in C, catch in C++ • Maybe distinguished from identifiers • E.g. mode (keyword, set in bold or otherwise stropped) vs mode (identifier) in Algol-68 • Returned as kind of token • With possible location information • Oddity: unreserved keywords in PL/1 • IF IF THEN THEN = THEN + 1; • Handled as identifiers (parser disambiguates)

Identifiers • Rules differ • Length, allowed characters, separators • Need to build a names table • Single entry for all occurrences of Var1 • Language may be case insensitive: same entry for VAR1, vAr1, Var1 • Typical structure: hash table • Lexical analyzer returns token kind • And key (index) to table entry • Table entry includes location information

Organization of names table • Most common structure is hash table • With fixed number of headers • Chain according to hash code • Serial search on one chain • Hash code computed from characters (e.g. sum mod table size). • No hash code is perfect! Expect collisions. • Avoid any arbitrary limits on table or chain size.
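
A minimal C sketch of such a names table, assuming a fixed array of chain headers and the "sum mod table size" hash from the slide. The names HEADER_COUNT, Name and intern are illustrative, not from any real compiler.

```c
#include <stdlib.h>
#include <string.h>

#define HEADER_COUNT 211          /* fixed number of chain headers */

typedef struct Name {
    char        *text;            /* spelling of the identifier */
    struct Name *next;            /* next entry on the same chain */
} Name;

static Name *headers[HEADER_COUNT];

/* Hash code from the characters: sum of character codes mod table size. */
static unsigned hash(const char *s)
{
    unsigned sum = 0;
    while (*s)
        sum += (unsigned char)*s++;
    return sum % HEADER_COUNT;
}

/* Returns the single table entry for text, creating it on first use.
   Chains grow without limit; returns NULL only if memory runs out. */
Name *intern(const char *text)
{
    unsigned h = hash(text);
    for (Name *n = headers[h]; n != NULL; n = n->next)   /* serial search */
        if (strcmp(n->text, text) == 0)
            return n;

    Name *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    n->text = malloc(strlen(text) + 1);
    if (n->text == NULL) {
        free(n);
        return NULL;
    }
    strcpy(n->text, text);
    n->next = headers[h];
    headers[h] = n;
    return n;
}
```

The additive hash sends `ab` and `ba` to the same chain, which shows why collisions must be expected. For a case-insensitive language the caller would fold case before calling intern, so that VAR1 and Var1 share one entry.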

String Literals • Text must be stored • Actual characters are important • Not like identifiers: must preserve casing • Character set issues: uniform internal representation • Table needed • Lexical analyzer returns key into table • May or may not be worth hashing to avoid duplicates

Character Literals • Similar issues to string literals • Lexical Analyzer returns • Token kind • Identity of character • Cannot assume character set of host machine, may be different

Numeric Literals • need a table to store numeric value • E.g. 123 = 0123 = 01_23 (Ada) • But cannot use predefined type for values • Because may have different bounds • Floating point representations much more complex • Denormals, correct rounding • Very delicate to compute correct value. • Host / target issues
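
As a rough illustration of evaluating integer literals, the C sketch below converts the digits of an Ada-style literal, skipping underscores. It uses uint64_t with an overflow check purely as a simplifying assumption. As the slide notes, a real compiler cannot rely on a predefined host type and would typically use arbitrary-precision arithmetic. The name literal_value is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>

/* Converts the digits of an integer literal such as "01_23" to its value
   in the given base (2..16), skipping underscores.
   Returns false on an invalid digit or if the value overflows uint64_t. */
bool literal_value(const char *digits, unsigned base, uint64_t *value)
{
    uint64_t result = 0;
    for (const char *p = digits; *p != '\0'; p++) {
        if (*p == '_')
            continue;                       /* underscores carry no value */
        unsigned d;
        if (isdigit((unsigned char)*p))
            d = (unsigned)(*p - '0');
        else if (isxdigit((unsigned char)*p))
            d = (unsigned)(tolower((unsigned char)*p) - 'a') + 10;
        else
            return false;
        if (d >= base)
            return false;
        if (result > (UINT64_MAX - d) / base)
            return false;                   /* would overflow the host type */
        result = result * base + d;
    }
    *value = result;
    return true;
}
```

Here 123, 0123 and 01_23 all produce the same value in base 10. The digits between the # marks of 16#ac# would be passed with base 16.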

Handling Comments • Comments have no effect on program • Can be eliminated by scanner • But may need to be retrieved by tools • Error detection issues • E.g. unclosed comments • Scanner skips over comments and returns next meaningful token
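
Skipping a comment while detecting the unclosed case might look like the following C sketch. It assumes C-style bracketed comments and a NUL-terminated source buffer. The function name and the -1 error convention are illustrative choices.

```c
#include <stddef.h>

/* Skips one C-style comment starting at src[pos] (which must point at the
   opening slash-star). Returns the index just past the closing star-slash,
   or -1 if the input ends before the comment is closed. */
long skip_comment(const char *src, size_t pos)
{
    pos += 2;                                /* step over the opener */
    while (src[pos] != '\0') {
        if (src[pos] == '*' && src[pos + 1] == '/')
            return (long)(pos + 2);
        pos++;
    }
    return -1;                               /* unclosed comment */
}
```

A scanner would call this from its blank-skipping loop and, on -1, report the error at the location of the comment opener, not at end of file.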

Case Equivalence • Some languages are case-insensitive • Pascal, Ada • Some are not • C, Java • Lexical analyzer ignores case if needed • This_Routine = THIS_RouTine • Error analysis may need exact casing • Friendly diagnostics follow user’s conventions

Performance Issues • Speed • Lexical analysis can become bottleneck • Minimize processing per character • Skip blanks fast • I/O is also an issue (read large blocks) • We compile frequently • Compilation time is important • Especially during development • Communicate with parser through global variables

General Approach • Define set of token kinds: • An enumeration type (tok_int, tok_if, tok_plus, tok_left_paren, tok_assign etc). • Or a series of integer definitions in more primitive languages… • Some tokens carry associated data • E.g. key for identifier table • May be useful to build tree node • For identifiers, literals etc
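
One possible C rendering of this approach is an enumeration of token kinds plus a record carrying the associated data. The kinds tok_identifier and tok_eof, the field names, and make_identifier are assumptions added for illustration.

```c
#include <stddef.h>

/* Token kinds as an enumeration type (first five names follow the slide). */
typedef enum {
    tok_int, tok_if, tok_plus, tok_left_paren, tok_assign,
    tok_identifier, tok_eof
} TokenKind;

/* A token: its kind, its location, and data for kinds that carry some. */
typedef struct {
    TokenKind kind;
    int       line;        /* location, for diagnostics */
    int       column;
    union {
        long   int_value;  /* tok_int: value of the literal */
        size_t name_key;   /* tok_identifier: key into the names table */
    } data;
} Token;

/* Convenience constructor for an identifier token. */
Token make_identifier(size_t name_key, int line, int column)
{
    Token t;
    t.kind = tok_identifier;
    t.line = line;
    t.column = column;
    t.data.name_key = name_key;
    return t;
}
```

Tokens such as tok_plus or tok_if need only the kind and location, so the union is left unused for them.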

Interface to Lexical Analyzer • Either: Convert entire file to a file of tokens • Lexical analyzer is separate phase • Or: Parser calls lexical analyzer to supply next token • This approach avoids extra I/O • Parser builds tree incrementally, using successive tokens as tree nodes

Relevant Formalisms • Type 3 (Regular) Grammars • Regular Expressions • Finite State Machines • Equivalent in expressive power • Useful for program construction, even if hand-written

Regular Grammars • Regular grammars • Non-terminals (arbitrary names) • Terminals (characters) • Productions limited to the following: • Non-terminal ::= terminal • Non-terminal ::= terminal Non-terminal • Treat character class (e.g. digit) as terminal • Regular grammars cannot count without bound: a size limit on identifiers or literals can be expressed only by spelling out every allowed length • Cannot express proper nesting (parentheses)

Regular Grammars • Grammar for real literals with no exponent • digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • REAL ::= digit REAL1 • REAL1 ::= digit REAL1 (arbitrary size) • REAL1 ::= . INTEGER • INTEGER ::= digit INTEGER (arbitrary size) • INTEGER ::= digit • Start symbol is REAL

Regular Expressions • Regular expressions (RE) defined by an alphabet (terminal symbols) and three operations: • Alternation RE1 | RE2 • Concatenation RE1 RE2 • Repetition RE* (zero or more RE’s) • RE’s describe exactly the languages generated by regular grammars • Regular expressions are more convenient for some applications

Specifying RE’s in Unix Tools • Single characters a b c d \x • Alternation [bcd] [b-z] ab|cd • Any character . (period) • Match sequence of characters x* y+ • Concatenation abc[d-q] • Optional RE [0-9]+(\.[0-9]*)?
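
These notations can be tried from C through the POSIX regcomp and regexec functions, as in the sketch below. The helper name matches is an assumption. It uses extended syntax (REG_EXTENDED) and is not available on non-POSIX systems.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if text contains a match for the extended regular
   expression pattern. Anchor the pattern with ^ and $ to require that
   the whole text matches. */
bool matches(const char *pattern, const char *text)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;                       /* pattern did not compile */
    bool ok = regexec(&re, text, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```

For example, `matches("^[0-9]+(\\.[0-9]*)?$", "3.14")` tests the optional-fraction pattern from the slide against a complete string. The backslash is doubled only because of C string syntax.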

Finite State Machines • A language defined by a grammar is a (possibly infinite) set of strings • An automaton is a computation that determines whether a given string belongs to a specified language • A finite state machine (FSM) is an automaton that recognizes regular languages (regular expressions) • Simplest automaton: memory is single number (state)

Specifying an FSM • A set of labeled states • Directed arcs between states labeled with character • One or more states may be terminal (accepting) • A distinguished state is start • Automaton makes transition from state S1 to S2 • If and only if arc from S1 to S2 is labeled with next character in input • Token is legal if automaton stops on terminal state

Building FSM from Grammar • One state for each non-terminal • A rule of the form • Nt1 ::= terminal • Generates transition from S1 to final state • A rule of the form • Nt1 ::= terminal Nt2 • Generates transition from S1 to S2 on an arc labeled by the terminal

Graphic representation • [Figure: state diagram of an FSM with start state S and states Int and id; arcs are labeled digit, letter and underscore]

Building FSM’s from RE’s • Every RE corresponds to a grammar • For all regular expressions • A natural translation to FSM exists • Alternation often leads to non-deterministic machines

Non-Deterministic FSM • A non-deterministic FSM • Has at least one state • With two arcs to two distinct states • Labeled with the same character • Example: from start state, a digit can begin an integer literal or a real literal • Implementation requires backtracking • Nasty 

Deterministic FSM • For all states S • For all characters C: • There is at most one arc from any state S that is labeled with C • Much easier to implement • No backtracking 

From NFSM to DFSM • There is an algorithm for converting a non-deterministic machine to a deterministic one • Result may have exponentially more states • Intuitively: need new states to express uncertainty about token: int or real • Algorithm is efficient in practice (e.g. grep) • Other algorithms for minimizing number of states of FSM, for showing equivalence, etc.
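
The idea behind the conversion can be seen by running a non-deterministic machine on a set of states instead of a single state. Each distinct set then plays the role of one DFSM state. The C sketch below does this for a small hypothetical NFSM that accepts integer literals (digits) and real literals (digits, a period, digits). The state names and the bit-mask encoding are illustrative assumptions. It builds the sets on the fly and does not precompute the deterministic table.

```c
#include <ctype.h>

/* Hypothetical NFSM states, one bit each: START, IN_INT (accepting),
   IN_REAL (integer part of a real), AFTER_DOT, IN_FRAC (accepting).
   A digit read in START leads to both IN_INT and IN_REAL. */
enum { START = 1u << 0, IN_INT = 1u << 1, IN_REAL = 1u << 2,
       AFTER_DOT = 1u << 3, IN_FRAC = 1u << 4 };

/* One step of the machine on a set of states, encoded as a bit mask. */
static unsigned step(unsigned states, char c)
{
    unsigned next = 0;
    int digit = isdigit((unsigned char)c) != 0;
    if ((states & START)     && digit)    next |= IN_INT | IN_REAL;
    if ((states & IN_INT)    && digit)    next |= IN_INT;
    if ((states & IN_REAL)   && digit)    next |= IN_REAL;
    if ((states & IN_REAL)   && c == '.') next |= AFTER_DOT;
    if ((states & AFTER_DOT) && digit)    next |= IN_FRAC;
    if ((states & IN_FRAC)   && digit)    next |= IN_FRAC;
    return next;
}

/* Returns 1 if text is an integer literal, 2 if it is a real literal,
   0 if it is neither. No backtracking is needed. */
int classify(const char *text)
{
    unsigned states = START;
    for (; *text != '\0'; text++)
        states = step(states, *text);
    if (states & IN_INT)  return 1;
    if (states & IN_FRAC) return 2;
    return 0;
}
```

After the first digit the machine is in the set {IN_INT, IN_REAL}, which is exactly the "uncertainty about token: int or real" that a new DFSM state has to express.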

Implementing the Scanner • Three methods • Hand-coded approach: • draw DFSM, then implement with loop and case statement • Hybrid approach: • define tokens using regular expressions, convert to NFSM, apply algorithm to obtain minimal DFSM • Hand-code resulting DFSM • Automated approach: • Use regular grammar as input to lexical scanner generator (e.g. LEX)

Hand-coding • Normal coding techniques • Scan over white space and comments till non-blank character found. • Branch depending on first character: • If digit, scan numeric literal • If letter, scan identifier or keyword • If operator, check next character (++, etc.) • Need table to determine character type efficiently • Return token found • Write aggressive efficient code: goto’s, global variables
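
The steps above can be sketched in C roughly as follows. This is a simplified illustration and not a complete scanner: comments, keyword lookup and real literals are left out, the two-character operator rule is a crude assumption, and the letter ranges assume an ASCII-compatible host character set. All names are hypothetical.

```c
#include <stddef.h>

typedef enum { T_EOF, T_NUMBER, T_IDENT, T_OPERATOR, T_ERROR } Kind;
typedef enum { C_OTHER, C_BLANK, C_DIGIT, C_LETTER, C_OPER } CharClass;

static CharClass char_class[256];   /* one table lookup per character */

void init_char_classes(void)
{
    for (int c = '0'; c <= '9'; c++) char_class[c] = C_DIGIT;
    for (int c = 'a'; c <= 'z'; c++) char_class[c] = C_LETTER;  /* ASCII assumed */
    for (int c = 'A'; c <= 'Z'; c++) char_class[c] = C_LETTER;
    for (const char *p = " \t\r\n"; *p; p++) char_class[(unsigned char)*p] = C_BLANK;
    for (const char *p = "+-*/<>=:"; *p; p++) char_class[(unsigned char)*p] = C_OPER;
}

/* Scans one token starting at *pos. On return *start is the index of its
   first character and *pos is the index just past it. */
Kind next_token(const char *src, size_t *pos, size_t *start)
{
    size_t i = *pos;
    while (char_class[(unsigned char)src[i]] == C_BLANK)
        i++;                                        /* skip blanks fast */
    *start = i;
    Kind kind;
    switch (char_class[(unsigned char)src[i]]) {    /* branch on first char */
    case C_DIGIT:
        while (char_class[(unsigned char)src[i]] == C_DIGIT) i++;
        kind = T_NUMBER;
        break;
    case C_LETTER:
        while (char_class[(unsigned char)src[i]] == C_LETTER
               || char_class[(unsigned char)src[i]] == C_DIGIT
               || src[i] == '_') i++;
        kind = T_IDENT;
        break;
    case C_OPER:
        i++;
        if (src[i] == '=' || src[i] == src[i - 1]) i++;  /* e.g. ":=" or "++" */
        kind = T_OPERATOR;
        break;
    default:
        if (src[i] == '\0') { kind = T_EOF; }
        else { i++; kind = T_ERROR; }
        break;
    }
    *pos = i;
    return kind;
}
```

After calling init_char_classes once, the parser would call next_token repeatedly until it returns T_EOF. The NUL terminator falls in class C_OTHER, so every inner loop stops at end of input without a separate bounds check.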

Using grammar and FSM • Start with regular grammar or RE • Typically found in the language reference • Example (Ada): • Chapter 2. Lexical Elements • digit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • decimal-literal ::= integer [.integer][exponent] • integer ::= digit {[underline] digit} • exponent ::= E [+] integer | E - integer

Using grammar and FSM • Create one state for each non-terminal • Label edges according to productions in grammar • Each state becomes a label in the program • Code for each state is a switch on next character, corresponding to edges out of current state • If no possible transition on next character, then: • If state is accepting, return the corresponding token • If state is not accepting, report error

Hand-coded version: • Each state is encoded as follows:
    <<state1>>
    case Next_Character is
        when 'a' => goto state3;
        when 'b' => goto state1;
        when others => End_of_token_processing;
    end case;
    <<state2>> …
• No explicit mention of state of automaton

Translating from FSM to code • Variable holds current state:
    loop
        case State is
            when state1 =>
                case Next_Character is
                    when 'a' => State := state3;
                    when 'b' => State := state1;
                    when others => End_of_token_processing;
                end case;
            when state2 … …
        end case;
    end loop;

Automatic scanner construction • LEX builds a transition table, indexed by state and by character. • Code gets transition from table:
    Tab : array (State, Character) of State := …
    begin
        while More_Input loop
            Curstate := Tab (Curstate, Next_Char);
            if Curstate = Error_State then …
        end loop;
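
A table-driven loop of this kind can be sketched in C as below, here for a small hand-built machine that recognizes integer and real literals. To keep the table small it is indexed by character class instead of by character. The states, classes and function names are illustrative assumptions, not the output of LEX.

```c
#include <ctype.h>

enum { S_START, S_INT, S_DOT, S_FRAC, S_ERROR, STATE_COUNT };
enum { CH_DIGIT, CH_DOT, CH_OTHER, CLASS_COUNT };

/* Transition table indexed by state and character class. */
static const int table[STATE_COUNT][CLASS_COUNT] = {
    /*              digit    '.'      other   */
    /* S_START */ { S_INT,   S_ERROR, S_ERROR },
    /* S_INT   */ { S_INT,   S_DOT,   S_ERROR },
    /* S_DOT   */ { S_FRAC,  S_ERROR, S_ERROR },
    /* S_FRAC  */ { S_FRAC,  S_ERROR, S_ERROR },
    /* S_ERROR */ { S_ERROR, S_ERROR, S_ERROR },
};

static int classify_char(char c)
{
    if (isdigit((unsigned char)c)) return CH_DIGIT;
    if (c == '.') return CH_DOT;
    return CH_OTHER;
}

/* Runs the machine over text and returns the final state. */
int run(const char *text)
{
    int state = S_START;
    for (; *text != '\0' && state != S_ERROR; text++)
        state = table[state][classify_char(*text)];
    return state;
}
```

S_INT and S_FRAC are the accepting states. Ending in S_DOT (as for `3.`) or S_ERROR means the text is not a legal token for this machine.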

Automatic FSM Generation • Our example, FLEX • See home page for manual in HTML • FLEX is given • A set of regular expressions • Actions associated with each RE • It builds a scanner • Which matches RE’s and executes actions

Flex General Format • Input to Flex is a set of rules: • Regexp actions (C statements) • Regexp actions (C statements) • … • Flex scans the longest matching Regexp • And executes the corresponding actions

An Example of a Flex scanner
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+    { printf ("an integer %s (%d)\n", yytext, atoi (yytext)); }
{DIGIT}+"."{DIGIT}*    { printf ("a float %s (%g)\n", yytext, atof (yytext)); }
if|then|begin|end|procedure|function    { printf ("a keyword: %s\n", yytext); }

Flex Example (continued)
{ID}    printf ("an identifier %s\n", yytext);
"+"|"-"|"*"|"/"    { printf ("an operator %s\n", yytext); }
"--".*\n    /* eat Ada style comment */
[ \t\n]+    /* eat white space */
.    printf ("unrecognized character");
%%

Assembling the flex program
%{
#include <stdlib.h>   /* for atoi and atof */
%}
<<flex text we gave goes here, ending with its closing %%>>
int main (int argc, char **argv)
{
    if (argc > 1)
        yyin = fopen (argv[1], "r");   /* otherwise scan standard input */
    yylex ();
    return 0;
}

Running flex • flex is an executable program • The input is lexical grammar as described • The output is C source for the scanner (lex.yy.c by default), which is then compiled into a running program • For Ada fans • Look at aflex (www.adapower.com) • For C++ fans • flex can run in C++ mode • Generates appropriate classes

Choice Between Methods? • Hand written scanners • Typically much faster execution • Easy to write (standard structure) • Preferable for good error recovery • Flex approach • Simple to Use • Easy to modify token language

The GNAT Scanner • Hand written (scn.adb/scn.ads) • Each call does: • Optimal scan past blanks/comments etc. • Processing based on first character • Call special routines for major classes: • Namet.Get_Name for identifier (hashing) • Keywords recognized by special hash • Strings (scn-slit.adb): • complication with “+”, “and”, etc. (string or operator?) • Numeric literals (scn-nlit.adb): • complication with based literals: 16#FFF#

Historical oddities Because early keypunch machines were unreliable, FORTRAN treats blanks as optional: lexical analysis and parsing are intertwined. • DO10I=1.6 3 tokens: • identifier operator literal • DO10I = 1.6 • DO10I=1,6 7 tokens: • Keyword stmt id operator literal comma literal • DO 10 I = 1 , 6 • Celebrated NASA failure caused by this bug (?)
