Lexical Analysis in Programming Languages: From Sequences to Lexemes

Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker

Lexical Analysis is Pattern Matching
• From a sequence of characters to a sequence of lexemes, e.g.
  • “public static void main(char[] args)” ->
  • <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen>
• Patterns are simpler (easy grammars), e.g.
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z
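The mapping on this slide can be sketched in a few lines of Java. This is a minimal illustration, not the course's scanner: the class name `TokenSketch`, the method `tokenize`, and the `<unknown>` token name are assumptions, and letters are limited to a..z as in the `<letter>` rule.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenSketch {
    // <letter> -> a | b | c | ... | z
    static boolean isLetter(char c) {
        return c >= 'a' && c <= 'z';
    }

    // Turns a character sequence into a sequence of token names.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                   // whitespace only separates lexemes
            } else if (isLetter(c)) {
                while (i < input.length() && isLetter(input.charAt(i))) {
                    i++;                               // <id> -> <letter> <id> | <letter>
                }
                tokens.add("<id>");
            } else {
                switch (c) {
                    case '(': tokens.add("<lparen>"); break;
                    case ')': tokens.add("<rparen>"); break;
                    case '[': tokens.add("<lsquare>"); break;
                    case ']': tokens.add("<rsquare>"); break;
                    default:  tokens.add("<unknown>"); // assumed name for anything else
                }
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ", tokenize("public static void main(char[] args)")));
    }
}
```

Run on the slide's input, the sketch prints the same ten token names listed above.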

Regular Grammars
• Subset of Context Free Grammars
• The right-hand side of every rule contains at most one non-terminal symbol (or can be rewritten so it does…)

Rewritten Grammar for ID
• Original:
  <id> -> <letter> <id> | <letter>
  <letter> -> a | b | c | … | z
• Rewrite:
  <id> -> (a | b | c | … | z) <id> | (a | b | c | … | z)
• Fully expanded (52 rules):
  <id> -> a <id> | b <id> | c <id> | … | a | b | c | … | z
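The count of 52 comes from 26 rules of the form `<id> -> x <id>` plus 26 of the form `<id> -> x`. A small sketch that generates the expanded list makes this concrete; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class ExpandedIdGrammar {
    // Builds the fully expanded rule list for <id> over the letters a..z.
    static List<String> rules() {
        List<String> rules = new ArrayList<>();
        for (char c = 'a'; c <= 'z'; c++) {
            rules.add("<id> -> " + c + " <id>");   // letter followed by more of the identifier
        }
        for (char c = 'a'; c <= 'z'; c++) {
            rules.add("<id> -> " + c);             // letter that ends the identifier
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> all = rules();
        System.out.println(all.size() + " rules");
        System.out.println(all.get(0));
        System.out.println(all.get(all.size() - 1));
    }
}
```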

Parsing using a Regular Grammar
• Transform the grammar into a state machine
• Implement the state machine in a computer program
  • By hand
  • Automatically, using table-lookup
• Run this program on input strings

What is a State Machine?
• State machine abstraction
  • At any time, the process is in a “state”
  • Each time an “event” happens, the process takes an “action” and goes to the next state
• We can describe the entire algorithm as a diagram where each state has an arrow for each event/action pair to the next appropriate state

State Machine for a Kitten
• [Diagram not reproduced] States: Happy, Hungry, Sleeping
• Transitions (event / action): Food available / Eat; Toys available / Play; X hrs passed / Awaken

State Machine for a Language
• Each “event” processes an input symbol
• Two important special states
  • Initial state: state the machine is in before the first symbol
  • Final state: state the machine is in whenever the sequence of symbols up to now is in the language

Transforming a Regular Grammar to a State Machine
• Put the grammar into a form so every rule is
  <nonterm1> -> symbol <nonterm2>
  <nonterm1> -> symbol
• Make a state for each nonterminal
• Make a transition (arrow) for each rule. The transition goes from <nonterm1> to <nonterm2> based on the symbol.
• The start symbol of the grammar is the initial state.
• There is one final state; every rule that doesn’t have a nonterminal on the right goes to it.
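The construction above is mechanical enough to sketch in code. In this illustration each rule is a small string array, `{lhs, symbol, rhs}` or `{lhs, symbol}`; that encoding, the class name, and the state name `f` for the single final state are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class GrammarToMachine {
    static final String FINAL = "f";   // assumed name for the one final state

    // Produces one arrow per rule: "from --symbol--> to".
    static List<String> build(String[][] rules) {
        List<String> arrows = new ArrayList<>();
        for (String[] r : rules) {
            // A rule with no nonterminal on the right goes to the final state.
            String target = (r.length == 3) ? r[2] : FINAL;
            arrows.add(r[0] + " --" + r[1] + "--> " + target);
        }
        return arrows;
    }

    public static void main(String[] args) {
        String[][] rules = {
            {"id", "a", "id"}, {"id", "b", "id"},   // <id> -> a <id> | b <id>
            {"id", "a"},       {"id", "b"}          // <id> -> a | b
        };
        for (String arrow : build(rules)) {
            System.out.println(arrow);
        }
    }
}
```

The nonterminals on the left-hand sides become the states, and the first rule's left-hand side (`id` here) is the initial state.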

State Machine Example
• <id> -> a <id> | b <id> | a | b
• Two states: id (initial) and f (final)
• Example: aabba
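Because `a` and `b` each lead from id to both id and f, this machine is nondeterministic. One way to run it, sketched here under that reading of the slide, is to track the set of states the machine could be in after each symbol. The class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

final class IdMachine {
    static final String ID = "id";   // initial state
    static final String F = "f";     // final state

    // All states reachable from one state on one symbol.
    static Set<String> step(String state, char symbol) {
        Set<String> next = new HashSet<>();
        if (state.equals(ID) && (symbol == 'a' || symbol == 'b')) {
            next.add(ID);   // rules <id> -> a <id> | b <id>
            next.add(F);    // rules <id> -> a | b
        }
        return next;        // f has no outgoing arrows
    }

    // Accepts when f is among the possible states after the last symbol.
    static boolean accepts(String input) {
        Set<String> current = new HashSet<>();
        current.add(ID);
        for (char c : input.toCharArray()) {
            Set<String> next = new HashSet<>();
            for (String s : current) {
                next.addAll(step(s, c));
            }
            current = next;
        }
        return current.contains(F);
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));
        System.out.println(accepts(""));
    }
}
```

For the slide's example `aabba` the set is {id, f} after every symbol, so the string is accepted; the empty string is not, since the machine never leaves id.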

Simpler State Machine
• This is a cleaner version of the other machine. Each (character, state) combination has only one next state.
• It is called a DFA (deterministic finite automaton)
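The slide's diagram is not reproduced here, so the following is only one plausible deterministic machine for the same language (non-empty strings of `a` and `b`). The state names `START`, `ID`, and `DEAD` are assumptions. The point is that the next state is a single value, not a set.

```java
final class IdDfa {
    static final int START = 0;   // before the first symbol
    static final int ID = 1;      // at least one letter seen (final state)
    static final int DEAD = -1;   // an illegal symbol was seen

    // Exactly one next state for every (state, character) pair.
    static int next(int state, char c) {
        boolean letter = (c == 'a' || c == 'b');
        if (state == DEAD || !letter) {
            return DEAD;
        }
        return ID;
    }

    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            state = next(state, c);
        }
        return state == ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));
        System.out.println(accepts("ab1"));
    }
}
```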

Lexical Analysis for Integer Expressions

From DFA to Program
• Method doScan() reads tokens from an input stream (assume System.in for now) and creates a list of them in order.
• Method lex(s) scans and returns a single Token from a stream.
• A Token consists of a type (e.g. INT) and a string (e.g. “1234”)

Defining Constants
• //Number all the states
• public static final int NUMSTATES = 5; //START, INT, ID, UNK, ERR
• public static final int START = 0;
• public static final int INT = 1;
• public static final int ID = 2;
• public static final int UNK = 3;
• public static final int ERR = 4;

Constructing Transition Table (in constructor)
String chars = "01234abcdef+-()";
int[][] tt = new int[NUMSTATES][chars.length()]; //indexed as tt[state][index of char in chars]
tt[ID][5] = ID; // 'a'
tt[ID][6] = ID; // 'b'
tt[START][5] = ID; // 'a'
tt[START][1] = INT; // '1'
// … etc …
tt[ID][0] = ERR; // '0'
// … etc …
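Writing every entry by hand is tedious, so the table can also be filled in a loop over the character string. The sketch below assumes the transitions the slide elides: digits keep the machine in INT, letters keep it in ID, a single symbol from `+-()` leads to UNK, and everything else is ERR. Those rules, and the class name, are assumptions; only the five entries shown on the slide are confirmed by the source.

```java
import java.util.Arrays;

final class TransitionTable {
    static final int NUMSTATES = 5;
    static final int START = 0, INT = 1, ID = 2, UNK = 3, ERR = 4;
    static final String CHARS = "01234abcdef+-()";

    // Rows are states, columns are positions in CHARS.
    static int[][] build() {
        int[][] tt = new int[NUMSTATES][CHARS.length()];
        for (int[] row : tt) {
            Arrays.fill(row, ERR);              // default: no legal transition
        }
        for (int col = 0; col < CHARS.length(); col++) {
            char c = CHARS.charAt(col);
            if (Character.isDigit(c)) {
                tt[START][col] = INT;
                tt[INT][col] = INT;
            } else if (Character.isLetter(c)) {
                tt[START][col] = ID;
                tt[ID][col] = ID;
            } else {
                tt[START][col] = UNK;           // assumed: one-character symbol
            }
        }
        return tt;
    }

    public static void main(String[] args) {
        int[][] tt = build();
        System.out.println(tt[START][CHARS.indexOf('a')] == ID);
        System.out.println(tt[ID][CHARS.indexOf('0')] == ERR);
    }
}
```

The entries listed on the slide (`tt[ID][5]`, `tt[ID][6]`, `tt[START][5]`, `tt[START][1]`, `tt[ID][0]`) come out the same under these assumptions.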

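Recognizing Final States
• //For this grammar, all states but ERR are final
• //Usually, this method is a bit more complex
• //(named isFinal because final is a reserved word in Java)
• static boolean isFinal(int state){
•   return (state != ERR);
• }

Lex Method
• //Read one token from the input (any Scanner)
• //nextChar() is assumed to be a helper on the Scanner type used here; java.util.Scanner has no such method
• public static Token lex(Scanner s){
•   //initialize variables
•   StringBuilder lexeme = new StringBuilder();
•   int state = START;
•   int oldstate = START;
•   char ch = s.nextChar();
•   …

Lex Method (cont’d)
•   //loop through characters, updating state
•   while (state != ERR){
•     oldstate = state;
•     int col = chars.indexOf(ch); //-1 if ch is not in the alphabet
•     state = (col < 0) ? ERR : tt[oldstate][col];
•     if (state != ERR){ //ch belongs to this token
•       lexeme.append(ch);
•       ch = s.nextChar();
•     }
•   }
•   //ch ended the token but was already read; a full scanner must push it back or remember it

Lex Method (cont’d)
•   //return the token
•   if (isFinal(oldstate)) //valid token
•     return new Token(oldstate, lexeme.toString());
•   else //not a valid token: return the chars
•     return new Token(ERR, lexeme.toString());
• } //end of lex()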

From DFA to Program (cont’d)
public static boolean doScan(){
  Scanner s = new Scanner(System.in);
  while (s.peek()){ //not EOF; peek() is assumed, java.util.Scanner has no such method
    eatWhitespace(s); //removes whitespace
    Token token = lex(s);
    tokens.add(token); //tokens: the list of Tokens kept by the class
    if (token.getType() == ERR) return false;
  }
  return true;
}
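Putting the pieces from these slides together, here is a self-contained sketch of a table-driven lexer. It departs from the slides in a few labeled ways: it reads from a `String` with an index instead of a `Scanner`, so the character that ends a token is never lost; it treats START as non-final, so an unrecognized character yields an ERR token; and the elided table entries are filled in by assumption (digits to INT, letters to ID, single symbols to UNK). The class name `TableLexer` is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TableLexer {
    static final int NUMSTATES = 5;
    static final int START = 0, INT = 1, ID = 2, UNK = 3, ERR = 4;
    static final String CHARS = "01234abcdef+-()";
    static final int[][] TT = buildTable();

    static final class Token {
        final int type;
        final String lexeme;
        Token(int type, String lexeme) { this.type = type; this.lexeme = lexeme; }
        @Override public String toString() { return type + ":" + lexeme; }
    }

    // Assumed transitions; only a few entries appear on the slides.
    static int[][] buildTable() {
        int[][] tt = new int[NUMSTATES][CHARS.length()];
        for (int[] row : tt) Arrays.fill(row, ERR);
        for (int col = 0; col < CHARS.length(); col++) {
            char c = CHARS.charAt(col);
            if (Character.isDigit(c)) { tt[START][col] = INT; tt[INT][col] = INT; }
            else if (Character.isLetter(c)) { tt[START][col] = ID; tt[ID][col] = ID; }
            else { tt[START][col] = UNK; }
        }
        return tt;
    }

    private final String input;
    private int pos = 0;

    TableLexer(String input) { this.input = input; }

    // Scans one token; leaves pos at the first character not consumed.
    Token lex() {
        StringBuilder lexeme = new StringBuilder();
        int state = START;
        while (pos < input.length()) {
            int col = CHARS.indexOf(input.charAt(pos));
            int next = (col < 0) ? ERR : TT[state][col];
            if (next == ERR) break;              // this character is not part of the token
            lexeme.append(input.charAt(pos));
            state = next;
            pos++;
        }
        if (state == START) {                    // nothing matched: consume the bad character
            String bad = pos < input.length() ? String.valueOf(input.charAt(pos++)) : "";
            return new Token(ERR, bad);
        }
        return new Token(state, lexeme.toString());
    }

    // Collects tokens in order and stops after the first ERR token.
    static List<Token> doScan(String text) {
        TableLexer lexer = new TableLexer(text);
        List<Token> tokens = new ArrayList<>();
        while (true) {
            while (lexer.pos < text.length() && Character.isWhitespace(text.charAt(lexer.pos))) {
                lexer.pos++;                     // eat whitespace between tokens
            }
            if (lexer.pos >= text.length()) break;
            Token t = lexer.lex();
            tokens.add(t);
            if (t.type == ERR) break;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(doScan("12+(ab-3)"));
    }
}
```

With this table, `doScan("12+(ab-3)")` yields seven tokens: INT `12`, UNK `+`, UNK `(`, ID `ab`, UNK `-`, INT `3`, UNK `)`.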

Another Program (pp. 176-181)
• Programmed in C (no classes)
• Global variables instead of class variables (used in many functions, e.g. charClass)
• Token (int) and lexeme (string) unconnected
• States and transitions are implicit
• Lex() is a big case statement
• Many special purpose functions, e.g. getChar(), addChar(), lookup() executing portions of DFA
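For contrast with the table-driven version, the style described on this slide can be sketched as follows. This is not the textbook's program: it is written in Java with static fields standing in for C globals, the token codes are made-up values, `start()` is a hypothetical helper, and `lookup()` is left out (every non-letter, non-digit character becomes a generic SYMBOL token). The states are implicit in which `case` is running.

```java
final class SwitchLexer {
    // Character classes
    static final int LETTER = 0, DIGIT = 1, UNKNOWN = 2, EOF = -1;
    // Token codes (hypothetical values)
    static final int INT_LIT = 10, IDENT = 11, SYMBOL = 12, END = 13;

    // "Global" variables shared by all functions, as in the C style.
    static String input;
    static int pos;
    static int charClass;
    static char nextChar;
    static StringBuilder lexeme = new StringBuilder();
    static int nextToken;

    // Reads the next character and classifies it.
    static void getChar() {
        if (pos < input.length()) {
            nextChar = input.charAt(pos++);
            if (Character.isLetter(nextChar)) charClass = LETTER;
            else if (Character.isDigit(nextChar)) charClass = DIGIT;
            else charClass = UNKNOWN;
        } else {
            charClass = EOF;
        }
    }

    // Appends the current character to the lexeme.
    static void addChar() {
        lexeme.append(nextChar);
    }

    // One big case statement; each case executes a portion of the DFA.
    static int lex() {
        lexeme.setLength(0);
        while (charClass != EOF && Character.isWhitespace(nextChar)) getChar();
        switch (charClass) {
            case LETTER:
                do { addChar(); getChar(); } while (charClass == LETTER);
                nextToken = IDENT;
                break;
            case DIGIT:
                do { addChar(); getChar(); } while (charClass == DIGIT);
                nextToken = INT_LIT;
                break;
            case UNKNOWN:
                addChar();
                getChar();
                nextToken = SYMBOL;
                break;
            default:
                nextToken = END;
        }
        return nextToken;
    }

    // Hypothetical helper: loads the input and primes the first character.
    static void start(String text) {
        input = text;
        pos = 0;
        getChar();
    }

    public static void main(String[] args) {
        start("ab 12+");
        while (lex() != END) {
            System.out.println(nextToken + " " + lexeme);
        }
    }
}
```

Note that the token code (`nextToken`) and the lexeme are separate variables here, which is the "unconnected" arrangement the slide contrasts with a `Token` object.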
