Understand lexical analysis as pattern matching in programming, transforming sequences of characters to lexemes with regular grammars and state machines. Learn constructing transition tables and recognizing final states.
Lexical Analysis (4.2) Programming Languages Hiram College Ellen Walker
Lexical Analysis is Pattern Matching • From a sequence of characters to a sequence of lexemes, e.g. • “public static void main(char[] args)” -> • <id> <id> <id> <id> <lparen> <id> <lsquare> <rsquare> <id> <rparen> • Patterns are simpler (easy grammars), e.g. • <id> -> <letter> <id> | <letter> • <letter> -> a | b | c | … | z
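As a rough illustration of the mapping above, here is a minimal hand-written Java sketch. The class name `TokenCategories`, the method `categorize`, and the `<unknown>` category are assumptions made for this example; letters are restricted to a..z to match the `<letter>` rule.

```java
import java.util.ArrayList;
import java.util.List;

class TokenCategories {
    // Matches the <letter> rule: a | b | c | ... | z
    static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }

    // Returns the token category of each lexeme in the input, in order.
    static List<String> categorize(String input) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                                  // whitespace only separates lexemes
            } else if (isLetter(c)) {
                while (i < input.length() && isLetter(input.charAt(i))) i++;
                out.add("<id>");                      // a maximal run of letters is one <id>
            } else {
                switch (c) {
                    case '(': out.add("<lparen>"); break;
                    case ')': out.add("<rparen>"); break;
                    case '[': out.add("<lsquare>"); break;
                    case ']': out.add("<rsquare>"); break;
                    default:  out.add("<unknown>");   // hypothetical catch-all category
                }
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(String.join(" ",
            categorize("public static void main(char[] args)")));
    }
}
```

Run on the slide's input, `categorize` produces the ten categories listed above, in the same order.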
Regular Grammars • Subset of Context Free Grammars • The right-hand side of every rule contains at most one non-terminal symbol (or the grammar can be rewritten so that it does…)
Rewritten Grammar for ID • Original: • <id> -> <letter> <id> | <letter> • <letter> -> a | b | c | … | z • Rewrite: <id> -> (a | b | c | … | z) <id> | (a | b | c | … | z) • Fully expanded (52 rules): <id> -> a <id> | b <id> | c <id> | … | z <id> | a | b | c | … | z
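Because each rule has at most one nonterminal, and it comes last, the grammar can be checked by consuming one letter and then recursing on the rest of the string. The following is only a sketch of that idea; the names `IdGrammar` and `isId` are made up for illustration.

```java
class IdGrammar {
    // <letter> -> a | b | c | ... | z
    static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }

    // <id> -> <letter> <id> | <letter>
    static boolean isId(String s) {
        // Both alternatives begin with a letter, so anything else fails at once.
        if (s.isEmpty() || !isLetter(s.charAt(0))) return false;
        if (s.length() == 1) return true;      // <id> -> <letter>
        return isId(s.substring(1));           // <id> -> <letter> <id>
    }

    public static void main(String[] args) {
        System.out.println(isId("main"));      // a string of lowercase letters
        System.out.println(isId("x1"));        // contains a non-letter
    }
}
```

The recursion never needs to remember more than "still inside an `<id>`", which is what makes a finite state machine sufficient.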
Parsing using a Regular Grammar • Transform the grammar into a state machine • Implement the state machine in a computer program • By hand • Automatically, using table-lookup • Run this program on input strings
What is a State Machine? • State machine abstraction • At any time, the process is in a “state” • Each time an “event” happens, the process takes an “action” and goes to the next state • We can describe the entire algorithm as a diagram where each state has an arrow for each event/action pair to the next appropriate state
State Machine for a Kitten • [Diagram] States: Happy, Sleeping, Hungry • Transitions (event / action): Food available / Eat; Toys available / Play; X hrs passed / Awaken
State Machine for a Language • Each “event” processes an input symbol • Two important special states • Initial state: state the machine is in before the first symbol • Final state: state the machine is in whenever the sequence of symbols up to now is in the language
Transforming a Regular Grammar to a State Machine • Put the grammar into a form so every rule is <nonterm1> -> symbol <nonterm2> or <nonterm1> -> symbol • Make a state for each nonterminal • Make a transition (arrow) for each rule. The transition goes from <nonterm1> to <nonterm2> based on the symbol. • The state for the grammar’s start symbol is the initial state. • There is one final state; every rule with no nonterminal on its right-hand side goes to it.
State Machine Example • <id> -> a <id> | b <id> | a | b • Two states: id (initial) and f (final) • Example: aabba
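The two-state machine above is nondeterministic: from id, the symbol a can lead back to id (rule a <id>) or on to f (rule a). One way to run such a machine, sketched here with the hypothetical class `IdNfa`, is to track every state it could currently be in.

```java
class IdNfa {
    // Simulates the machine for <id> -> a <id> | b <id> | a | b.
    static boolean accepts(String input) {
        boolean inId = true;    // the machine starts in state id
        boolean inF = false;    // f is the single final state
        for (char c : input.toCharArray()) {
            boolean symbolOk = (c == 'a' || c == 'b');
            // From id, a or b leads to id (rules a <id>, b <id>) and also to f (rules a, b).
            boolean nextId = inId && symbolOk;
            boolean nextF = inId && symbolOk;
            // f has no outgoing arrows, so nothing carries over from f.
            inId = nextId;
            inF = nextF;
        }
        return inF;             // accept if f is among the possible states at the end
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));  // the slide's example string
    }
}
```

On the slide's example aabba, both flags are true after every symbol, so the machine ends with f among its possible states and accepts.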
Simpler State Machine • This is a cleaner version of the previous machine. Each (character, state) combination has only one next state. • It is called a DFA (deterministic finite automaton)
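The slide's diagram is not reproduced in this text, so the sketch below uses an assumed three-state DFA (START, ACCEPT, DEAD) that accepts the same strings as the grammar <id> -> a <id> | b <id> | a | b. The state names and the class `IdDfa` are illustrative, not taken from the slide. Since every (state, character) pair has exactly one next state, a single table lookup per character is enough.

```java
class IdDfa {
    static final int START = 0, ACCEPT = 1, DEAD = 2;

    // NEXT[state][symbol], where symbol 0 is 'a' and symbol 1 is 'b'.
    static final int[][] NEXT = {
        {ACCEPT, ACCEPT},   // START:  any a or b begins an id
        {ACCEPT, ACCEPT},   // ACCEPT: further a's and b's extend it
        {DEAD,   DEAD},     // DEAD:   no way back once an illegal symbol was seen
    };

    static boolean accepts(String input) {
        int state = START;
        for (char c : input.toCharArray()) {
            int symbol = "ab".indexOf(c);                     // -1 for any other character
            state = (symbol < 0) ? DEAD : NEXT[state][symbol];
        }
        return state == ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(accepts("aabba"));
    }
}
```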
From DFA to Program • Method doScan() reads tokens from an input stream (assume System.in for now) and creates a list of them in order. • Method lex(s) scans and returns a single Token from a stream. • A Token consists of a type (e.g. INT) and a string (e.g. “1234”)
Defining Constants • //Number all the states • public static final int NUMSTATES = 5; //states 0 through 4 • public static final int START = 0; • public static final int INT = 1; • public static final int ID = 2; • public static final int UNK = 3; • public static final int ERR = 4;
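Constructing Transition Table (in constructor) • String chars = "01234abcdef+-()"; • //rows are states, columns are positions in chars • int[][] tt = new int[NUMSTATES][chars.length()]; • tt[ID][5] = ID; // 'a' • tt[ID][6] = ID; // 'b' • tt[START][5] = ID; // 'a' • tt[START][1] = INT; // '1' • // … etc … • tt[ID][0] = ERR; // '0' • // … etc … • //entries never assigned stay 0 (START), so every entry must be set explicitly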
Recognizing Final States • //For this grammar, all states but ERR are final • //Usually, this method is a bit more complex • //Named isFinal because final is a reserved word in Java • static boolean isFinal(int state){ • return (state != ERR); • }
Lex Method • //Read one token from the input (any Scanner) • //Assumes a Scanner class with a nextChar() method; java.util.Scanner has none • public static Token lex(Scanner s){ • //initialize variables • StringBuilder lexeme = new StringBuilder(); • int state = START; • int oldstate = START; • char ch = s.nextChar(); • …
Lex Method (cont’d) • //loop through characters, updating state • while (state != ERR){ • oldstate = state; • int col = chars.indexOf(ch); //-1 if ch is not in chars • state = (col < 0) ? ERR : tt[oldstate][col]; • if (state != ERR){ //ch extends the token • lexeme.append(ch); • ch = s.nextChar(); • } • } • //ch now holds the first character after the token
Lex Method (cont’d) • //return the token • if (isFinal(oldstate)) //valid token • return new Token(oldstate, lexeme.toString()); • else //not a valid token – return the chars • return new Token(ERR, lexeme.toString()); • } //end of lex()
From DFA to Program (cont’d) • public static boolean doScan(){ • Scanner s = new Scanner(System.in); • while (s.peek()){ //not EOF • eatWhitespace(s); //removes whitespace • Token token = lex(s); • tokens.add(token); • if (token.getType() == ERR) • return false; • } • return true; • }
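The slides leave parts of the lexer elided, so here is one self-contained sketch of how the pieces might fit together. It makes several assumptions that are not in the slides: input comes from a String with an index rather than from the slides' Scanner; table entries default to ERR; digits extend an INT; each of + - ( ) is a one-character UNK token; and START is treated as non-final so that an unrecognized character yields an ERR token. The names `TableLexer` and `scan` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TableLexer {
    static final int NUMSTATES = 5;
    static final int START = 0, INT = 1, ID = 2, UNK = 3, ERR = 4;
    static final String CHARS = "01234abcdef+-()";
    static final int[][] TT = new int[NUMSTATES][CHARS.length()];

    static {
        for (int[] row : TT) Arrays.fill(row, ERR);   // assumption: no transition unless set
        for (int col = 0; col < CHARS.length(); col++) {
            char c = CHARS.charAt(col);
            if (Character.isDigit(c)) {
                TT[START][col] = INT;
                TT[INT][col] = INT;                   // assumption: digits extend an INT
            } else if (Character.isLetter(c)) {
                TT[START][col] = ID;
                TT[ID][col] = ID;
            } else {
                TT[START][col] = UNK;                 // assumption: one-character UNK tokens
            }
        }
    }

    static final class Token {
        final int type;
        final String text;
        Token(int type, String text) { this.type = type; this.text = text; }
    }

    private final String input;
    private int pos = 0;

    TableLexer(String input) { this.input = input; }

    // Reads one token starting at pos; pos ends on the first unconsumed character.
    Token lex() {
        StringBuilder lexeme = new StringBuilder();
        int state = START;
        while (pos < input.length()) {
            int col = CHARS.indexOf(input.charAt(pos));
            int next = (col < 0) ? ERR : TT[state][col];
            if (next == ERR) break;                   // this character cannot extend the token
            lexeme.append(input.charAt(pos++));
            state = next;
        }
        if (state == START) {                         // nothing matched: report the bad character
            if (pos < input.length()) lexeme.append(input.charAt(pos++));
            return new Token(ERR, lexeme.toString());
        }
        return new Token(state, lexeme.toString());
    }

    // Collects tokens in order, stopping after the first ERR token (as doScan does).
    static List<Token> scan(String text) {
        TableLexer lx = new TableLexer(text);
        List<Token> tokens = new ArrayList<>();
        while (true) {
            while (lx.pos < text.length() && Character.isWhitespace(text.charAt(lx.pos))) lx.pos++;
            if (lx.pos >= text.length()) return tokens;
            Token t = lx.lex();
            tokens.add(t);
            if (t.type == ERR) return tokens;
        }
    }

    public static void main(String[] args) {
        for (Token t : scan("ab 12+(cd)")) System.out.println(t.type + " " + t.text);
    }
}
```

Keeping the position inside the lexer object means the character that ends one token is still available as the first character of the next, which a local lookahead variable would lose.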
Another Program (pp. 176-181) • Programmed in C (no classes) • Global variables instead of class variables (used in many functions, e.g. charClass) • Token (int) and lexeme (string) unconnected • States and transitions are implicit • Lex() is a big case statement • Many special purpose functions, e.g. getChar(), addChar(), lookup() executing portions of DFA
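For contrast with the table-driven version, the style described above can be sketched in Java as follows. This is not the textbook's C code: static fields stand in for C globals, the token codes are arbitrary values chosen for illustration, and the input is read from a String. Only the function names getChar(), addChar(), lookup(), and the charClass variable come from the slide.

```java
class SwitchLexer {
    // Character classes and token codes (numeric values are illustrative assumptions).
    static final int LETTER = 0, DIGIT = 1, UNKNOWN = 99, EOF = -1;
    static final int INT_LIT = 10, IDENT = 11, ADD_OP = 21, SUB_OP = 22,
                     LEFT_PAREN = 25, RIGHT_PAREN = 26;

    // "Global" state shared by all functions below.
    static String input;
    static int pos;
    static int charClass;
    static char nextChar;
    static int nextToken;                                  // token code, kept apart from lexeme
    static StringBuilder lexeme = new StringBuilder();

    static void init(String text) { input = text; pos = 0; getChar(); }

    // Reads the next character and records its class in charClass.
    static void getChar() {
        if (pos >= input.length()) { charClass = EOF; return; }
        nextChar = input.charAt(pos++);
        if (Character.isLetter(nextChar)) charClass = LETTER;
        else if (Character.isDigit(nextChar)) charClass = DIGIT;
        else charClass = UNKNOWN;
    }

    static void addChar() { lexeme.append(nextChar); }

    // Maps a single-character lexeme to its token code.
    static int lookup(char ch) {
        addChar();
        switch (ch) {
            case '(': return LEFT_PAREN;
            case ')': return RIGHT_PAREN;
            case '+': return ADD_OP;
            case '-': return SUB_OP;
            default:  return UNKNOWN;
        }
    }

    // The states are implicit: each case consumes a whole token with its own loop.
    static int lex() {
        lexeme.setLength(0);
        while (charClass != EOF && Character.isWhitespace(nextChar)) getChar();
        switch (charClass) {
            case LETTER:
                do { addChar(); getChar(); } while (charClass == LETTER || charClass == DIGIT);
                nextToken = IDENT;
                break;
            case DIGIT:
                do { addChar(); getChar(); } while (charClass == DIGIT);
                nextToken = INT_LIT;
                break;
            case UNKNOWN:
                nextToken = lookup(nextChar);
                getChar();
                break;
            default:                                       // EOF
                nextToken = EOF;
                lexeme.append("EOF");
        }
        return nextToken;
    }
}
```

After `init("sum + 47")`, successive calls to `lex()` return IDENT, ADD_OP, INT_LIT, and then EOF, with `lexeme` holding the matching text each time.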