Discussion #1 Finite State Machines & Regular Expressions


Topics
• Compilers and Interpreters
• Lexical Analyzers
• Regular Expressions
• Finite State Machines & Finite State Transducers
• Project 1

Compilers for Programming Languages
[Figure: compiler/interpreter pipeline. Labels: Program Code → Lexical Analyzer → Tokens (keywords, string literals, variables, …) → Parser (syntax analysis, error messages) → Internal Data → Code Generator → Code (compiler), or Interpreter (executed directly).]

Series of 5 Projects: Datalog Interpreter

Project 1: Lexical Analyzer

The Point of CS 236
• Use mathematics to write better code.
  • In Project 1: some sample code to help get started
  • In later projects: continue this process independently
• Project 1: Use a Finite State Machine to write a Lexical Analyzer.
  • Lexical analyzers can identify patterns of text to be turned into tokens.
  • Regular expressions also identify patterns of text and are equivalent in pattern recognition power.
  • We’ll start with regular expressions, which more intuitively identify text patterns, and then return to finite state machines, which more directly correspond to the code we need to write to identify text patterns in our lexical analyzer.

Regular Expressions
• Pattern description for strings
• Standard patterns:
  • Concatenation: abc matches …abc… but not …abdc… or …ac…
  • Boolean or: ab|ac matches …ab… and also …ac… but not …cba… or …bc…
  • Kleene closure: ab* matches …a… and …ab… and …abb… and …
• Common shorthand patterns:
  • Optional: ab?c matches …ac… and …abc… but not …abbc…; short for ac|abc
  • One or more: ab+ matches …ab… and …abb… and … but not …a…; short for abb*
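The patterns above can be tried out with the C++ standard library. This is a minimal sketch, not part of the Project 1 sample code: containsMatch is a hypothetical helper name, and it relies on std::regex_search so that a pattern may match anywhere inside the text, as the …abc… notation suggests.

```cpp
#include <regex>
#include <string>

// Returns true if `pattern` matches somewhere inside `text`,
// mirroring the "…abc…" notation used on the slide.
// (containsMatch is a hypothetical helper, not part of Project 1.)
bool containsMatch(const std::string& pattern, const std::string& text) {
    // std::regex defaults to the ECMAScript grammar, which supports
    // concatenation, |, *, ?, and + with the meanings listed above.
    return std::regex_search(text, std::regex(pattern));
}
```

For instance, containsMatch("ab?c", "abc") is true while containsMatch("ab?c", "abbc") is false, and containsMatch("ab+", "a") is false while containsMatch("ab*", "a") is true.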

Regular Expressions & Parens
• Parens group regular expressions as expected
• Examples:
  • (a|b)c matches …ac… and …bc…
  • (a|b)*c matches …c… and …ac… and …bac… and …ababababbbabbabaaaababaababbbbc… and …
  • (a|b)?c matches …c… and …ac… and …bc…
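To see what the parens change, it helps to test whole strings rather than substrings. The sketch below is only an illustration: matchesWhole is a hypothetical helper name, and it uses std::regex_match, which requires the pattern to cover the entire input.

```cpp
#include <regex>
#include <string>

// Returns true only if `pattern` matches the entire string `text`.
// (matchesWhole is a hypothetical helper, not part of Project 1.)
bool matchesWhole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}
```

With this helper, matchesWhole("(a|b)c", "ac") is true, but matchesWhole("a|bc", "ac") is false: without parens, | separates a from bc, so only "a" and "bc" match.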

Regular Expression Extensions
• Additional shorthand and notation
  • [ABC] = A|B|C
  • [A-Za-z] = A|B|…|Z|a|b|…|z
  • [A-Za-z]{4,7} matches any 4-7 letter sequence, e.g. …McKay…
  • \ is an escape character: \* matches …*… and \, matches …,…
• Special characters:
  • Digit: \d
  • Word boundary: \b
• Languages and language extensions/packages
  • Perl
  • Java regular-expression packages
• Regular expression testers:
  • RegExr
  • regexpal
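The extensions above are also available in C++ through std::regex with its default ECMAScript grammar. The following is a small sketch under that assumption; every function name is hypothetical and chosen only for illustration. Raw string literals (R"(...)") keep the backslashes readable.

```cpp
#include <regex>
#include <string>

// True if the whole string is a run of 4 to 7 letters: [A-Za-z]{4,7}
bool isFourToSevenLetters(const std::string& s) {
    static const std::regex pattern("[A-Za-z]{4,7}");
    return std::regex_match(s, pattern);
}

// True if the text contains a digit anywhere: \d
bool containsDigit(const std::string& s) {
    static const std::regex pattern(R"(\d)");
    return std::regex_search(s, pattern);
}

// True if "cat" appears as a separate word: \b marks a word boundary.
bool containsWordCat(const std::string& s) {
    static const std::regex pattern(R"(\bcat\b)");
    return std::regex_search(s, pattern);
}

// True if the text contains a literal asterisk: \* escapes the Kleene star.
bool containsStar(const std::string& s) {
    static const std::regex pattern(R"(\*)");
    return std::regex_search(s, pattern);
}
```

For example, isFourToSevenLetters("McKay") is true, and containsWordCat("concatenate") is false because the letters around "cat" leave no word boundary.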

Regular Expressions & Finite State Machines
• abc
• a(b|c)
• ab*
• (a(b?c))+
[Figure: one state diagram per regular expression above, with transitions labeled a, b, and c.]
Note the special double-circle designation of an accepting state.
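As one way to read such a diagram as code, here is a hand-written machine for the last expression, (a(b?c))+. It is a sketch only: the state names and the explicit Dead state are assumptions made for illustration, with Accept playing the role of the double-circle state.

```cpp
#include <string>

// States for a machine that accepts exactly the strings described by (a(b?c))+.
// The names are illustrative; Accept is the double-circle (accepting) state.
enum class S { Start, SawA, SawAB, Accept, Dead };

// One transition: current state plus next character gives the next state.
S step(S state, char c) {
    switch (state) {
        case S::Start:  return c == 'a' ? S::SawA : S::Dead;
        case S::SawA:   return c == 'b' ? S::SawAB
                             : (c == 'c' ? S::Accept : S::Dead);
        case S::SawAB:  return c == 'c' ? S::Accept : S::Dead;
        case S::Accept: return c == 'a' ? S::SawA : S::Dead;  // + loops back for another a(b?c)
        case S::Dead:   return S::Dead;                        // no way out once rejected
    }
    return S::Dead;
}

// Runs the machine over the whole input and reports whether it ends in Accept.
bool accepts(const std::string& input) {
    S state = S::Start;
    for (char c : input) state = step(state, c);
    return state == S::Accept;
}
```

Here accepts("acabc") is true, while accepts("abbc") and accepts("") are both false.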

Formal Definition of a Finite State Machine & a Finite State Transducer
A deterministic finite state machine is a quintuple (Σ, S, s0, δ, F), where:
• Σ is the input alphabet (a finite, non-empty set of symbols).
• S is a finite, non-empty set of states.
• s0 is an initial state, an element of S.
• δ is the state-transition function: δ : S × Σ → S.
• F is the set of final states, a (possibly empty) subset of S.
A finite state transducer is a 6-tuple (Σ, Γ, S, s0, δ, F) as above except:
• Γ is the output alphabet (a finite, non-empty set of symbols).
• δ is the state-transition function: δ : S × Σ → S × Γ.
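The quintuple can be written down almost literally as a data structure. The sketch below is one possible encoding, not the course's: the Dfa struct, its field names, and makeAbStar are all hypothetical. States are plain integers, and the example machine recognizes ab* over Σ = {a, b}, using state 2 as a dead state so that δ stays a total function.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// A deterministic finite state machine as the quintuple (Σ, S, s0, δ, F).
struct Dfa {
    std::set<char> alphabet;                     // Σ
    std::set<int> states;                        // S
    int start;                                   // s0
    std::map<std::pair<int, char>, int> delta;   // δ : S × Σ → S
    std::set<int> finals;                        // F
};

// Feeds the input through δ one symbol at a time.
bool accepts(const Dfa& m, const std::string& input) {
    int state = m.start;
    for (char c : input) {
        if (m.alphabet.count(c) == 0) return false;  // symbol outside Σ: reject
        state = m.delta.at({state, c});              // δ must be defined on all of S × Σ
    }
    return m.finals.count(state) > 0;
}

// Example machine for ab*: 0 = start, 1 = accepting, 2 = dead state.
Dfa makeAbStar() {
    return Dfa{
        {'a', 'b'},
        {0, 1, 2},
        0,
        {{{0, 'a'}, 1}, {{0, 'b'}, 2},
         {{1, 'a'}, 2}, {{1, 'b'}, 1},
         {{2, 'a'}, 2}, {{2, 'b'}, 2}},
        {1}
    };
}
```

With this machine, accepts(makeAbStar(), "abbb") is true and accepts(makeAbStar(), "abab") is false. Rejecting symbols outside Σ is a choice made here for convenience; the formal definition simply does not define δ on them.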

Project 1: Lexical Analyzer

Basic FSM for Project 1
[Figure: state diagram. States: start, string, quote, u_eof, <, <=, white space, ident. or keywd., eof, undef. Transition labels: ‘; <character (except ‘ and <eof>)>; <eof>; <; =; <space> | <tab> | <cr>; <letter>; <letter> | <digit>; <any other char>; …]
Special check for Keywords (Schemes, Facts, Rules, Queries)

Get the Design Right
Code must directly represent a state machine:
• Σ: Set of characters (the keyboard character set)
• S: Set of states (enum)
• s0: An initial state (one of the states in the set of states)
• δ : S × Σ → S × Γ: Transition function δ for each state:
  • Input: the current state and the next character
  • Output:
    • the next state
    • a TokenType (if the current token is now complete), or null (if the current token is incomplete)
• State machine loop:
  • Evaluates state transitions
  • Builds and emits tokens
  • Dirty work: discards whitespace tokens, tracks line numbers, etc.
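The design above can be sketched as a transition function that returns both the next state and an optional token, plus a loop that drives it. This is a minimal illustration, not the Project 1 solution: St, Tok, Step, delta, and tokenize are hypothetical names, only four token kinds are handled, Tok::None stands in for "null", and a lone ':' is emitted as a token rather than treated as an error.

```cpp
#include <cstddef>
#include <string>
#include <vector>

enum class St { Start, SawColon };                                   // S
enum class Tok { None, Comma, Period, Colon, ColonDash, Undefined }; // Γ, with None as "null"

struct Step {
    St next;        // the next state
    Tok emitted;    // a completed token, or Tok::None if the token is incomplete
    bool consumed;  // false: look at this character again from the next state
};

// δ : S × Σ → S × Γ
Step delta(St state, char c) {
    switch (state) {
        case St::Start:
            if (c == ',') return {St::Start, Tok::Comma, true};
            if (c == '.') return {St::Start, Tok::Period, true};
            if (c == ':') return {St::SawColon, Tok::None, true};
            if (c == ' ' || c == '\t' || c == '\n' || c == '\r')
                return {St::Start, Tok::None, true};   // whitespace is discarded
            return {St::Start, Tok::Undefined, true};
        case St::SawColon:
            if (c == '-') return {St::Start, Tok::ColonDash, true};
            return {St::Start, Tok::Colon, false};     // ':' alone; reprocess c from Start
    }
    return {St::Start, Tok::Undefined, true};
}

// State machine loop: evaluates transitions and collects emitted tokens.
std::vector<Tok> tokenize(const std::string& text) {
    std::vector<Tok> out;
    St state = St::Start;
    std::size_t i = 0;
    while (i < text.size()) {
        Step s = delta(state, text[i]);
        if (s.emitted != Tok::None) out.push_back(s.emitted);
        if (s.consumed) ++i;
        state = s.next;
    }
    if (state == St::SawColon) out.push_back(Tok::Colon);  // input ended right after ':'
    return out;
}
```

Calling tokenize(":- , .") yields ColonDash, Comma, Period. Line-number tracking is left out to keep the sketch short; it would belong in the loop, alongside the whitespace handling.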

State.cpp: List of States
…
enum State {
    Comma, Period, SawColon, Colon_Dash,
    SawAQuote, ProcessingString, PossibleEndOfString,
    Start, End
};
…

Lex.cpp: State Initialization/Termination
void Lex::generateTokens(Input* input) {
    tokens = new vector<Token*>();
    index = 0;

    state = Start;
    while(state != End) {
        state = nextState();
    }
}

Lex.cpp: State Transition Function
…
State Lex::nextState() {
    State result;
    char character;
    switch(state) {
        case Start:
            result = getNextState();
            break;
        case Comma:
            emit(COMMA);
            result = getNextState();
            break;
        case Period:
            emit(PERIOD);
            result = getNextState();
            break;
        case SawColon:
            character = input->getCurrentCharacter();
            if(character == '-') {
                result = Colon_Dash;
                input->advance();
            } else { //Every other character
                // Build the message as a string: adding a char to a string
                // literal would do pointer arithmetic, not concatenation.
                // Note that this throws a string, not a const char*.
                throw string("ERROR:: in case SawColon:, Expecting '-' but found ") + character + '.';
            }
            break;
        case Colon_Dash:
            emit(COLON_DASH);
            result = getNextState();
            break;
        case SawAQuote:
            character = input->getCurrentCharacter();
            if(character == '\'') {
                result = PossibleEndOfString;
            } else if(character == -1) { // -1 is the end-of-input sentinel (assumes char is signed)
                throw "ERROR:: in Saw_A_Quote::nextState, reached EOF before end of string.";
            } else { //Every other character
                result = ProcessingString;
            }
            input->advance();
            break;
        case ProcessingString:
            character = input->getCurrentCharacter();
            if(character == '\'') {
                result = PossibleEndOfString;
            } else if(character == -1) { // end of input inside a string
                throw "ERROR:: in ProcessingString::nextState, reached EOF before end of string.";
            } else { //Every other character
                result = ProcessingString;
            }
            input->advance();
            break;
        case PossibleEndOfString:
            if(input->getCurrentCharacter() == '\'') {
                // Two quotes in a row: the string continues
                input->advance();
                result = ProcessingString;
            } else { //Every other character
                emit(STRING);
                result = getNextState();
            }
            break;
        case End:
            throw "ERROR:: in End state:, the Input should be empty once you reach the End state.";
    }
    return result;
}
…

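Lex.cpp: Get Next State for State Transition Function
State Lex::getNextState() {
    State result;
    char currentCharacter = input->getCurrentCharacter();

    //The handling of checking for whitespace and setting the result to Whitespace and
    //checking for letters and setting the result to Id will probably best be handled by
    //if statements rather than the switch statement.
    switch(currentCharacter) {
        case ',' : result = Comma; break;
        case '.' : result = Period; break;
        case ':' : result = SawColon; break;
        case '\'' : result = ProcessingString; break;
        case -1 : result = End; break; // end-of-input sentinel (assumes char is signed)
        default:
            string error = "ERROR:: in Lex::getNextState, Expecting ";
            error += "'\'', '.', '?', '(', ')', '+', '*', '=', '!', '<', '>', ':' but found ";
            error += currentCharacter;
            error += '.';
            // Throw the string itself: error.c_str() would point into a local
            // that is destroyed while the exception propagates.
            throw error;
    }
    input->advance();
    return result;
}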

Lex.cpp: Emit for State Transition Function
void Lex::emit(TokenType tokenType) {
    Token* token = new Token(tokenType, input->getTokensValue(), input->getCurrentTokensLineNumber());
    storeToken(token);
    input->mark();
}

TokenType.cpp: Turns the Token Type into a String for Output
string TokenTypeToString(TokenType tokenType){
    string result = "";
    switch(tokenType){
        case COMMA: result = "COMMA"; break;
        case PERIOD: result = "PERIOD"; break;
        case COLON_DASH: result = "COLON_DASH"; break;
        case STRING: result = "STRING"; break;
        case NUL: result = "NUL"; break;
    }
    return result;
}
