Lexical Analysis: Regular Expressions
CS 671, January 22, 2008
Last Time …
• A program that translates a program in one language to another language
• the essential interface between applications & architectures
• Typically lowers the level of abstraction
• analyzes and reasons about the program & architecture
• We expect the program to be optimized, i.e., better than the original
• ideally exploiting architectural strengths and hiding weaknesses
[Diagram: High-Level Programming Languages → Compiler → Machine Code, with Error Messages reported back]
Phases of a Compiler
• Lexical Analyzer
• Groups sequences of characters into lexemes – the smallest meaningful entities in a language (keywords, identifiers, constants)
• Characters read from a file are buffered – this helps decrease latency due to I/O; the lexical analyzer manages the buffer
• Makes use of the theory of regular languages and finite state machines
• Lex and Flex are tools that construct lexical analyzers from regular expression specifications
[Pipeline diagram: Source program → Lexical analyzer → Syntax analyzer → Semantic analyzer → Intermediate code generator → Code optimizer → Code generator → Target program]
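A minimal C sketch of the kind of buffered reading the lexer might manage (the buffer size and the nextchar name are illustrative, not taken from any particular lexer):

#include <stdio.h>

#define BUFSIZE 4096                      /* illustrative buffer size */

static char buf[BUFSIZE];
static size_t pos = 0, len = 0;

/* Return the next character of fp, refilling the buffer only when it
   runs dry, so most calls involve no I/O at all. */
static int nextchar(FILE *fp) {
    if (pos == len) {
        len = fread(buf, 1, sizeof buf, fp);
        pos = 0;
        if (len == 0) return EOF;         /* end of file (or read error) */
    }
    return (unsigned char) buf[pos++];
}

int main(void) {
    int c;
    while ((c = nextchar(stdin)) != EOF)  /* echo stdin through the buffer */
        putchar(c);
    return 0;
}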
Phases of a Compiler
• Parser
• Converts a linear structure – a sequence of tokens – into a hierarchical, tree-like structure – an AST
• The parser imposes the syntax rules of the language
• Work should be linear in the size of the input (else unusable)
• Type consistency cannot be checked in this phase
• Deterministic context-free languages and pushdown automata form the basis
• Bison and yacc allow a user to construct parsers from CFG specifications
Phases of a Compiler
• Semantic Analysis
• Calculates the program’s “meaning”
• Rules of the language are checked (variable declaration, type checking)
• Type checking is also needed for code generation (code gen for a + b depends on the types of a and b)
Phases of a Compiler
• Intermediate Code Generation
• Makes it easy to port the compiler to other architectures (e.g., Pentium to MIPS)
• Can also be the basis for interpreters (such as in Java)
• Enables optimizations that are not machine specific
Phases of a Compiler
• Intermediate Code Optimization
• Constant propagation, dead code elimination, common sub-expression elimination, strength reduction, etc.
• Based on dataflow analysis – properties that are independent of execution paths
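A rough before/after illustration of constant propagation and dead code elimination, written by hand here (it is not the output of any particular compiler):

/* Before optimization */
int f(int n) {
    int x = 4;              /* x is a known constant                   */
    int y = x * 2;          /* constant propagation + folding: y = 8   */
    if (x > 10)             /* condition is provably false, so the     */
        y = y * n;          /*   assignment it guards is dead code     */
    return y + 1;
}

/* What the optimizer can reduce the body to */
int f_optimized(int n) {
    (void) n;               /* parameter kept only to match the signature */
    return 9;
}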
Phases of a Compiler
• Native Code Generation
• Intermediate code is translated into native code
• Register allocation, instruction selection
• Native Code Optimization
• Peephole optimizations – a small window of instructions is optimized at a time
Administration
• 1. HW1 on website: Fun with Lex/Yacc – compiling to assembly
• 2. Questionnaire results …
Useful Tools!
• tar – archiving program
• gzip/bzip2 – compression
• svn – version control
• make/SCons – build/run utilities
• Other useful tools:
• man
• which
• locate
• diff (or sdiff)
Makefiles
• target: dependent source file(s)
• <tab>command
[Dependency diagram: proj1 is built from data.o, main.o, and io.o, which in turn depend on data.c, data.h, main.c, io.c, and io.h]
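A minimal Makefile sketch for that dependency graph (the cc commands and the exact header dependencies are assumptions; each command line must begin with a real tab):

proj1: data.o main.o io.o
	cc -o proj1 data.o main.o io.o

data.o: data.c data.h        # rebuild data.o whenever data.c or data.h changes
	cc -c data.c

main.o: main.c data.h io.h
	cc -c main.c

io.o: io.c io.h
	cc -c io.c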
First Step: Lexical Analysis (Tokenizing)
• Breaking the program down into words or “tokens”
• Input: stream of characters
• Output: stream of names, keywords, punctuation marks
• Side effect: discards whitespace, comments
• Source code: if (b==0) a = "Hi";
• Token stream:
[Diagram: source code → Lexical Analysis → token stream → Parsing]
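For this fragment, a plausible token stream (the token names are illustrative, in the style of the next slides) would be: IF LPAREN ID(b) EQEQ NUM(0) RPAREN ID(a) ASSIGN STRING(Hi) SEMI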
Lexical Tokens
• Identifiers: x y11 elsex _i00
• Keywords: if else while break
• Integers: 2 1000 -500 5L
• Floating point: 2.0 0.00020 .02 1.1e5 0.e-10
• Symbols: + * { } ++ < << [ ] >=
• Strings: “x” “He said, \“Are you?\””
• Comments: /** ignore me **/
Lexical Tokens
float match0(char *s) /* find a zero */
{
  if (!strncmp(s, "0.0", 3))
    return 0.;
}
Token stream:
FLOAT ID(match0) _______ CHAR STAR ID(s) RPAREN LBRACE
IF LPAREN BANG _______ LPAREN ID(s) COMMA STRING(0.0) ______ NUM(3) RPAREN RPAREN
RETURN REAL(0.0) ______
RBRACE EOF
Ad-hoc Lexer
• Hand-write code to generate tokens
• How to read identifier tokens?
Token readIdentifier() {
  String id = "";
  while (true) {
    char c = input.peek();    // look ahead without consuming (assumes the input stream supports one-character lookahead)
    if (!identifierChar(c))
      return new Token(ID, id, lineNumber);
    id = id + c;              // append the identifier character ...
    input.read();             // ... and only now consume it
  }
}
Problems
• We don’t know what kind of token we are going to read from seeing the first character
• if a token begins with “i”, is it an identifier or a keyword such as “if”?
• if a token begins with “2”, is it an integer constant or a floating-point constant?
• interleaved tokenizer code is hard to write correctly, and harder to maintain
• A more principled approach: a lexer generator that generates an efficient tokenizer automatically (e.g., lex, flex)
Issues
• How to describe tokens unambiguously
• 2.e0 20.e-01 2.0000
• “” “x” “\\” “\”\’”
• How to break text down into tokens
• if (x == 0) a = x<<1;
• if (x == 0) a = x<1;
• How to tokenize efficiently
• tokens may have similar prefixes
• want to look at each character ~1 time
How To Describe Tokens
• Programming language tokens can be described using regular expressions
• A regular expression R describes some set of strings L(R)
• L(R) is the language defined by R
• L(abc) = { abc }
• L(hello|goodbye) = { hello, goodbye }
• L([1-9][0-9]*) = _______________
• Idea: define each kind of token using an RE
Regular Expressions
• Language – a set of strings
• String – a finite sequence of symbols
• Symbols – taken from a finite alphabet
• Specify languages using regular expressions
Convenient Shorthand
• [abcd] one of the listed characters (a | b | c | d)
• [b-g] [bcdefg]
• [b-gM-Qkr] ____________
• [^ab] anything but one of the listed chars
• [^a-f] ____________
• M? zero or one M
• M+ one or more M
• M* ____________
• “a.+*” literally a.+*
• . any single character (except \n)
Examples
• Regular expression → strings in L(R)
• digit = [0-9] → “0” “1” “2” “3” …
• posint = digit+ → “8” “412” …
• int = -? posint → “-42” “1024” …
• real = int (ε | (. posint)) → “-1.56” “12” “1.0”
• [a-zA-Z_][a-zA-Z0-9_]* → C identifiers
• Lexer generators support abbreviations like digit and posint above
• But the abbreviations cannot be recursive
More Examples
• Whitespace:
• Integers:
• Hex numbers:
• Valid UVa User Ids:
• Loop keywords in C:
Breaking up Text
• elsex=0; could be read as else x = 0 ; or as elsex = 0 ;
• REs alone are not enough: we need rules for choosing among possible tokenizations
• Most languages: the longest matching token wins
• even if a shorter token would be the only way to tokenize the rest of the input
• Ties in length are resolved by prioritizing tokens
• REs + priorities + longest-matching-token rule = lexer definition
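A minimal C sketch of the longest-match-plus-priority rule, with hand-written matchers standing in for compiled regular expressions (the matcher set and token names are illustrative):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Each matcher reports how many characters at the front of s it can
   consume (0 = no match).  Matchers are listed in priority order:
   the keyword before the identifier rule. */
static int match_else(const char *s)  { return strncmp(s, "else", 4) == 0 ? 4 : 0; }
static int match_ident(const char *s) { int n = 0; while (isalpha((unsigned char) s[n])) n++; return n; }
static int match_num(const char *s)   { int n = 0; while (isdigit((unsigned char) s[n])) n++; return n; }

static int (*matchers[])(const char *) = { match_else, match_ident, match_num };
static const char *names[] = { "ELSE", "ID", "NUM" };

/* Longest match wins; ties go to the earlier (higher-priority) matcher. */
static void tokenize(const char *s) {
    while (*s) {
        int best = -1, bestlen = 0;
        for (int i = 0; i < 3; i++) {
            int len = matchers[i](s);
            if (len > bestlen) { best = i; bestlen = len; }
        }
        if (best < 0) { s++; continue; }              /* skip punctuation here */
        printf("%s(%.*s) ", names[best], bestlen, s);
        s += bestlen;
    }
    printf("\n");
}

int main(void) {
    tokenize("elsex=0;");    /* prints ID(elsex) NUM(0), not ELSE ... */
    tokenize("else x=0;");   /* prints ELSE(else) ID(x) NUM(0)        */
    return 0;
}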
Lexer Generator Specification
• Input to the lexer generator:
• a list of regular expressions, in priority order
• an associated action for each RE (generates the appropriate kind of token, other bookkeeping)
• Output:
• a program that reads an input stream and breaks it up into tokens according to the REs (or reports a lexical error: “Unexpected character”)
Lex: A Lexical Analyzer Generator
• Lex produces a C program from a lexical specification
• http://www.epaperpress.com/lexandyacc/

DIGITS [0-9]+
ALPHA [A-Za-z]
CHARACTER {ALPHA}|_
IDENTIFIER {ALPHA}({CHARACTER}|{DIGITS})*
%%
if { return IF; }
{IDENTIFIER} { return ID; }
{DIGITS} { return NUM; }
([0-9]+"."[0-9]*)|([0-9]*"."[0-9]+) { return ____; }
. { error(); }

Note: the named definitions belong in the section before the first %%; the rules follow it.
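One way to try such a specification is to run it through lex or flex and link the generated scanner with a small hand-written driver; here is a sketch under the assumption that the token codes (IF, ID, NUM, …) are defined in a shared header named tokens.h:

/* driver.c – minimal driver for the scanner generated from a spec like the one above */
#include <stdio.h>
#include "tokens.h"        /* assumed header defining the IF, ID, NUM codes  */

extern int yylex(void);    /* produced by lex/flex from the specification    */
extern char *yytext;       /* lexeme of the most recent token (flex style)   */

int main(void) {
    int tok;
    while ((tok = yylex()) != 0)          /* yylex() returns 0 at end of input */
        printf("token %d: \"%s\"\n", tok, yytext);
    return 0;
}

With flex this is typically built along the lines of flex spec.l; cc lex.yy.c driver.c -lfl (classic lex uses -ll instead of -lfl).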
Lexer Generator
• Reads in a list of regular expressions R1, …, Rn, one per token, with attached actions
-?[1-9][0-9]* { return new Token(Tokens.IntConst,
                                 Integer.parseInt(yytext())); }
• Generates scanning code that decides:
• 1. whether the input is lexically well-formed
• 2. the corresponding token sequence
• Problem 1 is equivalent to deciding whether the input is in the language of the regular expression
• How can we efficiently test membership in L(R) for arbitrary R?
Regular Expression Matching
• Sketch of an efficient implementation:
• start in some initial state
• look at each input character in sequence, updating the scanner state accordingly
• if the state at the end of the input is an accepting state, the input string matches the RE
• For tokenizing, we only need a finite amount of state: a (deterministic) finite automaton (DFA), or finite state machine
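A minimal C sketch of that idea for a single RE, say [1-9][0-9]* (the state encoding is illustrative; a real generator would emit transition tables rather than a hand-written switch):

#include <stdio.h>

/* DFA for [1-9][0-9]*:
   state 0 = start, state 1 = accepting, state 2 = dead (reject). */
static int step(int state, char c) {
    switch (state) {
    case 0:  return (c >= '1' && c <= '9') ? 1 : 2;
    case 1:  return (c >= '0' && c <= '9') ? 1 : 2;
    default: return 2;
    }
}

/* Run the DFA over the whole string; accept iff we end in state 1. */
static int matches(const char *s) {
    int state = 0;
    for (; *s; s++)
        state = step(state, *s);
    return state == 1;
}

int main(void) {
    printf("%d %d %d\n", matches("412"), matches("0"), matches("41a"));   /* prints 1 0 0 */
    return 0;
}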
High Level View
• Regular expressions = specification
• Finite automata = implementation
• Every regex has an FSA that recognizes its language
[Diagram: at design time, a specification is fed to a Scanner Generator, which produces the Scanner; at compile time, the Scanner turns source code into tokens]
Finite Automata
• Takes an input string and determines whether it’s a valid sentence of a language
• A finite automaton has a finite set of states
• Edges lead from one state to another
• Edges are labeled with a symbol
• One state is the start state
• One or more states are the final (accepting) states
[Diagrams: an automaton for IF – state 0 goes to state 1 on i, state 1 goes to state 2 on f, state 2 accepts – and an automaton for ID – state 0 goes to state 1 on a-z (26 edges), state 1 loops on a-z and 0-9 and accepts]
Language
• Each string is accepted or rejected
• Starting in the start state
• the automaton follows one edge for every character (the edge must match the character)
• After n transitions for an n-character string, accept if the current state is a final state
• Language: the set of strings that the FSA accepts
[Diagram: combined automaton – from state 0, i leads to state 1 (accepts ID) and a-h or j-z leads to state 3 (accepts ID); from state 1, f leads to state 2 (accepts IF) and other [a-z0-9] characters lead to state 3; states 2 and 3 go to state 3 on [a-z0-9]]
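A small C sketch of that combined automaton, using the state numbering from the figure (the character tests are an interpretation of its edge labels):

#include <stdio.h>

static int is_idchar(char c) {
    return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9');
}

/* Transition function: state 0 = start, 1 = saw "i", 2 = saw "if",
   3 = some other identifier; -1 = reject.
   States 1 and 3 accept as ID, state 2 accepts as IF. */
static int step(int state, char c) {
    switch (state) {
    case 0:  if (c == 'i') return 1;
             if (c >= 'a' && c <= 'z') return 3;
             return -1;
    case 1:  if (c == 'f') return 2;
             return is_idchar(c) ? 3 : -1;
    case 2:
    case 3:  return is_idchar(c) ? 3 : -1;
    default: return -1;
    }
}

/* Run the automaton over the whole string and report which token the
   final state corresponds to. */
static const char *classify(const char *s) {
    int state = 0;
    for (; *s && state != -1; s++)
        state = step(state, *s);
    if (state == 2) return "IF";
    if (state == 1 || state == 3) return "ID";
    return "reject";
}

int main(void) {
    printf("%s %s %s\n", classify("if"), classify("iffy"), classify("x9"));   /* prints IF ID ID */
    return 0;
}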