LANGUAGE TRANSLATORS: WEEK 14
LECTURE: REGULAR EXPRESSIONS, FINITE STATE MACHINES, LEXICAL ANALYSERS, INTRO TO GRAMMAR THEORY
TUTORIAL: CAPTURING LANGUAGES USING REGULAR EXPRESSIONS
LEXICAL ANALYSIS
• Is the first step in the translation/compilation process: input language ====> output language
• Means grouping the raw characters of the input into TOKENS.
LEXICAL ANALYSIS PHASE
• The language of each kind of TOKEN (e.g. identifiers) is normally a regular language.
• REGULAR EXPRESSIONS generate regular languages (as do regular grammars). The tokens of a language are often specified by regular expressions.
• Finite State Machines recognise (accept) regular languages.
REGULAR EXPRESSIONS
• A one-line method of specifying a language
• Equivalent to 'type 3' or regular grammars
• Used to parameterize UNIX/LINUX file processing commands (e.g. grep, sed)
REGULAR EXPRESSIONS - DEFINITION

EXAMPLE               DEFINITION
a | b                 '|' means choice
a | b | c = [abc]     '[..]' is shorthand for multiple choice
e                     'e' means the empty word
(abc)*                '*' means repetition 0 or more times
(abcd)+               '+' means repetition 1 or more times
REGULAR EXPRESSIONS - EXAMPLES
• [a-zA-Z][a-zA-Z0-9]* defines the language of IDENTIFIERS in some programming languages
• (xyz)* defines the language {e, xyz, xyzxyz, xyzxyzxyz, ..}
• [abcd]+ defines the language {a, b, c, d, aa, ab, ac, ad, ba, bb, bc, bd, ca, ..}
Putting choice and repetition together produces complicated regular languages.
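The three example expressions can be tried out directly. The following is a minimal sketch using Java's standard java.util.regex package; the class name RegexExamples and the helper inLanguage are illustrative names, not part of the lecture material. Note that matches() tests the whole string, which is what "the string is in the language" means here.

```java
import java.util.regex.Pattern;

class RegexExamples {
    // The three expressions from the examples, in java.util.regex syntax.
    static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");
    static final Pattern XYZ_STAR = Pattern.compile("(xyz)*");
    static final Pattern ABCD_PLUS = Pattern.compile("[abcd]+");

    // True when the WHOLE string belongs to the language of the pattern.
    static boolean inLanguage(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(inLanguage(IDENTIFIER, "count1")); // true
        System.out.println(inLanguage(IDENTIFIER, "1count")); // false: starts with a digit
        System.out.println(inLanguage(XYZ_STAR, ""));         // true: '*' allows zero repetitions
        System.out.println(inLanguage(XYZ_STAR, "xyzx"));     // false: incomplete repetition
        System.out.println(inLanguage(ABCD_PLUS, ""));        // false: '+' needs at least one symbol
        System.out.println(inLanguage(ABCD_PLUS, "cab"));     // true
    }
}
```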
FINITE STATE MACHINES
• Can be defined by annotated nodes (states) and arcs (transitions).
• Regular expressions can be translated into FSMs, but ERROR STATES must be added to the FSMs to handle input for which no arc is defined.
Regular Expression ==> NDFSM, for the examples ab, [ab] and a*; then NDFSM ==> FSM.
[Figure: state diagrams for ab, [ab] and a*, with arcs labelled a and b]
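To make the idea of an FSM with an explicit error state concrete, here is a small hand-coded sketch for the identifier expression [a-zA-Z][a-zA-Z0-9]*. The state names START, IN_ID and ERROR and the class name IdentifierDfa are assumptions made for this illustration; a generated machine would normally use a transition table instead of a switch.

```java
class IdentifierDfa {
    // Three states: START (nothing read), IN_ID (accepting), ERROR (trap state).
    static final int START = 0, IN_ID = 1, ERROR = 2;

    // One transition of the machine: current state + input character -> next state.
    static int step(int state, char c) {
        boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
        boolean digit = c >= '0' && c <= '9';
        switch (state) {
            case START: return letter ? IN_ID : ERROR;
            case IN_ID: return (letter || digit) ? IN_ID : ERROR;
            default:    return ERROR; // once in ERROR, the machine stays there
        }
    }

    // Runs the machine over the whole input; one step per character, so linear time.
    static boolean accepts(String s) {
        int state = START;
        for (int i = 0; i < s.length(); i++) {
            state = step(state, s.charAt(i));
        }
        return state == IN_ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("x27"));  // true
        System.out.println(accepts("27x"));  // false: first character is a digit
        System.out.println(accepts("a-b"));  // false: '-' leads to the error state
    }
}
```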
EXAMPLE
• Specify a language over the alphabet {w, x, y, z} with the only restrictions being that
  1. no string contains both x and y, and
  2. if there is a y and a w in a string, then the first w ALWAYS occurs before the first y
SOLUTION:
• 1. Write down examples and counter-examples
• 2. Decide on any ambiguities
• 3. Use case analysis to sub-divide the problem:
  language = (a) strings of {w, x, z} UNION (b) strings of {w, y, z} with restriction 2.
  - Part (a): = [wxz]+
  - Part (b): can assume y is always in a string = [yz]+ | z* w [wz]* y [wyz]*
    (after the first y only w, y and z may follow, because restriction 1 excludes x)
  - Put together: answer = [wxz]+ | [yz]+ | z* w [wz]* y [wyz]*
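One way to gain confidence in an answer like this is to compare it against the two restrictions by brute force. The sketch below is an illustration only: the class and method names are made up, the expression is written in java.util.regex syntax, and the tail after the first y is taken to be [wyz]* because a string containing y may not contain x. It enumerates every non-empty string over {w, x, y, z} up to a chosen length and counts disagreements between the expression and a direct check of the restrictions.

```java
import java.util.regex.Pattern;

class LanguageCheck {
    // Candidate answer: [wxz]+ | [yz]+ | z* w [wz]* y [wyz]*
    static final Pattern ANSWER = Pattern.compile("[wxz]+|[yz]+|z*w[wz]*y[wyz]*");

    // Direct encoding of the two restrictions from the problem statement.
    static boolean satisfiesSpec(String s) {
        int firstW = s.indexOf('w');
        int firstY = s.indexOf('y');
        boolean hasX = s.indexOf('x') >= 0;
        if (hasX && firstY >= 0) return false;                           // restriction 1
        if (firstW >= 0 && firstY >= 0 && firstW > firstY) return false; // restriction 2
        return true;
    }

    // Counts non-empty strings (extending prefix, up to maxLen) where regex and spec disagree.
    static int countMismatches(String prefix, int maxLen) {
        int mismatches = 0;
        if (!prefix.isEmpty()
                && ANSWER.matcher(prefix).matches() != satisfiesSpec(prefix)) {
            mismatches++;
        }
        if (prefix.length() < maxLen) {
            for (char c : "wxyz".toCharArray()) {
                mismatches += countMismatches(prefix + c, maxLen);
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches("", 6)); // mismatches: 0
    }
}
```

The empty string is skipped because the expression uses '+' and so never generates it; whether the empty string should belong to the language is one of the ambiguities that step 2 of the method asks you to decide.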
A LEXICAL ANALYSER - GENERATOR (e.g. LEX, JLEX) - how they work
• INPUT REGULAR EXPRESSIONS
• TRANSLATE EACH REGULAR EXPRESSION INTO A NON-DETERMINISTIC FSM
• TRANSLATE THE NON-DETERMINISTIC FSM INTO A DETERMINISTIC FSM (which is easily described as a simple program)
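The step from a non-deterministic to a deterministic FSM is usually done with the subset construction, in which each deterministic state stands for a set of non-deterministic states. The following is a simplified sketch, not a description of how LEX or JLEX are actually implemented: it assumes an NFA without empty-word moves, and the example NFA (for (a|b)*ab), the class name and the method names are all hypothetical. It needs Java 9 or later for Set.of and Map.of.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstruction {
    // Hypothetical NFA for (a|b)*ab: 0 loops on a and b, 0 -a-> 1, 1 -b-> 2 (accepting).
    static Map<Integer, Map<Character, Set<Integer>>> exampleNfa() {
        Map<Character, Set<Integer>> from0 = Map.of('a', Set.of(0, 1), 'b', Set.of(0));
        Map<Character, Set<Integer>> from1 = Map.of('b', Set.of(2));
        Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
        nfa.put(0, from0);
        nfa.put(1, from1);
        return nfa;
    }

    // Builds the DFA: each DFA state is the set of NFA states reachable so far.
    // The empty set plays the role of the ERROR state.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> determinize(
            Map<Integer, Map<Character, Set<Integer>>> nfa, int start, Set<Character> alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new HashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> startSet = Set.of(start);
        dfa.put(startSet, new HashMap<>());
        work.add(startSet);
        while (!work.isEmpty()) {
            Set<Integer> current = work.remove();
            for (char c : alphabet) {
                Set<Integer> next = new TreeSet<>();
                for (int s : current) {
                    next.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(c, Set.of()));
                }
                dfa.get(current).put(c, next);
                if (!dfa.containsKey(next)) { // a DFA state not seen before
                    dfa.put(next, new HashMap<>());
                    work.add(next);
                }
            }
        }
        return dfa;
    }

    // Runs the deterministic machine built from the example NFA over the input.
    static boolean demoAccepts(String input) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa =
                determinize(exampleNfa(), 0, Set.of('a', 'b'));
        Set<Integer> state = Set.of(0);
        for (char c : input.toCharArray()) {
            Set<Integer> next = dfa.get(state).get(c);
            if (next == null) return false; // symbol outside the alphabet
            state = next;
        }
        return state.contains(2); // accepting if NFA state 2 is in the set
    }

    public static void main(String[] args) {
        System.out.println(demoAccepts("bbab")); // true
        System.out.println(demoAccepts("aba"));  // false
    }
}
```

For this NFA the construction reaches three deterministic states: {0}, {0,1} and {0,2}.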
EXAMPLE INPUT TO A LEXICAL ANALYSER - GENERATOR

%%
";" { return new Symbol(sym.SEMI); }
"+" { return new Symbol(sym.PLUS); }
"*" { return new Symbol(sym.TIMES); }
"(" { return new Symbol(sym.LPAREN); }
")" { return new Symbol(sym.RPAREN); }
[0-9]+ { return new Symbol(sym.NUMBER, new Integer(yytext())); }
[ \t\r\n\f] { /* ignore white space. */ }
. { System.err.println("Illegal character: "+yytext()); }

Example: if the string (231+3)*3 were input to the generated lexical analyser, the output would be:
LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
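The behaviour of such a specification can be imitated in plain Java for experimentation. The sketch below is not generated JLEX output and does not use the Symbol or sym classes; it is an approximation built on java.util.regex, with one named group per rule, and the class name MiniLexer is made up. Regex alternation picks the first alternative that matches rather than the longest match that LEX-style tools use, which happens to make no difference for these particular rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MiniLexer {
    // One alternative per rule of the specification, in the same order.
    private static final Pattern RULES = Pattern.compile(
            "(?<SEMI>;)|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<LPAREN>\\()|(?<RPAREN>\\))"
            + "|(?<NUMBER>[0-9]+)|(?<WS>[ \\t\\r\\n\\f])|(?<BAD>.)");

    private static final String[] TOKEN_NAMES =
            {"SEMI", "PLUS", "TIMES", "LPAREN", "RPAREN", "NUMBER"};

    // Returns the token stream as strings, e.g. "PLUS" or "(NUMBER,231)".
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = RULES.matcher(input);
        while (m.find()) {
            if (m.group("WS") != null) {
                continue; // ignore white space
            }
            if (m.group("BAD") != null) {
                System.err.println("Illegal character: " + m.group());
                continue;
            }
            for (String name : TOKEN_NAMES) {
                if (m.group(name) != null) {
                    tokens.add(name.equals("NUMBER") ? "(NUMBER," + m.group() + ")" : name);
                    break;
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints: LPAREN (NUMBER,231) PLUS (NUMBER,3) RPAREN TIMES (NUMBER,3)
        System.out.println(String.join(" ", tokenize("(231+3)*3")));
    }
}
```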
SIMPLE LEXICAL ANALYSER

import java_cup.runtime.Symbol; /* token class from the CUP runtime (assumed) */

public class scanner {
  /* single lookahead character */
  protected static int next_char;

  /* advance input by one character */
  protected static void advance() throws java.io.IOException {
    next_char = System.in.read();
  }

  /* initialize the scanner by reading the first character */
  public static void init() throws java.io.IOException {
    advance();
  }

  /* recognize and return the next complete token */
  public static Symbol next_token() throws java.io.IOException {
    for (;;)
      switch (next_char) {
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
          /* parse a decimal integer */
          int i_val = 0;
          do {
            i_val = i_val * 10 + (next_char - '0');
            advance();
          } while (next_char >= '0' && next_char <= '9');
          return new Symbol(sym.INT, new Integer(i_val));
        case 'p': advance(); return new Symbol(sym.PRINT);
        case 'r': advance(); return new Symbol(sym.REPEAT);
        case 'u': advance(); return new Symbol(sym.UNTIL);
        case '=': advance(); return new Symbol(sym.ASSIGNS);
        case ';': advance(); return new Symbol(sym.SEMI);
        case '+': advance(); return new Symbol(sym.PLUS);
        case '-': advance(); return new Symbol(sym.MINUS);
        case '(': advance(); return new Symbol(sym.LPAREN);
        case ')': advance(); return new Symbol(sym.RPAREN);
        case 'x': advance(); return new Symbol(sym.ID,"x");
        case 'y': advance(); return new Symbol(sym.ID,"y");
        case 'z': advance(); return new Symbol(sym.ID,"z");
        case -1: return new Symbol(sym.EOF); /* end of input */
        default: advance(); break; /* skip any other character */
      }
  }
}
INTRODUCTION TO GRAMMAR THEORY
• Grammars can be used to generate the syntax of formal languages; the structural complexity of a language is determined by the simplest grammar that can generate it.
• In order to create parsers, we are interested in "properties of grammars". For example, the "first set" of a string w of terminals and non-terminals is the set of TERMINAL symbols (tokens) that may be at the front of ANY string derived from w using the grammar rules.
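As an illustration of the "first set" property, the sketch below computes first sets by repeating the rules until nothing changes. Everything in it is an assumption made for the example: the small expression grammar (E -> T Etail, Etail -> PLUS T Etail | empty, T -> NUMBER | LPAREN E RPAREN), the class and method names, and the common textbook convention of adding a marker (here the string EPSILON) to the first set when the whole string can derive the empty word.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class FirstSets {
    static final String EPSILON = "EPSILON"; // marker: "can derive the empty word"

    // Hypothetical grammar: keys are non-terminals, every other symbol is a terminal.
    static Map<String, List<List<String>>> exampleGrammar() {
        Map<String, List<List<String>>> g = new HashMap<>();
        g.put("E", List.of(List.of("T", "Etail")));
        g.put("Etail", List.of(List.of("PLUS", "T", "Etail"), List.of()));
        g.put("T", List.of(List.of("NUMBER"), List.of("LPAREN", "E", "RPAREN")));
        return g;
    }

    // First set of a string w of symbols, given the first sets known so far.
    static Set<String> firstOfString(List<String> w, Map<String, List<List<String>>> g,
                                     Map<String, Set<String>> first) {
        Set<String> result = new TreeSet<>();
        for (String symbol : w) {
            if (!g.containsKey(symbol)) { // terminal: it is the first token
                result.add(symbol);
                return result;
            }
            Set<String> f = first.get(symbol);
            for (String t : f) {
                if (!t.equals(EPSILON)) result.add(t);
            }
            if (!f.contains(EPSILON)) return result; // symbol cannot vanish, so stop
        }
        result.add(EPSILON); // w is empty, or every symbol in w can vanish
        return result;
    }

    // Iterates over all rules until no first set grows any more.
    static Map<String, Set<String>> firstSets(Map<String, List<List<String>>> g) {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nonTerminal : g.keySet()) first.put(nonTerminal, new TreeSet<>());
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, List<List<String>>> rule : g.entrySet()) {
                for (List<String> rhs : rule.getValue()) {
                    changed |= first.get(rule.getKey()).addAll(firstOfString(rhs, g, first));
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        Map<String, List<List<String>>> g = exampleGrammar();
        Map<String, Set<String>> first = firstSets(g);
        System.out.println(first.get("E"));     // [LPAREN, NUMBER]
        System.out.println(first.get("Etail")); // [EPSILON, PLUS]
        System.out.println(firstOfString(List.of("Etail", "RPAREN"), g, first)); // [PLUS, RPAREN]
    }
}
```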
SUMMARY
• Regular expressions are a quick and easy way to specify simple forms of language. They can be easily translated into FSMs, which have nice properties (e.g. they run in time linear in the length of the input).
• There are tools (e.g. JLEX) which take regular expressions as input and output a lexical analyser that recognises the language they define.