Lecture 3 Introduction to JLex: a lexical analyzer generator for Java
The role of JLex
• Lscanner.lex → JLex → Lscanner.java → javac → Lscanner.class …
• input.L (char stream) → Lscanner.class → Ltokens… (token stream)

JLex Specifications
    user code
    %% // must be at the beginning of a line
    JLex directives
    %% // must be at the beginning of a line
    lexical rules
• Each spec file consists of three sections, separated by %%.
• User code is copied to the output file.
• Directives include macro and state definitions, among others.
• The third section contains the rules of lexical analysis, each of which consists of three parts: an optional state list, a regular expression, and an action.

The layout of the generated file
    %userCode                 // from 1st section: package, import decls + utility classes
    %public class %class [implements %implements] {
        %internalCode         // from %{ … %} directive
        // 2 constructors
        %public %class(InputStream is) [throws %initthrow] {
            %initCode         // from %init{ … %init} directive
            …
        }
        // and
        %public %class(Reader is) throws … { … }
        // main method for requesting the next token
        %public %type %function() [throws %yylexthrow] {
            …
            // if eof => return (%eofValue)
            …
        }
        // method to be called after eof is encountered
        private void yy_do_eof() [throws %eofthrow] { ... %eofCode ... }
        …
    }

JLex Directives
1. Internal Code to Lexical Analyzer Class
2. Initialization Code for Lexical Analyzer Class
3. End-of-File Code for Lexical Analyzer Class
4. Macro Definitions
5. State Declarations
6. Character Counting
7. Line Counting
8. Java CUP Compatibility
9. Lexical Analyzer Component Titles
10. Default Token Type: int
11. Default Token Type II: Wrapped Integer
12. YYEOF on End-of-File
13. Newlines and Operating System Compatibility
14. Character Sets
15. Character Format To and From File
16. Exceptions Generated by Lexical Actions
17. Specifying the Return Value on End-of-File
18. Specifying an Interface to Implement
19. Making the Generated Class Public

Directives for determining the names of various components of the lexer
• The name of the generated class (as well as the file name)
    %class className // default is Yylex
• The interface the lexer class implements
    %implements interfaceName
• The name and return type of the method that gets the next token
    %function methodName // default is yylex
    %type typeName // default is Yytoken
• Make the lexer class public
    %public

Directives for position information
• Enabling the counting of character position
    %char // private int yychar declared
• Enabling the counting of line information
    %line // private int yyline declared
Notes:
• yychar and yyline are zero-based.
• yychar records the position of the beginning of the current token in the input stream.
• yylength() (always enabled) gives the length of the text the current token consumes.

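The zero-based bookkeeping described in the notes above can be illustrated with a small stand-alone sketch. It is not the code JLex generates: the class name PositionSketch and the whitespace-separated "tokens" are assumptions made only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of zero-based position tracking; not JLex output.
class PositionSketch {
    /** Returns {yychar, yyline, yylength} for each whitespace-separated word. */
    static List<int[]> positions(String input) {
        List<int[]> out = new ArrayList<>();
        int line = 0;                       // zero-based, like yyline
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                if (c == '\n') line++;      // count lines while skipping whitespace
                i++;
                continue;
            }
            int start = i;                  // zero-based offset of the token start, like yychar
            while (i < input.length() && !Character.isWhitespace(input.charAt(i))) i++;
            out.add(new int[] { start, line, i - start });
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] p : positions("ab cd\nefg")) {
            System.out.println("yychar=" + p[0] + " yyline=" + p[1] + " yylength=" + p[2]);
        }
    }
}
```

For the input "ab cd\nefg" the third word starts at offset 6 on line 1, because both counters start from zero.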
Java code to be put in various parts of the generated file
• user code to be put outside the lexer class: [all text from 1st section] // before first %%
• user code to be put inside the lexer class
• user code to be put inside the constructors of the lexer class
• user code to be put inside the body of the yy_do_eof() method
• value to be returned when eof is encountered

User code to be put inside the lexer class
• Format:
    %{ // at the beginning of a line
    <internal code>
    %} // at the beginning of a line
• Permits the declaration of variables and methods inside the generated lexer class.
• Corresponds to the %internalCode region.

User code to be put inside all constructors of the lexer class
• Format:
    %init{ // at the beginning of a line
    <initCode>
    %init} // at the beginning of a line
• Corresponds to the %initCode region.
• Exceptions thrown should be declared by the directive:
    %initthrow{
    Exception0, …, ExceptionN
    %initthrow} // corresponds to the %initthrow region

Directives for specifying the input alphabet
    %full
    %unicode
• The default alphabet is ASCII (0~127).
• %full => 0~255; %unicode => 0~65535.
    %ignorecase
• Upper case and lower case letters are regarded as the same.

Directives related to eof processing
• Specifying the Return Value on End-of-File
    %eofval{
    eofValue
    %eofval}
• YYEOF on End-of-File
    %yyeof
• Notes:
    • %yyeof enables the declaration public final int YYEOF = -1; in the lexer.
    • %yyeof is implied by the directive %integer.

User code to be executed when end of file is encountered
• Format:
    %eof{ // at the beginning of a line
    <eofCode>
    %eof} // at the beginning of a line
• Corresponds to the %eofCode region.
• Exceptions thrown should be declared by the directive:
    %eofthrow{
    Exception0, …, ExceptionN
    %eofthrow} // corresponds to the %eofthrow region

Specifying the type of the returned token
    %type typeName
    %integer // equivalent to %type int
    %intwrap // equivalent to %type java.lang.Integer
Notes:
• The default type is Yytoken (it needs to be declared elsewhere, say, in user code).
• null is returned for the eof token if the returned type is not primitive.
• YYEOF (-1) is returned for %integer.

Java CUP Compatibility
    %cup
• This directive makes the generated scanner conform to the java_cup.runtime.Scanner interface.
• It has the same effect as the following three directives:
    %implements java_cup.runtime.Scanner
    %function next_token
    %type java_cup.runtime.Symbol

Newlines and Operating System Compatibility
• A new line is represented differently in UNIX and DOS based OSs:
    unix => \n
    dos => \r\n
• The directive %notunix causes the lexer to recognize either \r or \n as a new line.

Exceptions Generated by Lexical Actions
• Format:
    %yylexthrow{
    Exception0, …, ExceptionN
    %yylexthrow}
• Notes:
    • Mapped to the %yylexthrow region.
    • These are the exceptions that may be thrown from within the action code of lexical rules.

State Declarations
• Format: %state state0, …, stateN
• Notes:
    • state0, …, stateN must be on the same line.
    • There can be more than one %state declaration.
    • State names should be valid identifiers.
    • Each stateK is declared as an int constant in the lexer class.
    • A special state YYINITIAL is implicitly declared, and the lexer begins its analysis in this state.

Macro Definitions
• Used to name and define sets of strings for later use in lexical rules.
• Format: MacroName=MacroDefinition
• Notes:
    • Each macro definition is contained on a single line.
    • MacroName should be a valid identifier: (letter|_)(letter|digit|_)*
    • MacroDefinition should be a valid regular expression (regular expressions are defined later).
    • MacroDefinition may contain other macro expansions in the form {otherMacroName}, but recursion is not permitted.

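To make the {otherMacroName} expansion concrete, here is a minimal sketch of textual macro expansion. It is an illustration under stated assumptions, not JLex's implementation: the class MacroSketch is hypothetical, each macro may only use macros defined before it (which rules out recursion by construction), every expansion is wrapped in parentheses to preserve grouping, and braces inside quotes or after a backslash are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of {name} macro expansion; not JLex's own code.
class MacroSketch {
    // A macro use: {name}, where name is (letter|_)(letter|digit|_)*
    private static final Pattern USE = Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

    /** Replaces each {name} in definition with the text of an already defined macro. */
    static String expand(String definition, Map<String, String> macros) {
        Matcher m = USE.matcher(definition);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String body = macros.get(m.group(1));
            if (body == null) {
                throw new IllegalArgumentException("undefined macro: " + m.group(1));
            }
            // Copy the text before the use, then the parenthesized macro body.
            sb.append(definition, last, m.start()).append('(').append(body).append(')');
            last = m.end();
        }
        return sb.append(definition.substring(last)).toString();
    }

    public static void main(String[] args) {
        Map<String, String> macros = new LinkedHashMap<>();
        macros.put("ALPHA", "[A-Za-z]");
        macros.put("DIGIT", "[0-9]");
        macros.put("IDENT", expand("{ALPHA}({ALPHA}|{DIGIT}|_)*", macros));
        System.out.println(macros.get("IDENT"));
    }
}
```

Because a macro is expanded when it is defined, a definition that mentions itself hits the "undefined macro" branch instead of looping forever.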
Lexical Rules
• Format: [<state1, …, stateN>] expression { actionCode }
• Notes:
    • All stateKs must have been declared by %state.
    • The rule is activated only when the lexer is in one of the states listed in the state list.
    • If the state list is omitted, the rule is always activated.
    • The intuitive meaning of the rule is as follows: if the lexer is in one of the states in the list and the substring from the current position matches the expression, then execute the actionCode.

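The activation rule above can be written down as a small predicate. This is only a sketch: the class StateRuleSketch, its nested Rule type, and the numeric state values are hypothetical and chosen for illustration; they do not mirror the tables JLex generates.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of state-list activation; not JLex output.
class StateRuleSketch {
    static final int YYINITIAL = 0;   // implicit start state
    static final int COMMENT = 1;     // as if declared by: %state COMMENT

    /** A rule with an optional state list; an empty list means "active in every state". */
    static final class Rule {
        final String name;
        final Set<Integer> states;

        Rule(String name, Integer... states) {
            this.name = name;
            this.states = new HashSet<>(Arrays.asList(states));
        }

        boolean isActive(int currentState) {
            return states.isEmpty() || states.contains(currentState);
        }
    }

    public static void main(String[] args) {
        Rule open = new Rule("open-comment", YYINITIAL, COMMENT);
        Rule close = new Rule("close-comment", COMMENT);
        Rule any = new Rule("no-state-list");
        System.out.println(open.isActive(YYINITIAL));   // listed state
        System.out.println(close.isActive(YYINITIAL));  // not listed
        System.out.println(any.isActive(COMMENT));      // omitted list
    }
}
```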
Conflict resolution
• What happens if more than one rule matches strings from the input?
    1. Choose the rule that matches the longest string.
    2. If more than one rule matches strings of the same length, choose the rule that is given first in the JLex specification.
• Therefore, among rules that match the same longest string, rules appearing earlier in the specification are given higher priority by the generated lexer.

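The two resolution steps can be sketched with java.util.regex as a stand-in for the generated automaton. The class ConflictSketch and the four sample rules are assumptions for illustration only. Note that Java's regex engine is a backtracking matcher, so lookingAt() is only guaranteed to return the longest match for simple patterns like the ones used here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of "longest match, then first rule"; not JLex output.
class ConflictSketch {
    /** Returns the index of the winning rule at position pos, or -1 if none matches. */
    static int pickRule(Pattern[] rules, String input, int pos) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < rules.length; i++) {
            Matcher m = rules[i].matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt()) {
                int len = m.end() - pos;
                // Strict '>' keeps the earlier rule when two matches have equal length.
                if (len > bestLen) {
                    best = i;
                    bestLen = len;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Pattern[] rules = {
            Pattern.compile("if"),       // rule 0: keyword
            Pattern.compile("[a-z]+"),   // rule 1: identifier
            Pattern.compile("<"),        // rule 2
            Pattern.compile("<=")        // rule 3
        };
        System.out.println(pickRule(rules, "if", 0));    // tie on length: earlier rule wins
        System.out.println(pickRule(rules, "iffy", 0));  // identifier match is longer
        System.out.println(pickRule(rules, "<=", 0));    // longer match beats earlier rule
    }
}
```

The last call shows why listing "<" before "<=" is harmless: the longest-match step is applied before specification order is consulted.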
Regular Expressions
• The default alphabet for JLex is the ASCII character set, meaning character codes between 0 and 127 inclusive.
• Non-newline white space in expressions is not allowed unless it is within double quotes " … " or immediately after \.
• Metacharacters are chars with special meanings in JLex regular expressions:
    ? * + | ( ) ^ $ . [ ] { } " \
• Other chars represent themselves.

Escape sequences for characters
• \ddd     The character with octal number ddd
• \xdd     The character with hexadecimal number dd
• \udddd   The Unicode character with hexadecimal number dddd
• \b       Backspace
• \n       Newline
• \t       Tab
• \f       Formfeed
• \r       Carriage return
• \^C      Control character (0~31: \^@, \^A, …, \^Z, \^[, \^\, \^], \^^, \^_)
• \c       A backslash followed by any other character c matches c itself. Ex: \\, \a, \B, \", \', etc.
• $ denotes the end of a line.
• . matches any character except the newline; equivalent to [^\n].

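The escape table can be read as a small decoding function. The sketch below is a hypothetical helper (EscapeSketch.decode), not part of JLex, and it assumes the fixed digit counts shown in the table: three octal digits, two hexadecimal digits, and four hexadecimal digits for \u.

```java
// Hypothetical illustration of the escape table; not JLex's own parser.
class EscapeSketch {
    /** Decodes one escape sequence that starts with a backslash. */
    static char decode(String s) {
        if (s.length() < 2 || s.charAt(0) != '\\') {
            throw new IllegalArgumentException("not an escape: " + s);
        }
        char c = s.charAt(1);
        switch (c) {
            case 'b': return '\b';
            case 'n': return '\n';
            case 't': return '\t';
            case 'f': return '\f';
            case 'r': return '\r';
            case 'x': return (char) Integer.parseInt(s.substring(2, 4), 16);  // \xdd
            case 'u': return (char) Integer.parseInt(s.substring(2, 6), 16);  // \udddd
            case '^': return (char) (s.charAt(2) - '@');                      // \^@ is 0, \^_ is 31
            default:
                if (c >= '0' && c <= '7') {
                    return (char) Integer.parseInt(s.substring(1, 4), 8);     // \ddd
                }
                return c;                                                     // \c matches c itself
        }
    }

    public static void main(String[] args) {
        System.out.println((int) decode("\\012"));    // octal 12
        System.out.println((int) decode("\\x41"));    // hexadecimal 41
        System.out.println((int) decode("\\^C"));     // control-C
        System.out.println(decode("\\a"));            // plain character
    }
}
```

This also explains why \012 appears in the whitespace macros of the sample specification: octal 12 is the newline character.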
More on regular expressions
• "…aString…" denotes aString.
    • Metacharacters in aString lose their meaning and represent themselves.
    • The sequence \", which represents ", is the only exception.
    • Ex: "ab d\\\"" stands for ab d\\"
• {name} denotes a macro expansion.
• E1E2 : concatenation
• E1|E2 : choice
• E+ or (E)+ : one or more repetitions of E
• E* or (E)* : zero or more repetitions of E
• E? or (E)? : zero or one repetition of E
• (E) : ( … ) is used for grouping.

More on regular expressions
• [...]
    • Square brackets denote a class of characters and match any one character enclosed in the brackets.
    • Substrings inside with special meaning:
        • {name} : macro expansion
        • a-b : range of characters from a to b
        • "String" means String, with metachars losing their special meaning.
        • \a means a, where a is any character.
    • [^Rest] means Σ – [Rest], i.e. any character of the alphabet Σ that is not in [Rest].

More on regular expressions
• Ex:
    • [a-z] matches a, b, …, z.
    • [^0-9] matches any char but 0, 1, …, 9.
    • [\"\\] matches " or \.
    • ["a-z"] matches a, - and z.
    • [-0-9] matches -, 0, …, 9.
    • How about [\b\f"\r\t"] ?

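Several of these classes can be tried out with java.util.regex, whose bracket syntax happens to coincide with JLex's for ranges, negation, backslash escapes, and a leading hyphen. This is only an analogy: the helper CharClassSketch is hypothetical, and Java has no quoted form inside brackets, so the JLex class ["a-z"] is approximated here by the Java class [a\-z].

```java
import java.util.regex.Pattern;

// Hypothetical illustration using Java regex classes as a stand-in for JLex classes.
class CharClassSketch {
    /** True if the single character c belongs to the character class cls. */
    static boolean matches(String cls, char c) {
        return Pattern.matches(cls, String.valueOf(c));
    }

    public static void main(String[] args) {
        System.out.println(matches("[a-z]", 'q'));          // inside the range
        System.out.println(matches("[^0-9]", '5'));         // excluded by negation
        System.out.println(matches("[\\\"\\\\]", '\\'));    // class containing " and \
        System.out.println(matches("[-0-9]", '-'));         // leading hyphen is literal
        System.out.println(matches("[a\\-z]", 'b'));        // only a, - and z
    }
}
```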
Lexical Actions
• Format: { action }
• Notes: All curly braces contained in the action that are not part of strings or comments should be balanced.
• Actions and recursion:
    • If an action does not return a value, the lexical analyzer searches for the next match in the input stream and returns the value associated with that match.
    • The lexical analyzer can be made to recur explicitly with a call to yylex(), as in the following code fragment:
        { ... return yylex(); ... }

More on lexical actions
• State transitions are made by the function call yybegin(state);
• Available lexical methods / vars:
    String yytext()    matched portion of the character input stream
    int yylength()     length of yytext()
    int yychar;
    int yyline;

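The interplay of yybegin and actions that return nothing can be sketched with a tiny hand-written scanner for nested comments. Everything here is an assumption made for illustration: the class CommentStateSketch, its one-character "tokens", and the depth counter (which plays the role of comment_count in the sample specification) are not generated by JLex.

```java
// Hypothetical hand-written scanner illustrating yybegin and non-returning actions.
class CommentStateSketch {
    static final int YYINITIAL = 0;
    static final int COMMENT = 1;

    private final String in;
    private int pos = 0;
    private int state = YYINITIAL;   // current lexical state
    private int depth = 0;           // nesting depth of open comments

    CommentStateSketch(String in) {
        this.in = in;
    }

    private void yybegin(int newState) {
        state = newState;
    }

    /** Returns the next character outside comments as a token, or null at end of input. */
    String next() {
        while (pos < in.length()) {
            if (in.startsWith("/*", pos)) {          // active in YYINITIAL and COMMENT
                pos += 2;
                depth++;
                yybegin(COMMENT);
                continue;                            // no return value: keep scanning
            }
            if (state == COMMENT) {
                if (in.startsWith("*/", pos)) {
                    pos += 2;
                    if (--depth == 0) yybegin(YYINITIAL);
                } else {
                    pos++;                           // comment text is discarded
                }
                continue;
            }
            return String.valueOf(in.charAt(pos++)); // an action that returns a token
        }
        return null;
    }

    public static void main(String[] args) {
        CommentStateSketch lexer = new CommentStateSketch("a/*x/*y*/z*/b");
        for (String t = lexer.next(); t != null; t = lexer.next()) {
            System.out.println(t);
        }
    }
}
```

The `continue` statements correspond to actions without a return: the scanner simply looks for the next match, which is the behaviour described for lexical actions above.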
Performance
    Size of Source File    JLex-Generated Lexer Execution Time    Hand-Written Lexer Execution Time
    177 lines              0.42 seconds                           0.53 seconds
    897 lines              0.98 seconds                           1.28 seconds
• On both inputs the JLex-generated lexical analyzer outperformed the hand-written lexer.

Example
import java.lang.System;

class Sample {
    public static void main(String argv[]) throws java.io.IOException {
        Yylex yy = new Yylex(System.in);
        Yytoken t;
        while ((t = yy.yylex()) != null)
            System.out.println(t);
    }
}

class Utility {
    // Note: assert has been a reserved word since Java 1.4, so current compilers
    // reject this method name; rename it together with its call sites in the rules.
    public static void assert(boolean expr) {
        if (false == expr) {
            throw (new Error("Error: Assertion failed."));
        }
    }

    private static final String errorMsg[] = {
        "Error: Unmatched end-of-comment punctuation.",
        "Error: Unmatched start-of-comment punctuation.",
        "Error: Unclosed string.",
        "Error: Illegal character."
    };

    // Indices into errorMsg
    public static final int E_ENDCOMMENT = 0;
    public static final int E_STARTCOMMENT = 1;
    public static final int E_UNCLOSEDSTR = 2;
    public static final int E_UNMATCHED = 3;

    public static void error(int code) {
        System.out.println(errorMsg[code]);
    }
}

class Yytoken {
    Yytoken(int index, String text, int line, int charBegin, int charEnd) {
        m_index = index;
        m_text = new String(text);
        m_line = line;
        m_charBegin = charBegin;
        m_charEnd = charEnd;
    }

    public int m_index;
    public String m_text;
    public int m_line;
    public int m_charBegin;
    public int m_charEnd;

    public String toString() {
        return "Token #" + m_index + ": " + m_text + " (line " + m_line + ")";
    }
}

%%
%{
  private int comment_count = 0;
%}
%line
%char
%state COMMENT
ALPHA=[A-Za-z]
DIGIT=[0-9]
NONNEWLINE_WHITE_SPACE_CHAR=[\ \t\b\012]
WHITE_SPACE_CHAR=[\n\ \t\b\012]
STRING_TEXT=(\\\"|[^\n\"]|\\{WHITE_SPACE_CHAR}+\\)*
COMMENT_TEXT=([^/*\n]|[^*\n]"/"[^*\n]|[^/\n]"*"[^/\n]|"*"[^/\n]|"/"[^*\n])*
%%

<YYINITIAL> "," { return (new Yytoken(0,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":" { return (new Yytoken(1,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ";" { return (new Yytoken(2,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "(" { return (new Yytoken(3,yytext(),yyline,yychar,yychar+1)); }
…
<YYINITIAL> "<>" { return (new Yytoken(15,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "<" { return (new Yytoken(16,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> "<=" { return (new Yytoken(17,yytext(),yyline,yychar,yychar+2)); }
…
<YYINITIAL> "|" { return (new Yytoken(21,yytext(),yyline,yychar,yychar+1)); }
<YYINITIAL> ":=" { return (new Yytoken(22,yytext(),yyline,yychar,yychar+2)); }

<YYINITIAL> {NONNEWLINE_WHITE_SPACE_CHAR}+ { }
<YYINITIAL,COMMENT> \n { }
<YYINITIAL> "/*" { yybegin(COMMENT); comment_count = comment_count + 1; }
<COMMENT> "/*" { comment_count = comment_count + 1; }
<COMMENT> "*/" {
    comment_count = comment_count - 1;
    Utility.assert(comment_count >= 0);
    if (comment_count == 0) {
        yybegin(YYINITIAL);
    }
}
<COMMENT> {COMMENT_TEXT} { }
<YYINITIAL> \"{STRING_TEXT}\" {
    String str = yytext().substring(1,yytext().length() - 1);
    Utility.assert(str.length() == yytext().length() - 2);
    return (new Yytoken(40,str,yyline,yychar,yychar + str.length()));
}

<YYINITIAL> \"{STRING_TEXT} {
    String str = yytext().substring(1,yytext().length());
    Utility.error(Utility.E_UNCLOSEDSTR);
    Utility.assert(str.length() == yytext().length() - 1);
    return (new Yytoken(41,str,yyline,yychar,yychar + str.length()));
}
<YYINITIAL> {DIGIT}+ {
    return (new Yytoken(42,yytext(),yyline,yychar,yychar + yytext().length()));
}
<YYINITIAL> {ALPHA}({ALPHA}|{DIGIT}|_)* {
    return (new Yytoken(43,yytext(),yyline,yychar,yychar + yytext().length()));
}
<YYINITIAL,COMMENT> . {
    System.out.println("Illegal character: <" + yytext() + ">");
    Utility.error(Utility.E_UNMATCHED);
}