590 likes | 994 Views
JavaCC. Programaci ón de Sistemas. Que es un generador de parsers?. Scanner. Parser. asignación. Total. =. Expr. Parser generator (JavaCC). id + id. precio. iva. Especificación lexica+gramatical. JavaCC.
E N D
JavaCC Programación de Sistemas
Que es un generador de parsers? Scanner Parser asignación Total = Expr Parser generator (JavaCC) id + id precio iva Especificación lexica+gramatical
JavaCC • JavaCC (Java Compiler Compiler) es un generador de scanner y parser • Producir un scanner y/o parser escrito en java, mismo que está escrito en Java; • Hay muchos generadores de parsers • yacc (Yet Another Compiler-Compiler) para el lenguaje de programación C • Bison de gnu.org • Hay también muchos generadores de parsers escritos en Java¿ • JavaCUP; • ANTLR; • SableCC
Más sobre la clasificación de generadores de parsers en java • Herramientas generadoras de Parsers ascendentes. • JavaCUP; • jay, YACC for Java www.inf.uos.de/bernd/jay • SableCC, The Sable Compiler Compiler www.sablecc.org • Herramientas generadoras de Parsers descendentes • ANTLR, Another Tool for Language Recognition www.antlr.org • JavaCC, Java Compiler Compiler www.webgain.com/java_cc
Características de JavaCC • Generador de Parsers descendentes LL(K) • Especificación Lexica y gramática en un archivo • Procesador Tree Building • con JJTree • Extremadamente Ajustable • Muchas opciones diferentes seleccionables • Generación de Documentación • Utilizando JJDoc • Internacionalización • Puede manejar unicode completo • Lookahead Sintáctico y Semántico
Características de JavaCC (cont.) • Permite especificaciones extendidas en BNF • Puede utilizar | * ? + () en RHS. • Estados y acciones Lexicas. • Análisis léxico sensitivo a mayúsculas y minúsculas • Capacidad de depuración extensiva • Tokens especiales • Reporteador de error muy bueno
Instalación de JavaCC • Descargar el archivo javacc-3.X.zip desde https://javacc.dev.java.net/ • Seguir el enlace que dice Download o ir directamente a https://javacc.dev.java.net/servlets/ProjectDocumentList • unzip javacc-3.X.zip en un directorio %JCC_HOME% • add %JCC_HOME\bin directory to your %path%. • javacc, jjtree, jjdoc may now be invoked directly from the command line.
Pasos para usar JavaCC • Escribir una especificación JavaCC (.jj file) • Define la gramática y acciones en un archivo (digamos, calc.jj) • Ejecutar javaCC para generar un scanner y un parser • javacc calc.jj • Generará el parser, scanner, token,… java sources • Escribe el programa que utilice el parser • Por ejemplo, UseParser.java • Compilar y ejecutar el programa • javac -classpath . *.java • java -cp . mainpackage.MainClass
Ejemplo 1 • Grammar : re.jj • Ejemplo • % todas las cadenas terminan en "ab" • (a|b)*ab; • aba; • ababb; • Nuestras tareas: • Por cada cadena de entrada (Linea 3,4) determinar cuando coincida con la expresión regular (linea 2). Parsear una especificación de expresiones regulares y que coincidan con las cadenas de entrada
tokens REParserTokenManager REParser MainClass resultado javaCC re.jj La película completa % comentario (a|b)*ab; a; ab;
Formato de una gramática de entrada para JavaCC • javacc_options • PARSER_BEGIN ( <IDENTIFIER>1 ) unidad_de_compilación_de_java PARSER_END ( <IDENTIFIER>2 ) • ( produccion )*
El archivo de especificación de entrada (re.jj) options { USER_TOKEN_MANAGER=false; BUILD_TOKEN_MANAGER=true; OUTPUT_DIRECTORY="./reparser"; STATIC=false; }
re.jj PARSER_BEGIN(REParser) package reparser; import java.lang.*; … import dfa.*; public class REParser { public FA tg = new FA(); // mensaje de error con la linea actual public static void msg(String s) { System.out.println("ERROR"+s); } public static void main(String args[]) throws Exception REParser reparser = new REParser(System.in); reparser.S(); } } PARSER_END(REParser)
re.jj (Definición de tokens) TOKEN : { <SYMBOL: ["0"-"9","a"-"z","A"-"Z"] > | <EPSILON: "epsilon" > | <LPAREN: "(“ > | <RPAREN: ")“ > | <OR: "|" > | <STAR: "*“ > | <SEMI: ";“ > } SKIP: { < ( [" ","\t","\n","\r","\f"] )+ > |< "%" ( ~ ["\n"] )* "\n" > { System.out.println(image); } }
re.jj (producciones) void S() : { FA d1; } { d1 = R() <SEMI> { tg = d1; System.out.println("------NFA"); tg.print(); System.out.println("------DFA"); tg = tg.NFAtoDFA(); tg.print(); System.out.println("------Minimizar"); tg = tg.minimize(); tg.print(); System.out.println("------Renumerar"); tg=tg.renumber(); tg.print(); System.out.println("------Ejecutar"); } testCases() }
re.jj void testCases() : {} { (testCase() )+ } void testCase(): { String testInput ;} { testInput = symbols() <SEMI> { tg.execute( testInput) ; } } String symbols() : {Token token = null; StringBuffer result = new StringBuffer(); } { ( token = <SYMBOL> { result.append( token.image) ; } )* { return result.toString(); } }
re.jj (expresiones regulares) // R --> RUnit | RConcat | RChoice FA R() : {FA result ;} { result = RChoice() { return result; } } FA RUnit() : { FA result ; Token d1; } { ( <LPAREN> result = RChoice() <RPAREN> |<EPSILON> { result = tg.epsilon(); } | d1 = <SYMBOL> { result = tg.symbol( d1.image ); } ) { return result ; } }
re.jj FA RChoice() : { FA result, temp ;} { result = RConcat() (<OR> temp = RConcat() { result = result.choice( temp ) ;} )* {return result ; } } FA RConcat() : { FA result, temp ;} { result = RStar() ( temp = RStar() { result = result.concat( temp ) ;} )* {return result ; } } FA RStar() : {FA result;} { result = RUnit() (<STAR> { result = result.closure();} )* { return result; } }
Formato de una gramática de entrada de JavaCC javacc_input ::=javacc_options PARSER_BEGIN (<IDENTIFIER>1) unidad_de_compilacion_de_java PARSER_END ( <IDENTIFIER>2 ) ( production )* <EOF> Codigo de color: • azul --- no-terminal • <naranja> – un tipo de token • morado --- lexema ( palabra reservada; • I.e., consistente de la literal en sí misma) • negro -- meta simbolos
Notas • <IDENTIFIER> significa cualquier identificador de Java como var, class2, … • IDENTIFIER significa solamente IDENTIFIER. • <IDENTIFIER>1 debe ser igual a <IDENTIFIER>2 • unidad_de_compilacio_de_java es cualquier codigo de java que como un todo puede aparecer legalmente en un archivo. • Debe contener una declaración de clase principal con el mismo nombre que <IDENTIFIER>1 . • Ejemplo: PARSER_BEGIN ( MiParser ) package mipackage; import miotropackage….; public class MiParser { … } class MiOtraClase { … } … PARSER_END (MiParser)
La entrada y salida de javacc (MiEspecifLeng.jj ) javacc Token.java PARSER_BEGIN ( MiParser ) package mipackage; import miotropackage….; public class MiParser { … } class MiOtraClase { … } … PARSER_END (MiParser) ParserError.java MyParser.java MyParserTokenManager.java MyParserCostant.java
Notes: • Token.java y ParseError.jar son los mismos para todas las entradas y pueden ser reutilizados. • package declaration in *.jj are copied to all 3 outputs. • import declarations in *.jj are copied to the parser and token manager files. • parser file is assigned the file name <IDENTIFIER>1 .java • The parser file has contents: …class MiParser { … //generated parser is inserted here. … } • The generated token manager provides one public method: Token getNextToken() throws ParseError;
javacc options javacc_options ::= [ options{ ( option_binding )* } ] • option_binding es de la forma : • <IDENTIFIER>3=<java_literal>; • donde <IDENTIFIER>3 no es sensible a mayúsculas y minúsculas. • Ejemplo: options { USER_TOKEN_MANAGER=true; BUILD_TOKEN_MANAGER=false; OUTPUT_DIRECTORY="./sax2jcc/personnel"; STATIC=false; }
More Options • LOOKAHEAD • java_integer_literal (1) • CHOICE_AMBIGUITY_CHECK • java_integer_literal (2) for A | B … | C • OTHER_AMBIGUITY_CHECK • java_integer_literal (1) for (A)*, (A)+ and (A)? • STATIC (true) • DEBUG_PARSER (false) • DEBUG_LOOKAHEAD (false) • DEBUG_TOKEN_MANAGER (false) • OPTIMIZE_TOKEN_MANAGER • java_boolean_literal (false) • OUTPUT_DIRECTORY (current directory) • ERROR_REPORTING (true)
More Options • JAVA_UNICODE_ESCAPE (false) • replace \u2245 to actual unicode (6 char 1 char) • UNICODE_INPUT (false) • input strearm is in unicode form • IGNORE_CASE (false) • USER_TOKEN_MANAGER (false) • generate TokenManager interface for user’s own scanner • USER_CHAR_STREAM (false) • generate CharStream.java interface for user’s own inputStream • BUILD_PARSER (true) • java_boolean_literal • BUILD_TOKEN_MANAGER (true) • SANITY_CHECK (true) • FORCE_LA_CHECK (false) • COMMON_TOKEN_ACTION (false) • invoke void CommonTokenAction(Token t) after every getNextToken() • CACHE_TOKENS (false)
Ejemplo: Figura 2.2 • if IF • [a-z][a-z0-9]* ID • [0-9]+ NUM • ([0-9]+”.”[0-9]*) | ([0-9]*”.”[0-9]+) REAL • (“--”[a-z]*”\n”) | (““|”\n” | “\t” )+ nonToken, WS • . error • Notaciones javacc • “if”or“i”“f”or [“i”][“f”] • [“a”-”z”]([“a”-”z”,”0”-”9”])* • ([“0”-”9”])+ • ([“0”-”9”])+ “.” ( [“0”-”9”] ) * | ([“0”-”9”])* ”.” ([“0”-”9”])+
Especificación JavaCC para algunos Tokens PARSER_BEGIN(MiParser) class MiParser{} PARSER_END(MiParser) /* Para la expresión regular en la derecha, se retornará el token a la izquierda */ TOKEN : { < IF: “if” > | < #DIGIT: [“0”-”9”] > |< ID: [“a”-”z”] ( [“a”-”z”] | <DIGIT>)* > |< NUM: (<DIGIT>)+ > |< REAL: ( (<DIGIT>)+ “.” (<DIGIT>)* ) | ( <DIGIT>+ “.” (<DIGIT>)* ) > }
Continuación /* Las expresiones regulares aquí serán omitidas durante el análisis léxico */ SKIP : { < ““> | <“\t”> |<“\n”> } /* como SKIP pero el texto saltado es accesible desde la acción del parser */ SPECIAL_TOKEN : { <“--” ([“a”-”z”])* (“\n” | “\r” | “\n\r” ) > } /* . Para cualquier subcadena que no coincida con la especificación léxica, javacc lanzara un error */ /* regla principal */ void start() : {} { (<IF> | <ID> |<NUM> |<REAL>)* }
La forma de una Producción java_return_typejava_identifier(java_parameter_list) : java_block {opciones_de_expansion} • Ejemplo : void XMLDocument(Logger logger): { int msg = 0; } { <StartDoc> { print(token); } Element(logger) <EndDoc> { print(token); } | else() }
Ejemplo ( Gramática ) • P L • S id := id • S while id do S • S begin L end • S if id then S • S if id then S else S • L S • L L;S 1,7,8 : P S (;S)*
JavaCC Version of Grammar 3.30 PARSER_BEGIN(MiParser) pulic class MiParser{} PARSRE_END(MiParser) SKIP : {““ | “\t” | “\n” } TOKEN: { <WHILE: “while”> | <BEGIN: “begin”> | <END:”end”> | <DO:”do”> | <IF:”if”> | <THEN : “then”> | <ELSE:”else”> | <SEMI: “;”> | <ASSIGN: “=“> |<#LETTER: [“a”-”z”]> | <ID: <LETTER>(<LETTER> | [“0”-”9”] )* > }
JavaCC Version of Grammar 3.30 (cont’d) void Prog() : { } { StmList() <EOF> } void StmList(): { } { Stm() (“;” Stm() ) * } void Stm(): { } { <ID> “=“ <ID> | “while” <ID> “do” Stm() | <BEGIN> StmList() <END> | “if” <ID> “then” Stm() [ LOOKAHEAD(1) “else” Stm() ] }
Tipos de producciones • production ::= javacode_production | regulr_expr_production | bnf_production | token_manager_decl Note: 1,3 se utilizan para definir gramáticas. 2 se usa para definir tokens 4 se usa para incrustar código en el token manager.
JAVACODE production • javacode_production ::= “JAVACODE” java-return_type iava_id “(“ java_param_list “)” java_block • Note: • Se utiliza para definir no-terminales para reconocer Used to define nonterminals for recognizing sth that is hard to parse using normal production.
Example JAVACODE JAVACODE void skip_to_matching_brace() { Token tok; int nesting = 1; while (true) { tok = getToken(1); if (tok.kind == LBRACE) nesting++; if (tok.kind == RBRACE) { nesting--; if (nesting == 0) break; } tok = getNextToken(); } }
Note: • Do not use nonterminal defined by JAVACODE at choice point without giving LOOKHEAD. • void NT() : {} { skip_to_matching_brace() | some_other_production() } • void NT() : {} { "{" skip_to_matching_brace() | "(" parameter_list() ")" }
TOKEN_MANAGER_DECLS token_manager_decls ::= TOKEN_MGR_DECLS :java_block • The token manager declarations starts with the reserved word "TOKEN_MGR_DECLS" followed by a ":" and then a set of Java declarations and statements (the Java block). • These declarations and statements are written into the generated token manager (MyParserTokenManager.java) and are accessible from within lexical actions. • There can only be one token manager declaration in a JavaCC grammar file.
regular_expression_production regular_expr_production ::= [ lexical_state_list ] regexpr_kind [ [IGNORE_CASE] ] : {regexpr_spec ( |regexpr_spec )* } • regexpr_kind::= TOKEN | SPECIAL_TOKEN | SKIP | MORE • TOKEN is used to define normal tokens • SKIP is used to define skipped tokens (not passed to later parser) • MORE is used to define semi-tokens (I.e. only part of a token). • SPECIAL_TOKEN is between TOKEN and SKIP tokens in that it is passed on to the parser and accessible to the parser action but is ignored by production rules (not counted as an token). Useful for representing comments.
lexical_state_list lexical_state_list::= < * > | <java_identifier ( ,java_identifier )* > • The lexical state list describes the set of lexical states for which the corresponding regular expression production applies. • If this is written as "<*>", the regular expression production applies to all lexical states. Otherwise, it applies to all the lexical states in the identifier list within the angular brackets. • if omitted, then a DEFAULT lexical state is assumed.
regexpr_spec regexpr_spec::= regular_expression1 [ java_block ] [ :java_identifier ] • Meaning: • When a regular_expression1 is matched then • if java_block exists then execute it • if java_identifier appears, then transition to that lexical state.
regular_expression regular_expression ::= java_string_literal | < [ [#] java_identifier: ] complex_regular_expression_choices> | <java_identifier> | <EOF> • <EOF> is matched by end-of-file character only. • (3) <java_identifier> is a reference to other labeled regular_expression. • used in bnf_production • java_string_literal is matched only by the string denoted by itself. • (2) is used to defined a labled regular_expr and not visible to outside the current TOKEN section if # occurs. • (1) for unnamed tokens
Example <DEFAULT, LEX_ST2> TOKEN IGNORE_CASE : { < FLOATING_POINT_LITERAL: (["0"-"9"])+ "." (["0"-"9"])* (<EXPONENT>)? (["f","F","d","D"])? | "." (["0"-"9"])+ (<EXPONENT>)? (["f","F","d","D"])? | (["0"-"9"])+ <EXPONENT> (["f","F","d","D"])? | (["0"-"9"])+ (<EXPONENT>)? ["f","F","d","D"] > { // do Something } : LEX_ST1 | < #EXPONENT: ["e","E"] (["+","-"])? (["0"-"9"])+ > } • Note: if # is omitted, E123 will be recognized erroneously as a token of kind EXPONENT.
Structure of complex_regular_expression • complex_regular_expression_choices::= complex_regular_expression (| complex_regular_expression )* • complex_regular_expression ::= ( complex_regular_expression_unit )* • complex_regular_expression_unit ::= java_string_literal| "<" java_identifier ">" | character_list | (complex_regular_expression_choices) [+|*|?] • Note: unit concatenation;juxtaposition complex_regular_expression choice; | complex_regular_expression_choice (.)[+|*|?] unit
character_list character_list::= [~] [ [ character_descriptor ( ,character_descriptor )* ] ] character_descriptor::= java_string_literal [ -java_string_literal ] java_string_literal ::= // reference to java grammar “singleCharString* “ note:java_sting_literalhere is restricted to length 1. ex: • ~[“a”,”b”] --- all chars but a and b. • [“a”-”f”, “0”-”9”, “A”,”B”,”C”,”D”,”E”,”F”] --- hexadecimal digit. • [“a”,”b”]+ is not a regular_expression_unit. Why ? • should be written ( [“a”,”b”] )+ instead.
bnf_production • bnf_production::= java_return_typejava_identifier "(" java_parameter_list ")" ":" java_block "{" expansion_choices "}“ • expansion_choices::= expansion ( "|" expansion )* • expansion::= ( expansion_unit )*
expansion_unit • expansion_unit::= local_lookahead | java_block | "(" expansion_choices ")" [ "+" | "*" | "?" ] | "[" expansion_choices "]" | [ java_assignment_lhs "=" ] regular_expression | [ java_assignment_lhs "=" ] java_identifier "(" java_expression_list ")“ Notes: 1 is for lookahead; 2 is for semantic action 4 = ( …)? 5 is for token match 6. is for match of other nonterminal
lookahead • local_lookahead::= "LOOKAHEAD" "(" [ java_integer_literal ] [ "," ] [ expansion_choices ] [ "," ] [ "{" java_expression "}" ] ")“ • Notes: • 3 componets: max # lookahead + syntax + semantics • examples: • LOOKHEAD(3) • LOOKAHEAD(5, Expr() <INT> | <REAL> , { true} ) • More on LOOKAHEAD • see minitutorial
JavaCC API • Non-Terminals in the Input Grammar • NT is a nonterminal => returntype NT(parameters) throws ParseError; is generated in the parser class • API for Parser Actions • Token token; • variable always holds the last token and can be used in parser actions. • exactly the same as the token returned by getToken(0). • two other methods - getToken(int i) and getNextToken() can also be used in actions to traverse the token list.