Scanning & Parsing with Lex and YACC • Hans-Arno Jacobsen, ECE 297 • Can we generate code to support mundane coding tasks and save time? • Powerful, but not easy • Gives you an example for Milestone 1
Submissions: 99 • Average for A2: 71% • Early submission bonus: 1 • Full marks: 5 • 16 teams attempted nonce bonus • 7 got full marks • 7 teams attempted ACC bonus • 7 got full marks
CoursePeer – try it out! • Developed by a former ECE297 student • Many of the videos under tips & tricks are from him too • Short video about CoursePeer • To sign up and auto-enrol under ECE297, use this link • http://www.crspr.com/?rid=339 • Will have a quick demo and use it on Wednesday for our Q&A session
Know your tools! • Can we generate code based on a specification of what we want? • Is the specification simpler than writing a program for doing the same task? • Fully automated program generation has been a dream since the early days of computing.
Where do we need parsing in the storage server? • Configuration file (file) • Bulk loading of data files (file) • Protocol messages (network) • Command line arguments (string)
Parsing
default.conf:
    server_host localhost
    server_port 1111
    table marks
    data_directory ./data
default.conf, the way the disk may see it:
    server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF
Tokens:
    PROPERTY VALUE PROPERTY VALUE (TABLE TABLE-NAME)+ PROPERTY VALUE
Scenarios: where would we like to save time by writing a quick language processor?
Conceptually speaking • Languages: data description language, script language, markup language • System configurations • Workload generation
In our storage servers • Data schema & data • Query language • Output formatting (Web, LaTeX, PDF, Word, Excel) • Storage server configuration • Benchmarking
Parser generation from 30K feet • Specification (written by developer) -> Generator -> Generated code • Generated code + Other code (written by developer) -> Compiler / Linker -> Executable
Scanning & parsing I • Input: server_host localhost \n server_port 1111 \n table marks \n # This data … • Scanning: turns the input into tokens, PROPERTY VALUE PROPERTY VALUE … • Parsing: checks the tokens against PROPERTY VALUE PROPERTY VALUE (TABLE TABLE-NAME)+ PROPERTY VALUE • Processing: verify content, add to data structures, …
Regular expressions • Patterns: (TABLE TABLE-NAME)+ matches TABLE TABLE-NAME, TABLE TABLE-NAME TABLE TABLE-NAME, … • Regular expressions (formal languages) • Extended regular expressions (UNIX)
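As a rough illustration of the (TABLE TABLE-NAME)+ pattern, the sketch below checks a string against an extended regular expression using the POSIX regcomp/regexec calls. The function name is_table_list and the concrete spelling of the pattern (the literal word table, one blank, a name made of letters, a newline) are assumptions made for this example; they are not part of the course code.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `text` consists entirely of one or more "table <name>\n"
 * lines, 0 if it does not, and -1 if the pattern fails to compile.
 * The concrete pattern is an assumption chosen for illustration. */
int is_table_list(const char *text)
{
    /* REG_NEWLINE is not set, so ^ and $ anchor the whole string and
     * the newline is matched like any other character. */
    const char *pattern = "^(table [a-zA-Z]+\n)+$";
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Calling is_table_list("table marks\n") should report a match, while an empty string should not, because + requires at least one repetition.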
Scanning & parsing II • Parsing is really two steps • Scanning (a.k.a. tokenizing or lexical analysis) • Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules) • flex is the scanner generator (open source) • Fast Lex for lexical analysis • YACC is the parser generator • Yet Another Compiler Compiler for structural and syntax analysis • Lex and YACC work together • The generated parser calls the generated scanner whenever it needs the next token • We use flex (fast Lex) and Bison (GNU YACC) • There are many other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)
Objectives for today • Cover the basics of Lex & Yacc • Everybody should have an appreciation of the potential of these tools • There is a lot more detail that remains unsaid • To challenge you
Lex & YACC overview • Input stream: server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF • Lexical analyzer: input stream -> token stream (PROPERTY VALUE PROPERTY VALUE …) • Structural analyzer: token stream -> output defined by actions in parser specification (often an in-memory representation of input)
Lex introduction • Synonyms: lexical analyzer, scanner, lexer, tokenizer • flex is fast Lex • You generate the lexical analyzer by using flex: input specification (*.l) -> flex -> lex.yy.c (you can control the name of the generated file) -> C compiler -> lexical analyzer • The lexical analyzer turns the input stream into a token stream
Lex • Input specification for lex: the “program” • Three parts: Definitions, Rules, User code • Use “%%” as a delimiter between parts • First part: Definitions • Options used by flex inside the scanner • Defines variables & macros • Code within “%{” and “%}” is copied directly into the scanner (e.g., global variables, header files) • Second part: Rules • Patterns and corresponding actions • An action is executed when its pattern matches • Patterns are defined by regular expressions
Parsing the configuration file of Milestone 1 (config_parser.l)
Definitions (shorthands for use below):
    %{
    #include "config_parser.tab.h"
    ...
    %}
    a2Z     [a-zA-Z]
    host    server_host
    port    server_port
    dir     data_directory
Rules (pattern on the left, action on the right):
    %%
    {host}      { return HOST_PROPERTY; }
    {port}      { return PORT_PROPERTY; }
    table       { return TABLE; }
    {dir}       { return DDIR_PROPERTY; }
    [\t\n ]+    { }
    #.*\n       { }
    {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
    [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
    .           { return yytext[0]; }
    …
flex pattern matching principles • Actions are executed when patterns match • Tokens are returned to caller; next pattern … • Each input character or string is matched only once • Input stream is consumed • flex executes the action for the longest possible matching input • Order of patterns in the spec. is important: if several patterns match the same longest input, the one listed first is used
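The longest-match and rule-order principles can be sketched by hand in plain C. This is not how flex is implemented (flex compiles all patterns into one automaton); it only mimics the selection policy. The token names, the three rules, and the function next_token are hypothetical and chosen for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum token { TOK_NONE, TOK_TABLE, TOK_STRING, TOK_NUMBER };

/* Each rule reports how many leading characters of `s` it matches (0 = none). */
static size_t match_keyword_table(const char *s)
{
    return strncmp(s, "table", 5) == 0 ? 5 : 0;
}

static size_t match_letters(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

static size_t match_digits(const char *s)
{
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}

/* Picks the rule with the longest match at the start of `s`. On a tie the
 * earlier rule wins, because a later rule must be strictly longer to
 * replace it. Stores the match length in *len. */
enum token next_token(const char *s, size_t *len)
{
    static const struct {
        size_t (*match)(const char *);
        enum token tok;
    } rules[] = {
        { match_keyword_table, TOK_TABLE  },  /* listed first: wins ties */
        { match_letters,       TOK_STRING },
        { match_digits,        TOK_NUMBER },
    };
    enum token best = TOK_NONE;
    size_t i;

    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {
            *len = n;
            best = rules[i].tok;
        }
    }
    return best;
}
```

For the input "table marks" both the keyword rule and the letters rule match five characters, so the keyword rule wins because it is listed first. For "tables" the letters rule matches six characters and wins on length.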
flex regular expressions by example I (really: extended regular expressions)
    x           match the character 'x'
    .           any character (byte) except newline
    [xyz]       match either an 'x', a 'y', or a 'z'
    [abj-oZ]    match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
    [^A-Z]      a "negated character class", i.e., any character EXCEPT those in the class
    [^A-Z\n]    any character EXCEPT an uppercase letter or a newline
flex regular expressions by example II (r is any regular expression)
    r*          zero or more r's
    r+          one or more r's
    r?          zero or one r (that is, "an optional r")
    r{2,5}      anywhere from two to five r's
    r{2,}       two or more r's
    r{4}        exactly 4 r's
    <<EOF>>     an end-of-file
flex regular expressions • There are many more expressions, see manual • Form complex expressions • E.g.: IP address, names, … • The expression syntax is used in other tools as well (well worth learning)
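To give a feel for a more complex expression, here is a minimal sketch that tests whether a string looks like a dotted-quad IP address, using bounded repetition (r{1,3} and r{3}) through the POSIX regcomp/regexec interface. The function name looks_like_ipv4 and the pattern are assumptions for this example. The pattern deliberately does not check that each group is at most 255.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if `s` has the shape of a dotted quad (four groups of one to
 * three digits separated by dots), 0 if not, and -1 if the pattern fails
 * to compile. Value ranges (0..255) are NOT checked. */
int looks_like_ipv4(const char *s)
{
    /* One group of 1-3 digits, then exactly three ".group" repetitions. */
    const char *pattern = "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$";
    regex_t re;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A string such as "999.1.1.1" still matches, which shows the limit of a purely syntactic pattern: range checks usually belong in the action code or a later processing step.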
Parsing the configuration file of Milestone 1 (config_parser.l)
    %{
    #include "config_parser.tab.h"
    ...
    %}
    a2Z     [a-zA-Z]
    host    server_host
    port    server_port
    dir     data_directory
    %%
    {host}      { return HOST_PROPERTY; }
    {port}      { return PORT_PROPERTY; }
    table       { return TABLE; }
    {dir}       { return DDIR_PROPERTY; }
    [\t\n ]+    { }
    #.*\n       { }
    {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
    [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
    .           { return yytext[0]; }
    <<EOF>>     { return 0; }
yylval.sval and yylval.pval: user-defined variables in YACC (they convey the token value to YACC)
Sample input:
    server_host localhost
    server_port 1111
    table marks
    data_directory ./data
YACC introduction • From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar • Input specification (*.y) -> YACC -> y.tab.c (you can control the name of the generated file) -> C compiler -> syntax analyzer / parser • The parser reads a token stream, e.g., via flex • Its output is defined by actions in the parser specification
YACC • Input specification for YACC (similar to flex) • Three parts: Definitions, Rules, User code • Use “%%” as a delimiter between parts • First part: Definitions • Definition of tokens for the second part and for use by flex • Definition of variables for use by the parser code • Second part: Rules • Grammar for the parser • Third part: User code • The code in this part is copied into the parser generated by YACC
Configuration file parser Milestone 1: definition section (config_parser.y)
    %{
    #include <stdio.h>
    #include <stdlib.h>   /* malloc, used in the rule actions */
    #include <string.h>

    /* define a linked list of table names */
    struct table {
        char *table_name;
        struct table *next;
    };

    /* define a structure for the configuration information */
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };

    struct table *tl, *t;
    struct configuration *c;
    %}

    %union {
        char *sval;   // String value (user defined)
        int pval;     // Port number value (user defined)
    }
    %token <sval> STRING
    %token <pval> PORT_NUMBER
    %token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE
    %%
Configuration file parser Milestone 1: (grammar) rules section, simplified (config_parser.y)
    property_list:
        HOST_PROPERTY STRING
        PORT_PROPERTY PORT_NUMBER
        table_list
        data_directory
        ;
    table_list:
        table_list TABLE STRING
        | TABLE STRING
        ;
    data_directory:
        DDIR_PROPERTY STRING
        ;
    %%
(Grammar) Rules section, details (config_parser.y)
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };
    struct configuration *c;

    data_directory:
        DDIR_PROPERTY STRING        /* $1 = DDIR_PROPERTY, $2 = STRING */
        {
            c = (struct configuration *) malloc(sizeof(struct configuration));
            // Check c for NULL
            c->data_dir = strdup( $2 );
        }
        ;
(Grammar) Rules section, details (config_parser.y)
    struct configuration {
        char *host;
        int port;
        struct table *tlist;
        char *data_dir;
    };
    struct configuration *c;

    property_list:
        HOST_PROPERTY STRING
        PORT_PROPERTY PORT_NUMBER
        table_list
        data_directory
        {
            c->host = strdup( $2 );
            c->port = $4;
            c->tlist = tl;
        }
        ;
Configuration file parser Milestone 1: (grammar) rules section, simplified (config_parser.y)
    property_list:
        HOST_PROPERTY STRING
        PORT_PROPERTY PORT_NUMBER
        table_list
        data_directory
        ;
    table_list:
        table_list TABLE STRING     /* … TABLE STRING TABLE STRING */
        | TABLE STRING
        ;
    data_directory:
        DDIR_PROPERTY STRING
        ;
    %%
table_list is a recursive rule
• Example table specification in configuration file
    table MyCourses
    table MyMarks
    table MyFriends
• Rule
    table_list: table_list TABLE STRING | TABLE STRING ;
• Terminology: table_list is called a non-terminal; TABLE & STRING are terminals
Recursive rule execution
Rule:
    table_list: table_list TABLE STRING | TABLE STRING ;
Input:
    table MyCourses
    table MyMarks
    table MyFriends
Expansion of table_list:
    table_list TABLE STRING
    table_list TABLE STRING TABLE STRING
    TABLE STRING TABLE STRING TABLE STRING
List contents after each rule execution (newest entry first):
    table MyCourses
    table MyMarks, table MyCourses
    table MyFriends, table MyMarks, table MyCourses
Building the table list in the rule actions (config_parser.y)
    struct table {
        char *table_name;
        struct table *next;
    };
    struct table *tl, *t;

    table_list:
        table_list TABLE STRING     /* $1 = table_list, $2 = TABLE, $3 = STRING */
        {
            t = (struct table *) malloc(sizeof(struct table));
            t->table_name = strdup( $3 );
            t->next = tl;           /* new entry points at the list built so far */
            tl = t;                 /* new entry becomes the head of the list */
        }
        | TABLE STRING              /* $1 = TABLE, $2 = STRING */
        {
            tl = (struct table *) malloc(sizeof(struct table));
            tl->table_name = strdup( $2 );
            tl->next = NULL;        /* first entry ends the list */
        }
        ;
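The effect of these two actions on the list can be tried out without a parser. The sketch below is a plain C stand-in, assuming a hypothetical helper push_table that does what each action does: allocate a node, copy the name, and place the node in front of the current list. The helper names and the added NULL checks are assumptions for this example.

```c
#include <stdlib.h>
#include <string.h>

/* Same layout as the struct used in the grammar actions. */
struct table {
    char *table_name;
    struct table *next;
};

/* Mirrors both table_list alternatives: allocate a node, copy the name,
 * and put the node in front of the current list (head may be NULL).
 * Returns the new head, or NULL if allocation fails; in that case the
 * caller still owns the old list. */
struct table *push_table(struct table *head, const char *name)
{
    struct table *t = malloc(sizeof *t);
    if (t == NULL)
        return NULL;
    t->table_name = malloc(strlen(name) + 1);
    if (t->table_name == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->table_name, name);
    t->next = head;
    return t;
}

/* Releases every node and its name. */
void free_tables(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Pushing MyCourses, MyMarks, and MyFriends in file order leaves MyFriends at the head, so the finished list holds the tables in reverse order of appearance. If the original order matters, the list would have to be reversed afterwards or built by appending at the tail.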
How to invoke the parser
    int main (int argc, char **argv) {
        FILE *f;
        extern FILE *yyin;
        if (argc == 2) {
            f = fopen(argv[1], "r");
            if (!f) { … // error handling … }
            yyin = f;
            while (!feof(yyin)) {
                if (yyparse() != 0) {
                    …
                    yyerror("");
                    exit(1);    /* non-zero exit status signals the parse error */
                }
            }
            fclose(f);
        }
        …
• yylex() calls the generated scanner • By default it is called within yyparse()
In the Makefile
    lexer: config_parser.l
    	${LEX} config_parser.l
    	${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

    yaccer: config_parser.y
    	${YACC} -d config_parser.y
    	${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

    parser: config_parser.tab.o lex.yy.o
    	${CC} ${CFLAGS} ${INCLUDE} -c parser.c
    	${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
    	    config_parser.tab.o \
    	    parser.o
Benefits • Faster development compared to manual implementation • Easier to change the specification and generate a new parser than to modify 1000s of lines of code to add, change, or delete a feature • Less error-prone, as code is generated • Cost: learning curve • Invest once, amortized over a 40+ year career
If you want to know more • Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestone 3 & 4 • 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC • Lectures on Computability and Theory of Computation may also show you these algorithms
A flex specification
The header:
    %{
    #include <stdio.h>
    #include "y.tab.h"
    int c;
    extern int yylval;
    %}
The “guts” (regular expressions annotated with actions):
    %%
    " "             ;
    [a-z]           { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]           { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\b]     { c = yytext[0]; return(c); }
The header
    %{
    #include <stdio.h>
    #include "y.tab.h"
    int c;
    extern int yylval;
    %}
    %%
• int c: temporary variable(s) • yylval: special variable, set in the scanner and used in the parser, for transferring values associated with tokens to the parser • %%: dividing line between header and rules section
The rules
    %%
    " "             ;
    [a-z]           { c = yytext[0]; yylval = c - 'a'; return (LETTER); }
    [0-9]           { c = yytext[0]; yylval = c - '0'; return (DIGIT); }
    [^a-z0-9\b]     { c = yytext[0]; return(c); }
• yytext: the string associated with the token
The rules
    %%
    " "             ;
    [a-z]           { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
    [0-9]           { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
    [^a-z0-9\n]     { c = yytext[0]; return(c); }
• [a-z]: sets yylval to the character’s alphabetical order • [0-9]: sets yylval to the digit’s numerical value • Otherwise: simply returns that character; presumably it’s an operator: +*-, etc.
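What the three actions compute can be checked in isolation. The sketch below is plain C, not generated scanner code: the token codes LETTER and DIGIT are hypothetical stand-ins for the values that y.tab.h would define, classify is a name invented for this example, and yylval is declared locally as a stand-in for the variable shared with the parser. It assumes a character set in which the lowercase letters are contiguous, as in ASCII.

```c
/* Hypothetical token codes; in a real build they come from y.tab.h. */
enum { LETTER = 258, DIGIT = 259 };

/* Stand-in for the variable the scanner shares with the parser. */
int yylval;

/* Mirrors the three actions: letters and digits set yylval and return a
 * token code; any other character is returned as itself. */
int classify(int c)
{
    if (c >= 'a' && c <= 'z') {
        yylval = c - 'a';   /* 'a' -> 0, 'b' -> 1, ... (assumes ASCII) */
        return LETTER;
    }
    if (c >= '0' && c <= '9') {
        yylval = c - '0';   /* '0' -> 0, ..., '9' -> 9 */
        return DIGIT;
    }
    return c;               /* e.g. '+' is passed through unchanged */
}
```

Token codes generated by YACC lie above the range of single characters, which is what allows a character such as '+' to be returned as its own token code without clashing with LETTER or DIGIT.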
Simple example
Implement a calculator which can recognize adding or subtracting of numbers
    [linux33]% ./y_calc
    1+101
    = 102
    [linux33]% ./y_calc
    1000-300+200+100
    = 1000
    [linux33]%
Example: the Lex part
Definitions:
    %{
    #include <math.h>
    #include <stdlib.h>   /* atoi */
    #include "y.tab.h"
    extern int yylval;
    %}
Rules (pattern, then action):
    %%
    [0-9]+      { yylval = atoi(yytext); return NUMBER; }
    [\t ]+      ;               /* Do nothing for white space */
    \n          return 0;       /* End of the logic */
    .           return yytext[0];
    %%
Example: the Yacc part
Definitions:
    %{
    #include <stdio.h>    /* printf */
    %}
    %token NAME NUMBER
Rules:
    %%
    statement:  NAME '=' expression
             |  expression              { printf("= %d\n", $1); }
             ;
    expression: expression '+' NUMBER   { $$ = $1 + $3; }
             |  expression '-' NUMBER   { $$ = $1 - $3; }
             |  NUMBER                  { $$ = $1; }
             ;
Link with the Yacc library (-ly)
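For comparison, here is a hand-written sketch of what the expression rule computes. Because the rule is left-recursive, the operands are combined strictly left to right, and the loop below does the same. This is not the code that YACC generates; the function names eval_line and read_number are invented for this example, and overflow is not handled.

```c
#include <ctype.h>

/* Skips blanks and tabs, then reads one unsigned decimal number.
 * Returns 1 and advances *p on success, 0 if no digit is found. */
static int read_number(const char **p, int *value)
{
    while (**p == ' ' || **p == '\t')
        (*p)++;
    if (!isdigit((unsigned char)**p))
        return 0;
    *value = 0;
    while (isdigit((unsigned char)**p)) {
        *value = *value * 10 + (**p - '0');
        (*p)++;
    }
    return 1;
}

/* Evaluates NUMBER (('+' | '-') NUMBER)* from left to right, stopping at
 * a newline or the end of the string. Returns 0 and stores the value in
 * *result on success, -1 on a syntax error. */
int eval_line(const char *line, int *result)
{
    const char *p = line;
    int total, operand;

    if (!read_number(&p, &total))       /* expression: NUMBER */
        return -1;
    for (;;) {
        char op;
        while (*p == ' ' || *p == '\t')
            p++;
        if (*p == '\0' || *p == '\n')
            break;
        op = *p++;
        if (op != '+' && op != '-')
            return -1;
        if (!read_number(&p, &operand)) /* expression: expression op NUMBER */
            return -1;
        total = (op == '+') ? total + operand : total - operand;
    }
    *result = total;
    return 0;
}
```

Even for this tiny language the hand-written version needs explicit whitespace skipping and error paths, which the Lex and Yacc specifications express in a handful of rules.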