
Scanning & Parsing with Lex and YACC


gautam

Presentation Transcript


  1. Scanning & Parsing with Lex and YACC
     Hans-Arno Jacobsen, ECE 297
     Can we generate code to support mundane coding tasks and save time? Powerful, but not easy. This lecture gives you an example for Milestone 1.
     A2 results:
     • Submissions: 99
     • Average for A2: 71%
     • Early submission bonus: 1
     • Full marks: 5
     • 16 teams attempted the nonce bonus; 7 got full marks
     • 7 teams attempted the ACC bonus; 7 got full marks

  2. CoursePeer: try it out!
     • Developed by a former ECE297 student
     • Many of the videos under tips & tricks are from him too
     • Short video about CoursePeer
     • To sign up and auto-enrol under ECE297, use this link: http://www.crspr.com/?rid=339
     • We will have a quick demo and use it on Wednesday for our Q&A session

  3. Know your tools!
     • Can we generate code based on a specification of what we want?
     • Is the specification simpler than writing a program for doing the same task?
     • Fully automated program generation has been a dream since the early days of computing.

  4. Where do we need parsing in the storage server?

  5. Where do we need parsing in the storage server?
     • Configuration file (file)
     • Bulk loading of data files (file)
     • Protocol messages (network)
     • Command line arguments (string)

  6. Parsing
     default.conf:
         server_host localhost
         server_port 1111
         table marks
         data_directory ./data
     Tokens:
         PROPERTY VALUE PROPERTY VALUE (TABLE TABLE-NAME)+ PROPERTY VALUE
     default.conf, the way the disk may see it:
         server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF

  7. Scenarios: where would we like to save time in writing a quick language processor?
     Conceptually speaking:
     • Languages
       • Data description language
       • Script language
       • Markup language
     • System configurations
     • Workload generation
     In our storage servers:
     • Languages
       • Data schema & data
       • Query language
       • Output formatting (Web, LaTeX, PDF, Word, Excel)
     • Storage server configuration
     • Benchmarking

  8. Parser generation from 30K feet
     [Diagram: a specification (written by the developer) is fed to a generator, which produces generated code. The generated code and other code (written by the developer) go through the compiler / linker, which produces the executable.]

  9. Scanning & parsing I
     • Input: server_host localhost \n server_port 1111 \n table marks \n # This data …
     • Scanning: turns the input into tokens: PROPERTY VALUE PROPERTY VALUE …
     • Parsing: checks the token sequence against the expected structure: PROPERTY VALUE PROPERTY VALUE (TABLE TABLE-NAME)+ PROPERTY VALUE
     • Processing: verify content, add to data structures, …
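The scanning step above can be sketched by hand in C to show what the generated scanner takes off your plate. This is a minimal sketch, not the course's code: `next_word` is a hypothetical helper, and it only splits the input into words while skipping whitespace and `#` comments (it does not classify words as PROPERTY or VALUE).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of the lecture code): copies the next
 * whitespace-delimited word from *cursor into buf (at most cap - 1
 * characters) and advances *cursor past it. Whitespace and '#' comments
 * that start where a word would start are skipped up to the end of the line.
 * Returns 1 if a word was found, 0 at end of input. */
int next_word(const char **cursor, char *buf, size_t cap)
{
    const char *p = *cursor;
    size_t n = 0;

    if (cap == 0)
        return 0;

    for (;;) {
        while (*p != '\0' && isspace((unsigned char)*p))
            p++;
        if (*p != '#')
            break;
        while (*p != '\0' && *p != '\n')   /* drop the comment text */
            p++;
    }

    if (*p == '\0') {
        *cursor = p;
        return 0;
    }

    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < cap)                   /* truncate words that do not fit */
            buf[n++] = *p;
        p++;
    }
    buf[n] = '\0';
    *cursor = p;
    return 1;
}
```

Calling `next_word` repeatedly on the contents of default.conf would yield `server_host`, `localhost`, `server_port`, `1111`, and so on. Every new token kind means more hand-written code of this sort, which is the work a scanner generator is meant to remove.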

  10. Regular expressions
      The pattern (TABLE TABLE-NAME)+ matches:
      • TABLE TABLE-NAME
      • TABLE TABLE-NAME TABLE TABLE-NAME
      • …
      Two flavours:
      • Regular expressions (formal languages)
      • Extended regular expressions (UNIX)

  11. Scanning & parsing II
      • Parsing is really two steps
        • Scanning (a.k.a. tokenizing or lexical analysis)
        • Parsing, i.e., analysis of structure and syntax according to a grammar (i.e., a set of rules)
      • flex is the scanner generator (open source)
        • Fast Lex, for lexical analysis
      • YACC is the parser generator
        • Yet Another Compiler Compiler, for structural and syntax analysis
      • Lex and YACC work together
        • The generated scanner supplies the tokens that the generated parser consumes
      • We use flex (Fast Lex) and Bison (GNU YACC)
      • There is a myriad of other tools for Java, C++, …, some of which combine Lex/Yacc into one tool (e.g., javacc)

  12. Objectives for today
      • Cover the basics of Lex & Yacc
      • Everybody should have an appreciation of the potential of these tools
      • There is a lot more detail that remains unsaid
      • To challenge you

  13. Lex & YACC overview
      • Input stream: server_host localhost \n server_port 1111 \n table marks \n # This data directory may be an absolute or relative path. \n data_directory ./data \n\n\n \EOF
      • Lexical Analyzer: input stream -> token stream (PROPERTY VALUE PROPERTY VALUE …)
      • Structural Analyzer: token stream -> output defined by actions in the parser specification (often an in-memory representation of the input)

  14. Lexical Analysis with Lex

  15. Lex introduction
      • Synonyms: lexical analyzer, scanner, lexer, tokenizer
      • flex is Fast Lex
      • You generate the lexical analyzer by using flex:
        input specification (*.l) -> flex -> lex.yy.c -> C compiler -> Lexical Analyzer
      • You can control the name of the generated file (lex.yy.c by default)
      • The lexical analyzer turns an input stream into a token stream

  16. Lex: input specification for lex (the “program”)
      • Three parts: Definitions, Rules, User code
      • Use “%%” as a delimiter for each part
      • First part: Definitions
        • Options used by flex inside the scanner
        • Defines variables & macros
        • Code within “%{” and “%}” is copied directly into the scanner (e.g., global variables, header files)
      • Second part: Rules
        • Patterns and corresponding actions
        • Actions are executed when the corresponding pattern(s) match
        • Patterns are defined by regular expressions

  17. Parsing the configuration file of Milestone 1 (config_parser.l)
      Definitions, with shorthands for use in the rules below:
          %{
          #include "config_parser.tab.h"
          ...
          %}
          a2Z     [a-zA-Z]
          host    server_host
          port    server_port
          dir     data_directory
      Rules, each a pattern followed by its action:
          %%
          {host}      { return HOST_PROPERTY; }
          {port}      { return PORT_PROPERTY; }
          table       { return TABLE; }
          {dir}       { return DDIR_PROPERTY; }
          [\t\n ]+    { }
          #.*\n       { }
          {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
          [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
          .           { return yytext[0]; }
          …

  18. flex pattern matching principles
      • Actions are executed when patterns match
        • Tokens are returned to the caller; on the next call, scanning continues with the next pattern …
      • Patterns match a given input character or string only once
        • The input stream is consumed
      • flex executes the action for the longest possible matching input
      • Order of patterns in the spec. is important: when two patterns match the same amount of input, the one listed first wins
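The longest-match and rule-order principles above can be illustrated with a small hand-written C sketch. It is not how flex works internally (flex generates a table-driven automaton); it only mimics the decision for two rules, the literal `table` listed first and a letters-only pattern listed second. The names `pick_rule`, `RULE_TABLE`, and `RULE_STRING` are made up for this illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum rule { RULE_NONE, RULE_TABLE, RULE_STRING };

/* Length of the literal "table" at the start of s, or 0 if it is absent. */
static size_t match_table(const char *s)
{
    return strncmp(s, "table", 5) == 0 ? 5 : 0;
}

/* Length of the leading run of letters in s (the pattern [a-zA-Z]+). */
static size_t match_letters(const char *s)
{
    size_t n = 0;
    while (isalpha((unsigned char)s[n]))
        n++;
    return n;
}

/* Picks the rule whose match is longest; on a tie, the rule listed first
 * (RULE_TABLE) wins. Stores the matched length in *len. */
enum rule pick_rule(const char *s, size_t *len)
{
    size_t a = match_table(s);    /* rule listed first  */
    size_t b = match_letters(s);  /* rule listed second */

    if (a == 0 && b == 0) {
        *len = 0;
        return RULE_NONE;
    }
    if (a >= b) {                 /* equal lengths: earlier rule wins */
        *len = a;
        return RULE_TABLE;
    }
    *len = b;
    return RULE_STRING;
}
```

For the input `table marks` both rules match five characters, so the earlier rule is chosen. For `tables` the letters-only rule matches six characters and wins despite being listed later. This is why keyword rules are usually placed before the general string rule in a flex specification.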

  19. flex regular expressions by example I (really: extended regular expressions)
          x           match the character 'x'
          .           any character (byte) except newline
          [xyz]       match either an 'x', a 'y', or a 'z'
          [abj-oZ]    match an 'a', a 'b', any letter from 'j' through 'o', or a 'Z'
          [^A-Z]      a "negated character class", i.e., any character EXCEPT those in the class
          [^A-Z\n]    any character EXCEPT an uppercase letter or a newline

  20. flex regular expressions by example II (r is any regular expression)
          r*          zero or more r's
          r+          one or more r's
          r?          zero or one r (that is, "an optional r")
          r{2,5}      anywhere from two to five r's
          r{2,}       two or more r's
          r{4}        exactly 4 r's
          <<EOF>>     an end-of-file

  21. flex regular expressions
      • There are many more expressions, see the manual
      • Form complex expressions
        • E.g.: IP address, names, …
      • The expression syntax is used in other tools as well (well worth learning)
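As an example of a more complex expression, and of the same syntax being used in other tools, the following sketch checks the shape of an IPv4 address with the POSIX `regex.h` API (`regcomp`, `regexec`, `regfree`) and an extended regular expression. The function name `looks_like_ipv4` and the pattern are assumptions made for this illustration. The pattern checks only the shape (four groups of one to three digits) and does not verify that each group is at most 255.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if text has the shape d.d.d.d where each d is 1 to 3 digits,
 * and 0 otherwise (also 0 if the pattern fails to compile).
 * Note: values above 255 are not rejected by this pattern. */
int looks_like_ipv4(const char *text)
{
    regex_t re;
    int ok;

    /* "\\." in C source is the regex "\.", i.e., a literal dot. */
    if (regcomp(&re, "^[0-9]{1,3}(\\.[0-9]{1,3}){3}$",
                REG_EXTENDED | REG_NOSUB) != 0)
        return 0;

    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

The interval operators `{1,3}` and `{3}` are the ones listed on the previous slide. In a flex specification the same expression would be written without the `^` and `$` anchors and without the doubled backslash needed in a C string literal.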

  22. Parsing the configuration file of Milestone 1 (config_parser.l)
          %{
          #include "config_parser.tab.h"
          ...
          %}
          a2Z     [a-zA-Z]
          host    server_host
          port    server_port
          dir     data_directory
          %%
          {host}      { return HOST_PROPERTY; }
          {port}      { return PORT_PROPERTY; }
          table       { return TABLE; }
          {dir}       { return DDIR_PROPERTY; }
          [\t\n ]+    { }
          #.*\n       { }
          {a2Z}*      { yylval.sval = strdup(yytext); return STRING; }
          [0-9]+      { yylval.pval = (int) atoi(yytext); return PORT_NUMBER; }
          .           { return yytext[0]; }
          <<EOF>>     { return 0; }
      yylval is a user-defined variable in YACC (it conveys the token value to YACC).
      Example input:
          server_host localhost
          server_port 1111
          table marks
          data_directory ./data

  23. Parsing with Yacc

  24. YACC introduction
      • input specification (*.y) -> YACC -> y.tab.c -> C compiler -> syntax analyzer / parser
      • You can control the name of the generated file (y.tab.c by default)
      • The parser reads a token stream, e.g., via flex
      • Its output is defined by actions in the parser specification
      • From the specified grammar, YACC generates a parser which recognizes “sentences” according to the grammar

  25. YACC: input specification for YACC (similar to flex)
      • Three parts: Definitions, Rules, User code
      • Use “%%” as a delimiter for each part
      • First part: Definitions
        • Definition of tokens for the second part and for use by flex
        • Definition of variables for use by the parser code
      • Second part: Rules
        • Grammar for the parser
      • Third part: User code
        • The code in this part is copied into the parser generated by YACC

  26. Configuration file parser, Milestone 1: definition section (config_parser.y)
          %{
          #include <string.h>
          #include <stdio.h>
          #include <stdlib.h>   /* malloc, used in the rule actions */

          struct table *tl, *t;
          struct configuration *c;

          /* define a linked list of table names */
          struct table {
              char *table_name;
              struct table *next;
          };

          /* define a structure for the configuration information */
          struct configuration {
              char *host;
              int port;
              struct table *tlist;
              char *data_dir;
          };

  27. Configuration file parser, Milestone 1: definition section cont'd. (config_parser.y)
          %}
          %union{
              char *sval;   // String value (user defined)
              int pval;     // Port number value (user defined)
          }
          %token <sval> STRING
          %token <pval> PORT_NUMBER
          %token HOST_PROPERTY PORT_PROPERTY DDIR_PROPERTY TABLE
          %%

  28. Configuration file parser, Milestone 1: (grammar) rules section, simplified (config_parser.y)
          property_list:
                HOST_PROPERTY STRING
                PORT_PROPERTY PORT_NUMBER
                table_list
                data_directory
              ;
          table_list:
                table_list TABLE STRING
              | TABLE STRING
              ;
          data_directory:
                DDIR_PROPERTY STRING
              ;
          %%

  29. (Grammar) rules section, details (config_parser.y)
          struct configuration {
              char *host;
              int port;
              struct table *tlist;
              char *data_dir;
          };
          struct configuration *c;

          data_directory:
              DDIR_PROPERTY STRING
              {
                  c = (struct configuration *) malloc(sizeof(struct configuration));
                  // Check c for NULL
                  c->data_dir = strdup( $2 );
              }
              ;
      In the action, $1 refers to DDIR_PROPERTY and $2 to STRING.

  30. (Grammar) rules section, details (config_parser.y)
          struct configuration {
              char *host;
              int port;
              struct table *tlist;
              char *data_dir;
          };
          struct configuration *c;

          property_list:
              HOST_PROPERTY STRING
              PORT_PROPERTY PORT_NUMBER
              table_list
              data_directory
              {
                  c->host = strdup( $2 );
                  c->port = $4;
                  c->tlist = tl;
              }
              ;

  31. Configuration file parser, Milestone 1: (grammar) rules section, simplified (config_parser.y)
          property_list:
                HOST_PROPERTY STRING
                PORT_PROPERTY PORT_NUMBER
                table_list
                data_directory
              ;
          table_list:
                table_list TABLE STRING
              | TABLE STRING
              ;
          data_directory:
                DDIR_PROPERTY STRING
              ;
          %%
      table_list matches a sequence such as: … TABLE STRING TABLE STRING

  32. table_list is a recursive rule
      • Example table specification in the configuration file:
          table MyCourses
          table MyMarks
          table MyFriends
      • The rule:
          table_list: table_list TABLE STRING | TABLE STRING ;
      • Terminology
        • table_list is called a non-terminal
        • TABLE & STRING are terminals

  33. Recursive rule execution
          table_list: table_list TABLE STRING | TABLE STRING ;
      Input:
      • table MyCourses
      • table MyMarks
      • table MyFriends
      List contents after each match of the rule:
      • TABLE STRING (table MyCourses): MyCourses
      • table_list TABLE STRING (table MyMarks): MyMarks, MyCourses
      • table_list TABLE STRING (table MyFriends): MyFriends, MyMarks, MyCourses
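What the recursive rule accepts can be sketched as a plain loop in C: one TABLE STRING pair, followed by any number of further pairs. This is only an illustration of the language the rule describes, not of how the YACC-generated parser works internally (that is a table-driven shift/reduce parser). The enum and function names are made up for this sketch.

```c
#include <stddef.h>

/* Hypothetical token kinds for this sketch only. */
enum cfg_token { CFG_TABLE, CFG_STRING, CFG_OTHER };

/* Counts the TABLE STRING pairs at the start of toks[0..n-1].
 * A result of 0 means the tokens do not start with a table_list;
 * a result of k >= 1 means the rule matched k table entries. */
size_t match_table_list(const enum cfg_token *toks, size_t n)
{
    size_t pairs = 0;
    size_t i = 0;

    while (i + 1 < n && toks[i] == CFG_TABLE && toks[i + 1] == CFG_STRING) {
        pairs++;     /* corresponds to one application of the rule */
        i += 2;
    }
    return pairs;
}
```

For the three-table input on the slide, the token sequence is TABLE STRING TABLE STRING TABLE STRING and the function returns 3: the first pair corresponds to the non-recursive alternative, the other two to the recursive one.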

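  34. Building the table list (config_parser.y)
          struct table {
              char *table_name;
              struct table *next;
          };
          struct table *tl, *t;

          table_list:
              table_list TABLE STRING
              {
                  t = (struct table *) malloc(sizeof(struct table));
                  t->table_name = strdup( $3 );
                  t->next = tl;
                  tl = t;
              }
              | TABLE STRING
              {
                  tl = (struct table *) malloc(sizeof(struct table));
                  tl->table_name = strdup( $2 );
                  tl->next = NULL;
              }
              ;
      • First alternative: $1 is table_list, $2 is TABLE, $3 is STRING. The new table is linked in front of the list (t->next = tl) and becomes the new head (tl = t).
      • Second alternative: $1 is TABLE, $2 is STRING. The first table becomes the head of the list, with tl->next = NULL.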
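Because each action links the new table in front of the list, the finished list holds the tables in the reverse of their order in the configuration file. The following standalone C sketch reproduces that behaviour outside of YACC. `prepend_table` and `free_tables` are hypothetical helpers written for this illustration; unlike the slide code, they check the result of `malloc`.

```c
#include <stdlib.h>
#include <string.h>

struct table {
    char *table_name;
    struct table *next;
};

/* Prepends a copy of name to the list and returns the new head.
 * If allocation fails, the list is left unchanged and the old head
 * is returned. */
struct table *prepend_table(struct table *head, const char *name)
{
    struct table *t = malloc(sizeof *t);
    if (t == NULL)
        return head;

    t->table_name = malloc(strlen(name) + 1);
    if (t->table_name == NULL) {
        free(t);
        return head;
    }
    strcpy(t->table_name, name);

    t->next = head;   /* same linking step as t->next = tl on the slide */
    return t;         /* caller stores this as the new head (tl = t)    */
}

/* Frees every node of the list together with its name. */
void free_tables(struct table *head)
{
    while (head != NULL) {
        struct table *next = head->next;
        free(head->table_name);
        free(head);
        head = next;
    }
}
```

Prepending MyCourses, MyMarks, and MyFriends in that order leaves MyFriends at the head and MyCourses at the tail. If the original order matters, the list can be reversed once after parsing, or new nodes can be appended through a tail pointer instead.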

  35. How to invoke the parser
          int main (int argc, char **argv){
              FILE *f;
              extern FILE *yyin;
              if (argc == 2) {
                  f = fopen(argv[1],"r");
                  if (!f){
                      … // error handling
                      …
                  }
                  yyin = f;
                  while( ! feof(yyin) ) {
                      if (yyparse() != 0) {
                          …
                          yyerror("");
                          exit(1);   /* non-zero status signals the parse error */
                      };
                  }
                  fclose(f);
              }
              …
      • yylex() calls the generated scanner
      • By default, yylex() is called from within yyparse()

  36. In the Makefile
          lexer: config_parser.l
                  ${LEX} config_parser.l
                  ${CC} ${CFLAGS} ${INCLUDE} -c lex.yy.c

          yaccer: config_parser.y
                  ${YACC} -d config_parser.y
                  ${CC} ${CFLAGS} ${INCLUDE} -c config_parser.tab.c

          parser: config_parser.tab.o lex.yy.o
                  ${CC} ${CFLAGS} ${INCLUDE} -c parser.c
                  ${CC} -o p ${CFLAGS} ${INCLUDE} lex.yy.o \
                          config_parser.tab.o \
                          parser.o

  37. Benefits
      • Faster development
        • Compared to manual implementation
      • Easier to change the specification and generate a new parser
        • Than to modify 1000s of lines of code to add, change, or delete an existing feature
      • Less error-prone, as code is generated
      • Cost: learning curve
        • Invest once, amortized over a 40+ year career

  38. If you want to know more
      • Lecture, examples and some recommended reading are enough to tackle all of the parsing for Milestone 3 & 4
      • 3rd and 4th year lectures on Compilers may show you the algorithms behind & inside Lex & YACC
      • Lectures on Computability and Theory of Computation may also show you these algorithms

  39. A flex specification
      The header:
          %{
          #include <stdio.h>
          #include "y.tab.h"
          int c;
          extern int yylval;
          %}
      The “guts”, regular expressions annotated with actions:
          %%
          " "     ;
          [a-z]   { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
          [0-9]   { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
          [^a-z0-9\b]   { c = yytext[0]; return(c); }

  40. The header
          %{
          #include <stdio.h>
          #include "y.tab.h"
          int c;
          extern int yylval;
          %}
          %%
      • int c: temporary variable(s)
      • yylval: special variable shared between scanner and parser (declared extern here, defined in the generated parser); the scanner sets it to transfer the value associated with a token to the parser
      • %%: dividing line between the header and the rules section

  41. The rules
          %%
          " "     ;
          [a-z]   { c = yytext[0]; yylval = c - 'a'; return (LETTER); }
          [0-9]   { c = yytext[0]; yylval = c - '0'; return (DIGIT); }
          [^a-z0-9\b]   { c = yytext[0]; return(c); }
      • yytext: the string associated with the token

  42. The rules
          %%
          " "     ;
          [a-z]   { c = yytext[0]; yylval = c - 'a'; return(LETTER); }
          [0-9]   { c = yytext[0]; yylval = c - '0'; return(DIGIT); }
          [^a-z0-9\n]   { c = yytext[0]; return(c); }
      • [a-z] rule: sets yylval to the character’s alphabetical order
      • [0-9] rule: sets yylval to the digit’s numerical value
      • Last rule: otherwise simply returns that character; presumably it’s an operator: +, *, -, etc.

  43. Simple example
      Implement a calculator which can recognize adding or subtracting of numbers:
          [linux33]% ./y_calc
          1+101
          = 102
          [linux33]% ./y_calc
          1000-300+200+100
          = 1000
          [linux33]%

  44. Example: the Lex part
      Definitions:
          %{
          #include <math.h>
          #include "y.tab.h"
          extern int yylval;
          %}
      Rules (pattern, then action):
          %%
          [0-9]+   { yylval = atoi(yytext); return NUMBER; }
          [\t ]+   ;  /* Do nothing for white space */
          \n       return 0;  /* End of the logic */
          .        return yytext[0];
          %%

  45. Example: the Yacc part
      Definitions:
          %token NAME NUMBER
      Rules:
          %%
          statement: NAME '=' expression
              | expression             { printf("= %d\n", $1); }
              ;
          expression: expression '+' NUMBER   { $$ = $1 + $3; }
              | expression '-' NUMBER         { $$ = $1 - $3; }
              | NUMBER                        { $$ = $1; }
              ;
      Include the Yacc library when linking (-ly).
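To see what the expression rules compute, here is a hand-written C sketch that evaluates the same kind of input, a NUMBER followed by any number of `+ NUMBER` or `- NUMBER` parts, strictly from left to right, as the left-recursive grammar does. It is a stand-in written for comparison, not the code Lex and Yacc generate. `eval_expression` and its helpers are hypothetical names, and integer overflow is not checked.

```c
#include <ctype.h>
#include <stddef.h>

/* Skips blanks and tabs, like the [\t ]+ rule in the Lex part. */
static const char *skip_blanks(const char *s)
{
    while (*s == ' ' || *s == '\t')
        s++;
    return s;
}

/* Reads a run of digits (the NUMBER token). Returns a pointer just past
 * the number, or NULL if s does not start with a digit. */
static const char *read_number(const char *s, int *out)
{
    int v = 0;

    if (!isdigit((unsigned char)*s))
        return NULL;
    while (isdigit((unsigned char)*s)) {
        v = v * 10 + (*s - '0');
        s++;
    }
    *out = v;
    return s;
}

/* Evaluates NUMBER (('+' | '-') NUMBER)* up to a newline or end of string.
 * Returns 1 and stores the value in *result on success, 0 on a syntax error. */
int eval_expression(const char *s, int *result)
{
    int total, n;

    s = read_number(skip_blanks(s), &total);   /* expression: NUMBER */
    if (s == NULL)
        return 0;

    for (;;) {
        char op;

        s = skip_blanks(s);
        op = *s;
        if (op == '\0' || op == '\n')           /* end of the statement */
            break;
        if (op != '+' && op != '-')
            return 0;

        s = read_number(skip_blanks(s + 1), &n);
        if (s == NULL)
            return 0;
        /* expression: expression '+' NUMBER | expression '-' NUMBER */
        total = (op == '+') ? total + n : total - n;
    }

    *result = total;
    return 1;
}
```

With the inputs from the sample session, `1+101` evaluates to 102 and `1000-300+200+100` to 1000. Adding multiplication or parentheses to this version means restructuring the loop, whereas in the Yacc specification it would mean adding a few grammar rules.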
