Learn how parsers read input streams to match grammar, make proper choices, and avoid backtracking in JavaCC. Explore modified grammars and default choice-determination algorithm.
Looking ahead in javacc 2/28/06
The job of a parser is to read an input stream and determine whether the input stream is in the language defined by the grammar. This can be quite time-consuming. Consider the following example:

    void Input() : {}
    {
      "a" BC() "c"
    }

    void BC() : {}
    {
      "b" [ "c" ]
    }

What's LOOKAHEAD? What strings are matched?
Matching "abc"

Step 1. Starting out, there is only one choice: the first input character must be 'a'. It is, so we are OK.
Step 2. We proceed to non-terminal BC. Again there is only one choice for the next input character: it must be 'b'. The input agrees, so we are still OK.
Step 3. We now come to a "choice point" in the grammar. We can either go inside the [...] and match it, or skip it altogether. We decide to go inside, so the next input character must be a 'c'. Again we are OK.
Step 4. We have now finished non-terminal BC and go back to non-terminal Input. The grammar says the next character must be yet another 'c', but there are no more input characters. So we have a problem.
Steps Continued

Step 5. In the general case, a problem like this means a bad choice was made somewhere earlier. Here the bad choice was made in Step 3, so we backtrack to Step 3 and make another choice.
Step 6. Having backtracked, we make the other choice available at Step 3, namely to skip the [...]. We finish non-terminal BC and go back to non-terminal Input. The grammar says the next character must be a 'c'. The next input character is a 'c', so we are OK now.
Step 7. We have reached the end of the grammar (the end of non-terminal Input) successfully. This means the string "abc" matches the grammar.

Backtracking is to be avoided!
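The seven steps above can be traced with a small hand-written matcher. This is only a sketch under stated assumptions: each character is treated as one token, and the class and method names (`BacktrackingDemo`, `matchInput`, `bcEnds`) are invented for illustration. It is not code that javacc generates.

```java
// Hypothetical sketch: a backtracking matcher for
//   Input ::= "a" BC "c"      BC ::= "b" [ "c" ]
class BacktrackingDemo {
    // Number of times a choice at the [...] had to be undone in the last call.
    static int backtracks = 0;

    // Returns true if the whole string matches Input().
    static boolean matchInput(String s) {
        backtracks = 0;
        if (!s.startsWith("a")) return false;              // Step 1
        int[] ends = bcEnds(s, 1);                         // Steps 2 and 3
        for (int i = 0; i < ends.length; i++) {
            int afterBC = ends[i];
            // Steps 4/6: Input() still needs a final "c", then end of input.
            if (s.startsWith("c", afterBC) && afterBC + 1 == s.length()) {
                return true;                               // Step 7
            }
            if (i + 1 < ends.length) backtracks++;         // Step 5: undo and retry
        }
        return false;
    }

    // Every position where BC() could end when started at 'pos'.
    // "Go inside the [...]" is tried first, as in Step 3.
    static int[] bcEnds(String s, int pos) {
        if (!s.startsWith("b", pos)) return new int[0];
        if (s.startsWith("c", pos + 1)) return new int[] { pos + 2, pos + 1 };
        return new int[] { pos + 1 };
    }
}
```

Calling `BacktrackingDemo.matchInput("abc")` succeeds only after one backtrack, while `"abcc"` succeeds on the first attempt because going inside the [...] happens to be the right choice there.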
Rethinking
• The amount of time taken is a function of how the grammar is written.
• Many grammars can be written to cover the same set of inputs, that is, the same language (there can be multiple equivalent grammars for the same input language).
• What about the grammar above?
What can be said of these?

Good:

    void Input() : {}
    {
      "a" "b" "c" [ "c" ]
    }

Ugly:

    void Input() : {}
    {
      "a" ( BC1() | BC2() )
    }

    void BC1() : {}
    {
      "b" "c" "c"
    }

    void BC2() : {}
    {
      "b" "c" [ "c" ]
    }

Bad:

    void Input() : {}
    {
      "a" "b" "c" "c"
    |
      "a" "b" "c"
    }
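To see why the first grammar is the cheap one, here is a minimal recognizer for it that never revisits a character. It is a sketch only: characters stand in for tokens, and `GoodGrammarDemo` and `matchInput` are names made up for this illustration.

```java
// Hypothetical sketch: single-pass recognizer for
//   Input ::= "a" "b" "c" [ "c" ]
class GoodGrammarDemo {
    static boolean matchInput(String s) {
        int pos = 0;
        // The fixed prefix "a" "b" "c" involves no choice at all.
        for (char expected : new char[] { 'a', 'b', 'c' }) {
            if (pos >= s.length() || s.charAt(pos) != expected) return false;
            pos++;
        }
        // The only choice point is [ "c" ]; one character of lookahead
        // decides it, and the decision never has to be undone.
        if (pos < s.length() && s.charAt(pos) == 'c') pos++;
        return pos == s.length();
    }
}
```

It accepts exactly "abc" and "abcc", the same two strings as the original Input()/BC() grammar, but every decision is final as soon as it is made.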
Looking Ahead
• The performance of general backtracking is unacceptable, so most parsers do not backtrack in this general manner (if at all). Instead, they make a decision at each choice point based on limited information and then commit to it.
• Parsers generated by javacc make decisions at choice points based on some exploration of tokens further ahead in the input stream. Once they make such a decision, they commit to it; that is, no backtracking is performed once a decision is made.
• The process of exploring tokens further ahead in the input stream is termed "looking ahead" into the input stream, hence our use of the term "LOOKAHEAD".
• Since some of these decisions may be made with less than perfect information, you need to know something about LOOKAHEAD to make your grammar work correctly.
• There are two ways to make the choice decisions work properly:
  1. Modify the grammar to make it simpler.
  2. Insert hints at the more complicated choice points to help the parser make the right choices.
Four Choice Points in javacc
• An expansion of the form ( exp1 | exp2 | ... ). The generated parser has to determine which of exp1, exp2, etc. to select to continue parsing.
• An expansion of the form ( exp )? (also written [ exp ]). The generated parser must determine whether to choose exp or to continue beyond the ( exp )? without choosing exp.
• An expansion of the form ( exp )*. The generated parser must do the same thing as in the previous case and, after each successful match of exp (if exp was chosen), make this choice determination again.
• An expansion of the form ( exp )+. This is essentially the same as the previous case, with a mandatory first match of exp.
The Default Algo

The default choice determination algorithm looks ahead 1 token in the input stream and uses it to make its choice at choice points.

    void basic_expr() : {}
    {
      <ID> "(" expr() ")"    // Choice 1
    |
      "(" expr() ")"         // Choice 2
    |
      "new" <ID>             // Choice 3
    }

The choice determination algorithm:

    if (next token is <ID>) {
      choose Choice 1
    } else if (next token is "(") {
      choose Choice 2
    } else if (next token is "new") {
      choose Choice 3
    } else {
      produce an error message
    }
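The pseudocode above translates almost line for line into Java. The sketch below is illustrative rather than actual javacc output: the `Kind` enum, the class name `DefaultChoiceDemo`, and the method `chooseBasicExpr` are assumptions made for the example.

```java
import java.util.List;

// Hypothetical sketch of the default 1-token choice determination
// for the three-alternative basic_expr().
class DefaultChoiceDemo {
    // Assumed token kinds; a real javacc parser uses generated integer constants.
    enum Kind { ID, LPAREN, RPAREN, NEW, EOF }

    // Returns the number of the chosen alternative (1, 2 or 3),
    // looking only at the single token at 'pos'.
    static int chooseBasicExpr(List<Kind> tokens, int pos) {
        Kind next = pos < tokens.size() ? tokens.get(pos) : Kind.EOF;
        switch (next) {
            case ID:     return 1;   // <ID> "(" expr() ")"
            case LPAREN: return 2;   // "(" expr() ")"
            case NEW:    return 3;   // "new" <ID>
            default:
                throw new IllegalStateException("Parse error: unexpected " + next);
        }
    }
}
```

Because the three alternatives start with three different tokens, a single token is enough to pick one, and the choice is never wrong for valid input.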
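A Modified Grammar

    void basic_expr() : {}
    {
      <ID> "(" expr() ")"    // Choice 1
    |
      "(" expr() ")"         // Choice 2
    |
      "new" <ID>             // Choice 3
    |
      <ID> "." <ID>          // Choice 4
    }

What happens on <ID>? Why? With one token of lookahead, the default algorithm always selects Choice 1 when the next token is <ID>, so Choice 4 can never be chosen. javacc reports this when it processes the grammar:

    Warning: Choice conflict involving two expansions at
             line 25, column 3 and line 31, column 3 respectively.
             A common prefix is: <ID>
             Consider using a lookahead of 2 for earlier expansion.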
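The idea behind the warning can be imitated with a simple prefix comparison. This is a deliberately simplified sketch, not how javacc actually analyses a grammar: each alternative is reduced to its leading terminal tokens only, and `ChoiceConflictDemo`, `conflicts`, and `samePrefix` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: report alternatives that cannot be told apart
// by looking at their first k tokens.
class ChoiceConflictDemo {
    // Returns pairs "i,j" (1-based) of alternatives sharing their first k tokens.
    static List<String> conflicts(String[][] alts, int k) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < alts.length; i++) {
            for (int j = i + 1; j < alts.length; j++) {
                if (samePrefix(alts[i], alts[j], k)) {
                    result.add((i + 1) + "," + (j + 1));
                }
            }
        }
        return result;
    }

    static boolean samePrefix(String[] a, String[] b, int k) {
        for (int n = 0; n < k; n++) {
            // If either list runs out of known terminals before a difference
            // is found, conservatively treat the pair as conflicting.
            if (n >= a.length || n >= b.length) return true;
            if (!a[n].equals(b[n])) return false;
        }
        return true;
    }
}
```

With the four alternatives listed as `{"<ID>", "("}`, `{"("}`, `{"new", "<ID>"}`, and `{"<ID>", "."}`, a lookahead of 1 reports the pair 1,4 and a lookahead of 2 reports nothing, which matches the warning's suggestion to use a lookahead of 2.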
Another example

    void identifier_list() : {}
    {
      <ID> ( "," <ID> )*
    }

• Suppose the first <ID> has already been matched and the parser has reached the choice point (the (...)* construct). Here is how the choice determination algorithm works:

    while (next token is ",") {
      choose the nested expansion (i.e., go into the (...)* construct)
      consume the "," token
      if (next token is <ID>) consume it, otherwise report error
    }

Note: the choice determination algorithm does not look beyond the (...)* construct.
What to do here?

    void identifier_list() : {}
    {
      <ID> ( "," <ID> )*
    }

    void funny_list() : {}
    {
      identifier_list() "," <INT>
    }

• When the default algorithm makes a choice at ( "," <ID> )*, it always goes into the (...)* construct if the next token is a ",".
• It does this even when identifier_list was called from funny_list and the token after the "," is an <INT>.
• Intuitively, the right thing to do in that situation is to skip the (...)* construct and return to funny_list.
A Concrete example

Consider the input "id1, id2, 5". The parser will complain that it encountered a 5 when it was expecting an <ID>.

Note: when you built the parser, javacc would have given you the following warning message:

    Warning: Choice conflict in (...)* construct at line 25, column 8.
             Expansion nested within construct and expansion following construct
             have common prefixes, one of which is: ","
             Consider using a lookahead of 2 or more for nested expansion.

Essentially, JavaCC is saying it has detected a situation in your grammar that may cause the default lookahead algorithm to do strange things. The generated parser will still work using the default lookahead algorithm, except that it probably does not do what you expect.
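The failure on "id1, id2, 5", and the effect of the lookahead of 2 that the warning suggests, can be reproduced with a hand-written recognizer. This is a sketch under assumptions: the input is already tokenized into a hypothetical `Kind` enum, and `FunnyListDemo` and `parseFunnyList` are invented names, not javacc output.

```java
import java.util.List;

// Hypothetical sketch of
//   funny_list      ::= identifier_list "," <INT>
//   identifier_list ::= <ID> ( "," <ID> )*
class FunnyListDemo {
    enum Kind { ID, COMMA, INT }

    // 'lookahead' is how many tokens are inspected at the (...)* choice
    // point (1 or 2). Returns null on success, otherwise an error message.
    static String parseFunnyList(List<Kind> t, int lookahead) {
        int pos = 0;
        if (!is(t, pos, Kind.ID)) return "expected <ID> at token " + pos;
        pos++;
        // Choice point: enter the (...)* only if the lookahead agrees.
        while (is(t, pos, Kind.COMMA)
                && (lookahead < 2 || is(t, pos + 1, Kind.ID))) {
            pos++;                                   // consume ","
            if (!is(t, pos, Kind.ID)) return "expected <ID> at token " + pos;
            pos++;                                   // consume <ID>
        }
        // Back in funny_list: "," <INT> must follow.
        if (!is(t, pos, Kind.COMMA)) return "expected \",\" at token " + pos;
        pos++;
        if (!is(t, pos, Kind.INT)) return "expected <INT> at token " + pos;
        pos++;
        return pos == t.size() ? null : "unexpected input at token " + pos;
    }

    static boolean is(List<Kind> t, int pos, Kind k) {
        return pos < t.size() && t.get(pos) == k;
    }
}
```

For the token sequence ID, COMMA, ID, COMMA, INT, a lookahead of 1 commits to the loop at the second comma and then fails on the INT, while a lookahead of 2 leaves the loop in time and the whole input is accepted.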
Multiple Token LOOKAHEAD Specifications
• In the majority of situations, the default algorithm works just fine. In situations where it does not work well, javacc provides you with warning messages like the ones shown above.
• If javacc processes your grammar file without producing any warnings, then the grammar is an LL(1) grammar.
• Essentially, LL(1) grammars are those that can be handled by top-down parsers (such as those generated by javacc) using at most one token of LOOKAHEAD.
• When you do get such warnings, there are two options.
LL(1)?
• When you derive the LL(1) parse table, multiple entries in a single cell (one row/column position) indicate a conflict, meaning the grammar is not LL(1).
• See www.cs.usfca.edu/galles/cs414/lecture/lecture3.java.pdf
Option 1 - Modify your grammar
• You can modify your grammar so that the warning messages go away. That is, you can attempt to make your grammar LL(1) by making some changes to it.

    void basic_expr() : {}
    {
      <ID> "(" expr() ")"    // Choice 1
    |
      "(" expr() ")"         // Choice 2
    |
      "new" <ID>             // Choice 3
    |
      <ID> "." <ID>          // Choice 4
    }

Factor out the common prefix <ID>:

    void basic_expr() : {}
    {
      <ID> ( "(" expr() ")" | "." <ID> )
    |
      "(" expr() ")"
    |
      "new" <ID>
    }
Option 2 - Provide "Hints"
• You can provide the generated parser with some hints to help it out in the non-LL(1) situations that the warning messages bring to your attention.
• All such hints are specified either by setting the global LOOKAHEAD value to a larger value or by using the LOOKAHEAD(...) construct to provide a local hint.
• Picking Option 1 or Option 2 is often a design decision. However:
  • Option 1 makes your grammar perform better. JavaCC-generated parsers can handle LL(1) constructs much faster than other constructs.
  • Option 2 gives you a simpler grammar, one that is easier to develop and maintain and that focuses on human-friendliness rather than machine-friendliness.
• Sometimes Option 2 is the only choice, especially in the presence of user actions:

    void basic_expr() : {}
    {
      { initMethodTables(); } <ID> "(" expr() ")"
    |
      "(" expr() ")"
    |
      "new" <ID>
    |
      { initObjectTables(); } <ID> "." <ID>
    }

• Since the actions are different, left-factoring cannot be performed.
Local LOOKAHEAD Hint

    void basic_expr() : {}
    {
      LOOKAHEAD(2)
      <ID> "(" expr() ")"    // Choice 1
    |
      "(" expr() ")"         // Choice 2
    |
      "new" <ID>             // Choice 3
    |
      <ID> "." <ID>          // Choice 4
    }

The resulting choice determination algorithm:

    if (next 2 tokens are <ID> and "(") {
      choose Choice 1
    } else if (next token is "(") {
      choose Choice 2
    } else if (next token is "new") {
      choose Choice 3
    } else if (next token is <ID>) {
      choose Choice 4
    } else {
      produce an error message
    }
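The decision logic with the 2-token hint can be written out in Java as follows. As before, this is a sketch with assumed names (`LocalLookaheadDemo`, `chooseBasicExpr`, `kindAt`, and the `Kind` enum), not the code javacc emits.

```java
import java.util.List;

// Hypothetical sketch of choice determination for the four-alternative
// basic_expr() when Choice 1 carries a 2-token lookahead hint.
class LocalLookaheadDemo {
    enum Kind { ID, LPAREN, RPAREN, NEW, DOT, EOF }

    // Returns the number of the chosen alternative (1 to 4).
    static int chooseBasicExpr(List<Kind> tokens, int pos) {
        Kind first = kindAt(tokens, pos);
        Kind second = kindAt(tokens, pos + 1);
        // Only Choice 1 inspects two tokens; the rest still use one.
        if (first == Kind.ID && second == Kind.LPAREN) return 1;
        if (first == Kind.LPAREN) return 2;
        if (first == Kind.NEW) return 3;
        if (first == Kind.ID) return 4;
        throw new IllegalStateException("Parse error: unexpected " + first);
    }

    // Token kind at 'pos', or EOF past the end of the input.
    static Kind kindAt(List<Kind> tokens, int pos) {
        return pos < tokens.size() ? tokens.get(pos) : Kind.EOF;
    }
}
```

Note that an <ID> followed by anything other than "(" falls through to Choice 4; if that token is not ".", the error surfaces later, while Choice 4 is being matched.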
References • https://javacc.dev.java.net/doc/lookahead.html