Agenda



  1. Agenda
     • Scanner vs. parser
     • Regular grammar vs. context-free grammar
     • Grammars (context-free grammars)
       • grammar rules
       • derivations
       • parse trees
       • ambiguous grammars
       • useful examples
     • Reading: Chapter 2, 4.1 and 4.2
     CPSC4600

  2. Characteristics of a Parser
     • Input: sequence of tokens from scanner
     • Output: parse tree of the program
       • parse tree is generated (implicitly or explicitly) if the input is a legal program
       • if input is an illegal program, syntax errors are issued
     • Note:
       • Instead of a parse tree, some parsers directly produce:
         • abstract syntax tree (AST) + symbol table, or
         • intermediate code, or
         • object code
       • In the following lectures, we’ll assume that a parse tree is generated.

  3. Comparison with Lexical Analysis

  4. Example
     • The program:
         x * y + z
     • Input to parser:
         ID TIMES ID PLUS ID
       we’ll write tokens as follows:
         id * id + id
     • Output of parser: the parse tree

                 E
               / | \
              E  +  E
            / | \   |
           E  *  E  id
           |     |
           id    id

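  5. Why are Regular Grammars Not Enough?
     Write an automaton that accepts the strings
     • “a”, “(a)”, “((a))”, and “(((a)))”
     • “a”, “(a)”, “((a))”, “(((a)))”, …, “(^k a )^k” (k opening parentheses, then a, then k closing parentheses, for any k ≥ 0)
     The first set is finite, so a finite automaton can accept it. The second requires counting an unbounded number of parentheses, which no finite automaton can do.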
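
The unbounded nesting on this slide can be recognized with recursion, which is the ability a context-free grammar adds over a regular one. The following is a minimal sketch, assuming the strings are described by the rule S → ( S ) | a; the rule and the function names `nested` and `accepts_nested` are illustrative and do not come from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: matches one S starting at s[*pos], where
 * S -> ( S ) | a   (assumed grammar for the strings on the slide). */
static bool nested(const char *s, size_t *pos) {
    if (s[*pos] == 'a') {              /* base case: the innermost "a" */
        (*pos)++;
        return true;
    }
    if (s[*pos] == '(') {              /* recursive case: "(" S ")" */
        (*pos)++;
        if (!nested(s, pos)) return false;
        if (s[*pos] != ')') return false;
        (*pos)++;
        return true;
    }
    return false;                      /* anything else is not an S */
}

/* Returns true iff the whole string has the form (^k a )^k for some k >= 0. */
bool accepts_nested(const char *s) {
    size_t pos = 0;
    return nested(s, &pos) && s[pos] == '\0';
}
```

Each recursive call remembers one pending closing parenthesis on the call stack, so the depth of nesting is not limited by a fixed number of states.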

  6. What must the parser do?
     • Recognizer: not all strings of tokens are programs
       • must distinguish between valid and invalid strings of tokens
     • Translator: must expose program structure
       • e.g., associativity and precedence
       • hence must return the parse tree
     We need:
     • A language for describing valid strings of tokens
       • context-free grammars
       • (analogous to regular grammars in the scanner)
     • A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
       • the parser
       • (analogous to the state machine in the scanner)

  7. Context-free grammars (CFGs)
     • Example: Simple Arithmetic Expressions Grammar
     • In English:
       • An integer is an arithmetic expression.
       • If exp1 and exp2 are arithmetic expressions, then so are the following:
           exp1 - exp2
           exp1 / exp2
           ( exp1 )
     • the corresponding CFG:            we’ll write tokens as follows:
         exp → INTLITERAL                E → intlit
         exp → exp MINUS exp             E → E - E
         exp → exp DIVIDE exp            E → E / E
         exp → LPAREN exp RPAREN         E → ( E )

  8. Reading the CFG
     • The grammar has five terminal symbols:
       • intlit, -, /, (, )
       • terminals of a grammar = tokens returned by the scanner.
     • The grammar has one non-terminal symbol:
       • E
       • non-terminals describe valid sequences of tokens
     • The grammar has four productions or rules, each of the form E → right-hand side:
       • left-hand side = a single non-terminal.
       • right-hand side = either
         • a sequence of one or more terminals and/or non-terminals, or
         • ε (an empty production).

  9. Example, revisited
     • Note: a more compact way to write the previous grammar:
         E → INTLITERAL | E - E | E / E | ( E )
       or
         E → INTLITERAL
           | E - E
           | E / E
           | ( E )

  10. A formal definition of CFGs
      • A CFG consists of
        • A set of terminals T
        • A set of non-terminals N
        • A start symbol S (a non-terminal)
        • A set of productions:
            X → X1 X2 … Xn
          where X ∈ N and Xi ∈ T ∪ N ∪ {ε}
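
One way to make the four components concrete is to store the productions as data and derive the rest from them. This is only a sketch: the `Production` struct, the `GRAMMAR` table, and both function names are assumptions made for illustration. The table holds the simple arithmetic expression grammar from the earlier slide, written with the abbreviated tokens.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of one production X -> X1 X2 ... Xn. */
typedef struct {
    const char *lhs;       /* a single non-terminal */
    const char *rhs[4];    /* terminals and/or non-terminals, NULL-terminated */
} Production;

/* E -> intlit | E - E | E / E | ( E ); the first left-hand side is the start symbol. */
static const Production GRAMMAR[] = {
    {"E", {"intlit", NULL}},
    {"E", {"E", "-", "E", NULL}},
    {"E", {"E", "/", "E", NULL}},
    {"E", {"(", "E", ")", NULL}},
};
static const size_t N_RULES = sizeof GRAMMAR / sizeof GRAMMAR[0];

/* A symbol is a non-terminal iff it appears on some left-hand side. */
bool is_nonterminal(const char *sym) {
    for (size_t i = 0; i < N_RULES; i++)
        if (strcmp(GRAMMAR[i].lhs, sym) == 0) return true;
    return false;
}

/* Counts the distinct terminals that occur on the right-hand sides. */
size_t count_terminals(void) {
    const char *seen[16];
    size_t n = 0;
    for (size_t i = 0; i < N_RULES; i++) {
        for (size_t j = 0; j < 4 && GRAMMAR[i].rhs[j] != NULL; j++) {
            const char *s = GRAMMAR[i].rhs[j];
            if (is_nonterminal(s)) continue;
            bool dup = false;
            for (size_t k = 0; k < n; k++)
                if (strcmp(seen[k], s) == 0) dup = true;
            if (!dup && n < 16) seen[n++] = s;
        }
    }
    return n;
}
```

For this table, `count_terminals()` returns 5 (intlit, -, /, and the two parentheses) and `E` is the only non-terminal, matching the reading of the grammar given earlier.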

  11. Notational Conventions
      • In these lecture notes
        • Non-terminals are written upper-case
        • Terminals are written lower-case
        • The start symbol is the left-hand side of the first production

  12. The Language of a CFG
      The language defined by a CFG is the set of strings that can be derived from the start symbol of the grammar.
      Derivation: read productions as rules:
        X → Y1 … Yn
      means X can be replaced by Y1 … Yn

  13. Derivation: key idea
      1. Begin with a string consisting of the start symbol “S”
      2. Replace any non-terminal X in the string by the right-hand side of some production for X
      3. Repeat (2) until there are no non-terminals in the string

  14. Derivation: an example
      CFG:
        E → id
        E → E + E
        E → E * E
        E → ( E )
      Is the string id * id + id in the language defined by the grammar?
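
The replacement step of a derivation can be sketched as plain string rewriting. The helper below is hypothetical (the name `derive_leftmost` and the convention that `E` is the only non-terminal character are assumptions). It always rewrites the left-most `E`, which is one of several valid orders.

```c
#include <stdio.h>
#include <string.h>

/* Rewrites the left-most non-terminal 'E' in `form` with `rhs`, writing the
 * result to `out` (capacity `cap`; `out` must not overlap `form`).
 * Returns 1 on success, 0 if `form` has no 'E' or the result does not fit. */
int derive_leftmost(const char *form, const char *rhs, char *out, size_t cap) {
    const char *e = strchr(form, 'E');
    if (e == NULL) return 0;
    size_t prefix = (size_t)(e - form);          /* text before the 'E' */
    int n = snprintf(out, cap, "%.*s%s%s", (int)prefix, form, rhs, e + 1);
    return n >= 0 && (size_t)n < cap;
}
```

Starting from `"E"` and applying the right-hand sides `E + E`, `E * E`, `id`, `id`, `id` in that order gives E ⇒ E + E ⇒ E * E + E ⇒ id * E + E ⇒ id * id + E ⇒ id * id + id, so the string is in the language.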

  15. Terminals
      • Terminals are called so because there are no rules for replacing them
      • Once generated, terminals are permanent
      • Therefore, terminals are the tokens of the language

  16. The Language of a CFG (Cont.)
      More formally, write
        X1 X2 … Xn ⇒ X1 X2 … Xi-1 Y1 Y2 … Ym Xi+1 … Xn
      if there is a production
        Xi → Y1 Y2 … Ym

  17. The Language of a CFG (Cont.)
      Write
        X1 X2 … Xn ⇒* Y1 Y2 … Ym
      if
        X1 X2 … Xn ⇒ … ⇒ Y1 Y2 … Ym
      in 0 or more steps

  18. The Language of a CFG
      Let G be a context-free grammar with start symbol S. Then the language of G is:
        { a1 a2 … an | S ⇒* a1 a2 … an }
      where ai, i = 1, 2, …, n, are terminal symbols

  19. Examples
      Strings of balanced parentheses
      The grammar: [two equivalent forms joined by “same as”; not preserved in this transcript]

  20. Arithmetic Expression Example
      Simple arithmetic expressions:
      Some elements of the language:

  21. Notes
      The idea of a CFG is a big step. But:
      • Membership in a language is “yes” or “no”
        • we also need the parse tree of the input!
        • furthermore, we must handle errors gracefully
      • Need an “implementation” of CFGs,
        • i.e. the parser
        • we’ll create the parser using a parser generator
        • available generators: CUP, bison, yacc

  22. More Notes
      • Form of the grammar is important
        • Many grammars generate the same language
        • Parsers are sensitive to the form of the grammar
      • Example:
          E → E + E | E - E | intlit
        is not suitable for an LL(1) parser (a common kind of parser).

  23. Derivations and Parse Trees
      A derivation is a sequence of productions
        S ⇒ … ⇒ … ⇒ …
      A derivation can be drawn as a tree
      • Start symbol is the tree’s root
      • For a production X → Y1 Y2 add children Y1 Y2 to node X

  24. Derivation Example
      • Grammar
      • String

  25. Derivation Example (Cont.)

                  E
                / | \
               E  +  E
             / | \   |
            E  *  E  id
            |     |
            id    id

  26. Notes on Derivations
      • A parse tree has
        • Terminals at the leaves
        • Non-terminals at the interior nodes
      • An in-order traversal of the leaves is the original input
      • The parse tree shows the association of operations, the input string does not
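
The claim that the leaves, read left to right, give back the original input can be checked with a small tree structure. This is a sketch under assumptions: the `Node` layout, `MAX_KIDS`, and the function names are invented for illustration, and `example_tree` hard-codes the parse tree of id * id + id shown earlier.

```c
#include <stddef.h>
#include <string.h>

#define MAX_KIDS 3

/* Assumed representation: a node with nkids == 0 is a leaf (a terminal). */
typedef struct Node {
    const char *symbol;                  /* "E", "id", "+", "*" */
    int nkids;
    const struct Node *kids[MAX_KIDS];
} Node;

/* Appends the leaves below `n`, left to right, to `out`, separated by spaces.
 * `out` must hold a valid (possibly empty) string of capacity `cap`. */
void collect_leaves(const Node *n, char *out, size_t cap) {
    if (n->nkids == 0) {
        if (out[0] != '\0') strncat(out, " ", cap - strlen(out) - 1);
        strncat(out, n->symbol, cap - strlen(out) - 1);
        return;
    }
    for (int i = 0; i < n->nkids; i++)
        collect_leaves(n->kids[i], out, cap);
}

/* Builds the parse tree of id * id + id: root E -> E + E, left E -> E * E. */
const Node *example_tree(void) {
    static const Node id1 = {"id", 0, {NULL}}, id2 = {"id", 0, {NULL}}, id3 = {"id", 0, {NULL}};
    static const Node star = {"*", 0, {NULL}}, plus = {"+", 0, {NULL}};
    static const Node e1 = {"E", 1, {&id1}}, e2 = {"E", 1, {&id2}}, e3 = {"E", 1, {&id3}};
    static const Node mul = {"E", 3, {&e1, &star, &e2}};
    static const Node root = {"E", 3, {&mul, &plus, &e3}};
    return &root;
}
```

Calling `collect_leaves(example_tree(), buf, sizeof buf)` on an empty buffer leaves `"id * id + id"` in `buf`. The grouping of `*` before `+` is visible only in the tree, not in that string.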

  27. Left-most and Right-most Derivations
      • The example is a left-most derivation
        • At each step, replace the left-most non-terminal
      • There is an equivalent notion of a right-most derivation

  28. Derivations and Parse Trees
      • Note that right-most and left-most derivations have the same parse tree
      • The difference is the order in which branches are added

  29. Remarks on Derivation
      • We are not just interested in whether s ∈ L(G)
        • We need a parse tree for s (because we need to build the AST)
      • A derivation defines a parse tree
        • But one parse tree may have many derivations
      • Left-most and right-most derivations are important in parser implementation

  30. Ambiguity (1)
      • Grammar
      • String

  31. Ambiguity (2)
      This string has two parse trees

                  E                        E
                / | \                    / | \
               E  +  E                  E  *  E
             / | \   |                  |   / | \
            E  *  E  id                 id E  +  E
            |     |                        |     |
            id    id                       id    id

  32. Ambiguity (3)
      • for each of the two parse trees, find the corresponding left-most derivation
      • for each of the two parse trees, find the corresponding right-most derivation

  33. Ambiguity (4)
      • A grammar is ambiguous if, for some string of the language
        • it has more than one parse tree, or
        • there is more than one right-most derivation, or
        • there is more than one left-most derivation.
        (the three conditions are equivalent)
      • Ambiguity leaves the meaning of some programs ill-defined
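
To see why ambiguity leaves meaning ill-defined, the two parse trees can be evaluated on concrete numbers. The sketch below assumes the ambiguous string is id * id + id, as in the earlier examples; the `Expr` struct and all function names are hypothetical.

```c
#include <stddef.h>

/* Assumed expression tree: op is '+' or '*', or 0 for a leaf holding `value`. */
typedef struct Expr {
    char op;
    int value;
    const struct Expr *left, *right;
} Expr;

int eval(const Expr *e) {
    if (e->op == 0) return e->value;
    int l = eval(e->left), r = eval(e->right);
    return e->op == '+' ? l + r : l * r;
}

/* Parse tree with + at the root: (a * b) + c. */
int value_plus_at_root(int a, int b, int c) {
    Expr x = {0, a, NULL, NULL}, y = {0, b, NULL, NULL}, z = {0, c, NULL, NULL};
    Expr mul = {'*', 0, &x, &y};
    Expr add = {'+', 0, &mul, &z};
    return eval(&add);
}

/* Parse tree with * at the root: a * (b + c). */
int value_times_at_root(int a, int b, int c) {
    Expr x = {0, a, NULL, NULL}, y = {0, b, NULL, NULL}, z = {0, c, NULL, NULL};
    Expr add = {'+', 0, &y, &z};
    Expr mul = {'*', 0, &x, &add};
    return eval(&mul);
}
```

For the token string 2 * 3 + 4, the first tree evaluates to 10 and the second to 14, so the grammar alone does not fix the result.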

  34. Dealing with Ambiguity
      • There are several ways to handle ambiguity
      • Most direct method is to rewrite grammar unambiguously
      • Enforces precedence of * over +

  35. Removing Ambiguity
      • Rewriting:
        • Expression Grammars
          • precedence
          • associativity
        • IF-THEN-ELSE
          • the Dangling-ELSE problem

  36. Handling operator precedence
      • Rewrite the grammar
        • use a different nonterminal for each precedence level
        • start with the lowest precedence (MINUS)
            E → E - E | E / E | ( E ) | id
          rewrite to
            E → E - T | T
            T → T / F | F
            F → id | ( E )

  37. Example
      parse tree for id - id / id
        E → E - T | T
        T → T / F | F
        F → id | ( E )

                  E
                / | \
               E  -  T
               |   / | \
               T  T  /  F
               |  |     |
               F  F     id
               |  |
               id id

  38. Handling Operator Associativity
      • A grammar that only separates the precedence levels, e.g.
          E → E - E | T
          T → T / T | F
          F → id | ( E )
        captures operator precedence, but it is still ambiguous!
        • it fails to express that both subtraction and division are left associative;
        • e.g., 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2)).

  39. Recursion
      • A grammar is recursive in nonterminal X if:
        • X ⇒+ … X …
        • ⇒+ means “in one or more steps”: X derives a sequence of symbols that includes an X
      • A grammar is left recursive in X if:
        • X ⇒+ X …
        • in one or more steps, X derives a sequence of symbols that starts with an X
      • A grammar is right recursive in X if:
        • X ⇒+ … X
        • in one or more steps, X derives a sequence of symbols that ends with an X

  40. Resolving ambiguity due to associativity
      • A grammar with the rules E → E - E and T → T / T is both left and right recursive in nonterminals E and T
      • To correctly express operator associativity:
        • For left associativity, use left recursion.
        • For right associativity, use right recursion.
      • Here's the correct grammar:
          E → E - T | T
          T → T / F | F
          F → id | ( E )
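
The effect of left recursion on grouping can be checked with a small evaluator for the correct grammar. This is a sketch with two stated departures from the slide: operands are integer literals instead of id, so that values can be computed, and each left-recursive rule is implemented as a loop that folds to the left, since a directly left-recursive procedure would never terminate. All names are hypothetical, error handling is omitted, and division by zero is not guarded.

```c
#include <ctype.h>

static const char *cur;                  /* next unread character of the input */

static int parse_E(void);

static void skip_spaces(void) { while (*cur == ' ') cur++; }

/* F -> intlit | ( E )   (intlit stands in for id) */
static int parse_F(void) {
    skip_spaces();
    if (*cur == '(') {
        cur++;
        int v = parse_E();
        skip_spaces();
        if (*cur == ')') cur++;          /* a missing ')' is silently ignored */
        return v;
    }
    int v = 0;
    while (isdigit((unsigned char)*cur)) v = v * 10 + (*cur++ - '0');
    return v;
}

/* T -> T / F | F, with the left recursion written as a loop */
static int parse_T(void) {
    int v = parse_F();
    for (skip_spaces(); *cur == '/'; skip_spaces()) {
        cur++;
        v = v / parse_F();               /* groups as (T / F) */
    }
    return v;
}

/* E -> E - T | T, with the left recursion written as a loop */
static int parse_E(void) {
    int v = parse_T();
    for (skip_spaces(); *cur == '-'; skip_spaces()) {
        cur++;
        v = v - parse_T();               /* groups as (E - T) */
    }
    return v;
}

int evaluate(const char *text) {
    cur = text;
    return parse_E();
}
```

With this sketch, `evaluate("5-3-2")` is 0, the value of ((5-3)-2), while `evaluate("5-(3-2)")` is 4. Division binds tighter than subtraction, so `evaluate("10-8/4")` is 8.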

  41. The Dangling “Else” ambiguity
      • Consider the grammar
          St → if E then St
             | if E then St else St
             | other
      • This grammar is also ambiguous

  42. Resolving the “dangling else”
      • else matches the closest unmatched then
      • We can describe this in the grammar
          St  → MIF    /* all then are matched */
              | UIF    /* some then are unmatched */
          MIF → if E then MIF else MIF
              | other
          UIF → if E then St
              | if E then MIF else UIF
      • Describes the same set of strings
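
C resolves its own dangling else with the same "closest unmatched" rule, which gives a quick way to observe the behaviour. The function below is an illustrative sketch (the name `dangling_else` is made up, and C has no `then` keyword, so `if (…)` plays that role). Compilers may warn about the missing braces, which are left out on purpose.

```c
/* Reports which branch runs for a nested if with a single else. */
const char *dangling_else(int outer, int inner) {
    const char *result = "neither";
    if (outer)
        if (inner)
            result = "then";
        else                 /* binds to `if (inner)`, the closest unmatched if */
            result = "else";
    return result;
}
```

`dangling_else(1, 0)` returns `"else"`, while `dangling_else(0, 0)` returns `"neither"`: the else belongs to the inner if, so nothing runs when the outer condition is false.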

  43. Precedence and Associativity Declarations in Parser Generators
      • Instead of rewriting the grammar
        • Use the more natural (ambiguous) grammar
        • Along with disambiguating declarations
      • Most parser generators allow precedence and associativity declarations to disambiguate grammars

  44. Parsing Approaches
      • Top-down parsing
        • build parse tree from start symbol (root)
        • match terminal symbols (tokens) in the production rules with tokens in the input stream
        • simple but limited in power
      • Bottom-up parsing
        • start from input token stream
        • build parse tree from terminal symbols (tokens) until the start symbol is reached
        • complex but powerful

  45. Top Down vs. Bottom Up
      [Figure: two panels, “Top-down Parsing” and “Bottom-up Parsing”, each drawn over the input token stream, with the labels “start here”, “result”, and “match”.]

  46. Top-down Parsing
      A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation.
      The parse tree associated with the input string is constructed using preorder traversal, hence the name “top-down”.

  47. Top-down parsers
      There are mainly two kinds of top-down parsers:
      1. Predictive parsers
         - Try to make decisions about the structure of the tree below a node based on a few lookahead tokens (usually one!).
         - Weakness: little program structure has been seen before predictive decisions must be made.
      2. Backtracking parsers
         - Solve the lookahead problem by backtracking if one decision turns out to be wrong and making a different choice.
         - Weakness: backtracking parsers are slow (exponential time in general).

  48. Recursive-descent parsing
      Main idea
      1. Use the grammar rules as recipes for procedure code that “parses” the rule
      2. Each non-terminal corresponds to a procedure
      3. Each appearance of a terminal in the right-hand side of a rule causes a token to be matched.
      4. Each appearance of a non-terminal corresponds to a call of the associated procedure.

  49. Example: Recursive-descent Parsing
      F → ( E ) | num
      Code:
        void F() {
            if (token == num)
                match(num);
            else {
                match('(');    // match token '('
                E();
                match(')');    // match token ')'
            }
        }

  50. Example: Recursive-descent Parsing (2)
      Observation: lookahead is not a problem in this example. If the token is a number, go one way; if the token is '(', go the other; and if the token is neither, declare an error:
        void match(Token expect) {
            if (token == expect)
                getToken();             // get next token
            else
                error(token, expect);
        }
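
The two fragments above can be combined into a small recognizer that compiles on its own. This is a sketch under several assumptions that are not in the slides: tokens are read from a character string, `NUM` and `END` are made-up token codes, errors set a flag instead of calling `error(token, expect)`, and, because the slides give no rule for E, the rule E → F + E | F is used purely for illustration.

```c
#include <ctype.h>
#include <stdbool.h>

enum { NUM = 256, END = 257 };   /* assumed token codes; other tokens are their own character */

static const char *input;        /* remaining input text */
static int token;                /* current lookahead token */
static bool failed;              /* set when a syntax error is seen */

static void getToken(void) {
    while (*input == ' ') input++;
    if (*input == '\0') { token = END; return; }
    if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input)) input++;
        token = NUM;
        return;
    }
    token = (unsigned char)*input++;
}

static void match(int expect) {
    if (token == expect) getToken();
    else failed = true;          /* the slides call error(token, expect) here */
}

static void E(void);

/* F -> ( E ) | num */
static void F(void) {
    if (token == NUM) {
        match(NUM);
    } else if (token == '(') {
        match('(');
        E();
        match(')');
    } else {
        failed = true;           /* neither num nor '(': report instead of recursing */
    }
}

/* Assumed rule, not from the slides: E -> F + E | F */
static void E(void) {
    F();
    if (token == '+') { match('+'); E(); }
}

/* Returns true iff `text` is derivable from E. */
bool parses(const char *text) {
    input = text;
    failed = false;
    getToken();
    E();
    return !failed && token == END;
}
```

`parses("(1+2)")` and `parses("((7))")` succeed, while `parses("(1+")` and `parses("1 2")` fail. The explicit third branch in `F` differs from the slide version so that an unexpected token stops the recursion instead of being retried.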
