Agenda
• Scanner vs. parser
• Regular grammar vs. context-free grammar
• Grammars (context-free grammars)
  • grammar rules
  • derivations
  • parse trees
  • ambiguous grammars
  • useful examples
• Reading: Chapter 2, 4.1 and 4.2
CPSC4600
Characteristics of a Parser
• Input: sequence of tokens from the scanner
• Output: parse tree of the program
  • a parse tree is generated (implicitly or explicitly) if the input is a legal program
  • if the input is an illegal program, syntax errors are issued
• Note: instead of a parse tree, some parsers directly produce:
  • an abstract syntax tree (AST) + symbol table, or
  • intermediate code, or
  • object code
• In the following lectures, we’ll assume that a parse tree is generated.
Comparison with Lexical Analysis
Example
• The program: x * y + z
• Input to parser: ID TIMES ID PLUS ID
  • we’ll write tokens as follows: id * id + id
• Output of parser: the parse tree

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
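Why are Regular Grammars Not Enough?
Write an automaton that accepts the strings
• “a”, “(a)”, “((a))”, and “(((a)))”
• “a”, “(a)”, “((a))”, “(((a)))”, …, “(^k a )^k” for every k ≥ 0
The first set is finite, so a finite automaton can accept it. The second requires matching an unbounded number of parentheses, which no finite automaton (and hence no regular grammar) can do.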
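The second language becomes easy to recognize once an unbounded counter is available. A minimal sketch in C, assuming the input is a plain character string; the function name accepts_nested is hypothetical and not from the slides:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if s has the form (^k a )^k for some k >= 0, otherwise 0.
   Hypothetical helper, written only to illustrate the counting argument. */
int accepts_nested(const char *s)
{
    assert(s != NULL);
    size_t depth = 0;                 /* unbounded counter: what a finite automaton lacks */

    while (*s == '(') { depth++; s++; }   /* count the opening parentheses */
    if (*s != 'a') return 0;              /* exactly one 'a' in the middle */
    s++;
    while (*s == ')') {
        if (depth == 0) return 0;         /* more ')' than '(' */
        depth--;
        s++;
    }
    return *s == '\0' && depth == 0;      /* nothing left over, counts agree */
}
```

For example, accepts_nested("((a))") returns 1, while accepts_nested("((a)") and accepts_nested("(a))") both return 0.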
What must the parser do?
• Recognizer: not all strings of tokens are programs
  • must distinguish between valid and invalid strings of tokens
• Translator: must expose program structure
  • e.g., associativity and precedence
  • hence must return the parse tree
We need:
• A language for describing valid strings of tokens
  • context-free grammars
  • (analogous to regular grammars in the scanner)
• A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
  • the parser
  • (analogous to the state machine in the scanner)
Context-free grammars (CFGs)
• Example: Simple Arithmetic Expressions Grammar
• In English:
  • An integer is an arithmetic expression.
  • If exp1 and exp2 are arithmetic expressions, then so are the following:
      exp1 - exp2
      exp1 / exp2
      ( exp1 )
• The corresponding CFG (right column: how we’ll write the tokens):
    exp → INTLITERAL             E → intlit
    exp → exp MINUS exp          E → E - E
    exp → exp DIVIDE exp         E → E / E
    exp → LPAREN exp RPAREN      E → ( E )
Reading the CFG
• The grammar has five terminal symbols: intlit, -, /, (, )
  • terminals of a grammar = tokens returned by the scanner
• The grammar has one non-terminal symbol: E
  • non-terminals describe valid sequences of tokens
• The grammar has four productions or rules, each of the form E → α
  • left-hand side = a single non-terminal
  • right-hand side (α) = either
    • a sequence of one or more terminals and/or non-terminals, or
    • ε (an empty production)
Example, revisited
• Note: a more compact way to write the previous grammar:
    E → INTLITERAL | E - E | E / E | ( E )
  or
    E → INTLITERAL
      | E - E
      | E / E
      | ( E )
A formal definition of CFGs
A CFG consists of
• A set of terminals T
• A set of non-terminals N
• A start symbol S (a non-terminal)
• A set of productions: X → Y1 Y2 … Yn
  • where X ∈ N and Yi ∈ T ∪ N ∪ {ε}
Notational Conventions
In these lecture notes
• Non-terminals are written upper-case
• Terminals are written lower-case
• The start symbol is the left-hand side of the first production
The Language of a CFG
The language defined by a CFG is the set of strings that can be derived from the start symbol of the grammar.
Derivation: read productions as rules:
  X → Y1 … Yn
means X can be replaced by Y1 … Yn
Derivation: key idea
1. Begin with a string consisting of the start symbol “S”
2. Replace any non-terminal X in the string by the right-hand side of some production for X
3. Repeat (2) until there are no non-terminals in the string
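The three steps above can be sketched as string rewriting. This is only an illustration under stated assumptions: a single non-terminal E, the one-character encoding i for the token id, and a hypothetical helper derive_leftmost that always rewrites the left-most E (step 2 allows any non-terminal to be chosen).

```c
#include <assert.h>
#include <string.h>

/* Replaces the left-most non-terminal 'E' in form with rhs (one derivation step).
   Returns 1 on success, 0 if form contains no 'E' or the result would not fit
   in cap bytes. Hypothetical helper, not from the slides. */
int derive_leftmost(char *form, size_t cap, const char *rhs)
{
    assert(form != NULL && rhs != NULL);
    char *e = strchr(form, 'E');              /* left-most non-terminal */
    if (e == NULL) return 0;                  /* step 3: nothing left to replace */

    size_t rhs_len  = strlen(rhs);
    size_t tail_len = strlen(e + 1);          /* symbols after the chosen 'E' */
    size_t new_len  = (size_t)(e - form) + rhs_len + tail_len;
    if (new_len + 1 > cap) return 0;          /* would overflow the buffer */

    memmove(e + rhs_len, e + 1, tail_len + 1);  /* shift the tail, including the NUL */
    memcpy(e, rhs, rhs_len);                    /* splice in the right-hand side */
    return 1;
}
```

Starting from "E" and applying the right-hand sides E+E, E*E, i, i, i in that order produces E+E, E*E+E, i*E+E, i*i+E and finally i*i+i; one more call returns 0 because no non-terminal remains.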
Derivation: an example
CFG:
  E → id
  E → E + E
  E → E * E
  E → ( E )
Is the string id * id + id in the language defined by the grammar?
Terminals
• Terminals are called so because there are no rules for replacing them
• Once generated, terminals are permanent
• Therefore, terminals are the tokens of the language
The Language of a CFG (Cont.)
More formally, write
  X1 X2 … Xn ⇒ X1 X2 … Xi-1 Y1 Y2 … Ym Xi+1 … Xn
if there is a production
  Xi → Y1 Y2 … Ym
The Language of a CFG (Cont.)
Write
  X1 X2 … Xn ⇒* Y1 Y2 … Ym
if
  X1 X2 … Xn ⇒ … ⇒ Y1 Y2 … Ym
in 0 or more steps
The Language of a CFG
Let G be a context-free grammar with start symbol S. Then the language of G is:
  { a1 a2 … an | S ⇒* a1 a2 … an }
where ai, i = 1, 2, …, n, are terminal symbols
Examples Strings of balanced parentheses The grammar: same as
Arithmetic Expression Example Simple arithmetic expressions: Some elements of the language:
Notes
The idea of a CFG is a big step. But:
• Membership in a language is “yes” or “no”
  • we also need the parse tree of the input!
  • furthermore, we must handle errors gracefully
• We need an “implementation” of CFGs,
  • i.e. the parser
  • we’ll create the parser using a parser generator
  • available generators: CUP, bison, yacc
More Notes
• The form of the grammar is important
  • Many grammars generate the same language
  • Parsers are sensitive to the form of the grammar
• Example: E → E + E | E - E | intlit is not suitable for an LL(1) parser (a common kind of parser), because it is left recursive and ambiguous.
Derivations and Parse Trees
A derivation is a sequence of production applications
  S ⇒ … ⇒ … ⇒ …
A derivation can be drawn as a tree
• The start symbol is the tree’s root
• For a production X → Y1 … Yn, add children Y1 … Yn to node X
Derivation Example
• Grammar: E → E + E | E * E | ( E ) | id
• String: id * id + id
Derivation Example (Cont.)

          E
        / | \
       E  +  E
     / | \   |
    E  *  E  id
    |     |
    id    id
Notes on Derivations
• A parse tree has
  • Terminals at the leaves
  • Non-terminals at the interior nodes
• An in-order traversal of the leaves is the original input
• The parse tree shows the association of operations; the input string does not
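The claim that the leaves, read left to right, give back the input can be checked with a small tree walk. A minimal sketch in C; the struct Node layout, the MAX_KIDS limit and the function collect_leaves are all assumptions made for this illustration, not part of the slides.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_KIDS 3   /* enough for rules such as E -> E + E (assumed limit) */

struct Node {
    const char *label;                   /* grammar symbol at this node */
    int nkids;                           /* 0 for a leaf (terminal) */
    const struct Node *kids[MAX_KIDS];   /* children, left to right */
};

/* Appends the leaves below n, left to right, to the NUL-terminated string out,
   separated by single spaces. cap is the size of the out buffer. */
void collect_leaves(const struct Node *n, char *out, size_t cap)
{
    assert(n != NULL && out != NULL);
    if (n->nkids == 0) {                 /* terminal: emit its label */
        size_t used = strlen(out);
        snprintf(out + used, cap - used, "%s%s", used ? " " : "", n->label);
        return;
    }
    for (int i = 0; i < n->nkids; i++)   /* non-terminal: visit children in order */
        collect_leaves(n->kids[i], out, cap);
}
```

Built for the parse tree of id * id + id (root E with children E, +, E, where the left E has children E, *, E), the walk fills the buffer with "id * id + id". The tree additionally records that the multiplication is grouped first, which the flat string does not.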
Left-most and Right-most Derivations
• The example is a left-most derivation
  • At each step, replace the left-most non-terminal
• There is an equivalent notion of a right-most derivation
Derivations and Parse Trees
• Note that right-most and left-most derivations have the same parse tree
• The difference is the order in which branches are added
Remarks on Derivation
• We are not just interested in whether s ∈ L(G)
  • We need a parse tree for s (because we need to build the AST)
• A derivation defines a parse tree
  • But one parse tree may have many derivations
• Left-most and right-most derivations are important in parser implementation
Ambiguity (1) • Grammar • String
Ambiguity (2)
This string has two parse trees:
• one with + at the root and the * subexpression below it
• one with * at the root and the + subexpression below it
Ambiguity (3)
• For each of the two parse trees, find the corresponding left-most derivation
• For each of the two parse trees, find the corresponding right-most derivation
Ambiguity (4)
• A grammar is ambiguous if, for some string of the language,
  • it has more than one parse tree, or
  • there is more than one right-most derivation, or
  • there is more than one left-most derivation
  (the three conditions are equivalent)
• Ambiguity leaves the meaning of some programs ill-defined
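Ambiguity can be made concrete by counting parse trees. The sketch below assumes the grammar E → E + E | E * E | id (parentheses left out for brevity) and encodes the token id as the character i; count_parse_trees and count_range are hypothetical names. It tries every operator as the top-level one, so it runs in exponential time and is meant only for short inputs.

```c
#include <assert.h>
#include <string.h>

/* Number of parse trees for s[lo..hi) under E -> E + E | E * E | id,
   where the character 'i' stands for the token id. Requires lo < hi. */
static unsigned long count_range(const char *s, size_t lo, size_t hi)
{
    if (hi - lo == 1)
        return s[lo] == 'i';                   /* E -> id */

    unsigned long total = 0;
    for (size_t k = lo + 1; k + 1 < hi; k++)   /* k = position of the root operator */
        if (s[k] == '+' || s[k] == '*')
            total += count_range(s, lo, k) * count_range(s, k + 1, hi);
    return total;
}

/* Returns how many distinct parse trees the token string s has (0 if none). */
unsigned long count_parse_trees(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s);
    return n == 0 ? 0 : count_range(s, 0, n);
}
```

Under these assumptions count_parse_trees("i*i+i") returns 2, matching the two trees of the example, while count_parse_trees("i+i") returns 1. A result greater than 1 for any string shows that the grammar is ambiguous.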
Dealing with Ambiguity
• There are several ways to handle ambiguity
• The most direct method is to rewrite the grammar unambiguously
  • Enforces precedence of * over +
Removing Ambiguity
• Rewriting:
  • Expression grammars
    • precedence
    • associativity
  • IF-THEN-ELSE
    • the dangling-ELSE problem
Handling operator precedence
• Rewrite the grammar
  • use a different non-terminal for each precedence level
  • start with the lowest precedence (MINUS)
    E → E - E | E / E | ( E ) | id
  rewrite to
    E → E - T | T
    T → T / F | F
    F → id | ( E )
Example
  E → E - T | T
  T → T / F | F
  F → id | ( E )
Parse tree for id - id / id:

          E
        / | \
       E  -  T
       |   / | \
       T  T  /  F
       |  |     |
       F  F     id
       |  |
       id id
Handling Operator Associativity
• A grammar that only separates the precedence levels, such as E → E - E | T, T → T / T | F, F → id | ( E ), captures operator precedence, but it is still ambiguous!
  • it fails to express that both subtraction and division are left associative;
  • e.g., 5-3-2 is equivalent to ((5-3)-2) and not to (5-(3-2)).
Recursion
• A grammar is recursive in non-terminal X if:
  • X ⇒+ … X …
  • ⇒+ means “in one or more steps”: X derives a sequence of symbols that includes an X
• A grammar is left recursive in X if:
  • X ⇒+ X …
  • in one or more steps, X derives a sequence of symbols that starts with an X
• A grammar is right recursive in X if:
  • X ⇒+ … X
  • in one or more steps, X derives a sequence of symbols that ends with an X
Resolving ambiguity due to associativity
• The grammar given above is both left and right recursive in non-terminals E and T
• To express operator associativity correctly:
  • For left associativity, use left recursion.
  • For right associativity, use right recursion.
• Here's the correct grammar:
    E → E - T | T
    T → T / F | F
    F → id | ( E )
The Dangling “Else” ambiguity
• Consider the grammar
    St → if E then St
       | if E then St else St
       | other
• This grammar is also ambiguous: in if E then if E then other else other, the else can be attached to either if
Resolving the “dangling else”
• else matches the closest unmatched then
• We can describe this in the grammar
    St  → MIF    /* all then are matched */
        | UIF    /* some then are unmatched */
    MIF → if E then MIF else MIF
        | other
    UIF → if E then St
        | if E then MIF else UIF
• Describes the same set of strings
Precedence and Associativity Declarations in Parser Generators
• Instead of rewriting the grammar
  • Use the more natural (ambiguous) grammar
  • Along with disambiguating declarations
• Most parser generators allow precedence and associativity declarations to disambiguate grammars
Parsing Approaches
• Top-down parsing
  • build the parse tree from the start symbol (root)
  • match terminal symbols (tokens) in the production rules with tokens in the input stream
  • simple but limited in power
• Bottom-up parsing
  • start from the input token stream
  • build the parse tree from terminal symbols (tokens) until the start symbol is reached
  • complex but powerful
Top Down vs. Bottom Up
[Figure: two panels, “Top-down Parsing” and “Bottom-up Parsing”, each drawn over the input token stream and marked with “start here” and “result” labels; the top-down panel also carries the label “match”.]
Top-down Parsing
A top-down parsing algorithm parses an input string of tokens by tracing out the steps in a leftmost derivation. The parse tree associated with the input string is constructed using preorder traversal, hence the name “top-down”.
Top-down parsers
There are mainly two kinds of top-down parsers:
1. Predictive parsers
   - Try to make decisions about the structure of the tree below a node based on a few lookahead tokens (usually one!).
   - Weakness: little program structure has been seen before predictive decisions must be made.
2. Backtracking parsers
   - Solve the lookahead problem by backtracking if one decision turns out to be wrong and making a different choice.
   - Weakness: backtracking parsers are slow (exponential time in general).
Recursive-descent parsing
Main idea
1. Use the grammar rules as recipes for procedure code that “parses” the rule
2. Each non-terminal corresponds to a procedure
3. Each appearance of a terminal in the right-hand side of a rule causes a token to be matched
4. Each appearance of a non-terminal corresponds to a call of the associated procedure
Example: Recursive-descent Parsing
  F → ( E ) | num
Code:
  void F() {
      if (token == num)
          match(num);
      else {
          match('(');   // match token '('
          E();          // parse the nested expression
          match(')');   // match token ')'
      }
  }
Example: Recursive-descent Parsing (2)
Observation: note how lookahead is not a problem in this example: if the token is a number, go one way; if the token is '(', go the other; and if the token is neither, declare an error:
  void match(Token expect) {
      if (token == expect)
          getToken();            // get next token
      else
          error(token, expect);  // report the mismatch
  }
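The two fragments can be assembled into a complete recognizer. The following is a sketch under several assumptions that do not come from the slides: the Token enumeration, a token array ending in END as a stand-in for the scanner, an error() that only sets a flag, the entry point parse, and the rule E → F { - F } chosen for E because the slides give a rule only for F. F tests for '(' explicitly so that a token that is neither num nor '(' is rejected without recursing.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NUM, LPAREN, RPAREN, MINUS, END } Token;   /* assumed token set */

static const Token *input;   /* token stream, terminated by END (stand-in for the scanner) */
static Token token;          /* current lookahead token */
static int failed;           /* set once a syntax error has been seen */

static void getToken(void) { if (token != END) token = *++input; }
static void error(void)    { failed = 1; }

/* Consume the lookahead if it is the expected token, otherwise flag an error. */
static void match(Token expect)
{
    if (token == expect) getToken();
    else error();
}

static void E(void);

/* F -> ( E ) | num : one token of lookahead selects the alternative */
static void F(void)
{
    if (token == NUM)         match(NUM);
    else if (token == LPAREN) { match(LPAREN); E(); match(RPAREN); }
    else                      error();      /* neither num nor '(' */
}

/* Assumed rule: E -> F { - F } */
static void E(void)
{
    F();
    while (token == MINUS && !failed) { match(MINUS); F(); }
}

/* Returns 1 if the END-terminated token array is a valid E, otherwise 0. */
int parse(const Token *tokens)
{
    assert(tokens != NULL);
    input = tokens;
    token = *input;
    failed = 0;
    E();
    return !failed && token == END;   /* all input must be consumed */
}
```

Given the tokens for ( num - num ), that is LPAREN NUM MINUS NUM RPAREN END, parse returns 1; for LPAREN NUM END it returns 0 because match(RPAREN) fails.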