Learn about parsing to extract the grammatical structure of a sentence using context-free grammars. Explore tabular parsing methods, context-free languages, parse trees, syntax trees, and ambiguity resolution in grammar.
Parsing Giuseppe Attardi Università di Pisa
Parsing • Goal: to extract the grammatical structure of a sentence, where sentence = program and words = tokens • For further information: Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)
Outline of coverage • Context-free grammars • Parsing • Tabular Parsing Methods • One pass • Top-down • Bottom-up • Yacc
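Grammatical structure of program [Figure: parse tree of a “hello, world” program. The root function-def has children name (main), arguments (()), and stmt-list; the stmt-list contains a stmt whose expression has the form expression operator expression, with variable cout, operator <<, and string “hello, world\n”.]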
Context-free languages • Grammatical structure is defined by a context-free grammar, for example:
    statement → assignment ;
    statement → expression ;
    statement → compound-statement
    assignment → ident = expression
    compound-statement → { declaration-list statement-list }
• Definition (context-free): a grammar in which the left-hand side of every production is a single non-terminal symbol.
Context Free Grammar G = (V, Σ, P, S) • V is a finite set of non-terminal symbols • Σ is a finite set of terminal symbols • P is a finite set of productions of the form V → (V ∪ Σ)* • S is the start symbol
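As a minimal sketch, the four components of G can be written down as plain data. The names `Grammar` and `is_well_formed` are illustrative only, and the sample grammar (L → L ; E | E, E → a | b) is chosen here as an assumption for the demonstration.

```python
from typing import NamedTuple

class Grammar(NamedTuple):
    nonterminals: frozenset   # V
    terminals: frozenset      # Sigma
    productions: tuple        # P: pairs (lhs, rhs), rhs is a tuple of symbols
    start: str                # S

def is_well_formed(g: Grammar) -> bool:
    """Check that every production has the form V -> (V ∪ Σ)*."""
    symbols = g.nonterminals | g.terminals
    return (
        g.start in g.nonterminals
        and not (g.nonterminals & g.terminals)
        and all(lhs in g.nonterminals and all(s in symbols for s in rhs)
                for lhs, rhs in g.productions)
    )

# Sample grammar (an assumption for illustration): L -> L ; E | E,  E -> a | b
G = Grammar(
    nonterminals=frozenset({"L", "E"}),
    terminals=frozenset({"a", "b", ";"}),
    productions=(("L", ("L", ";", "E")), ("L", ("E",)),
                 ("E", ("a",)), ("E", ("b",))),
    start="L",
)
```

The check on `lhs` is exactly the context-free restriction: a production whose left-hand side is anything other than one non-terminal is rejected.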
Parse trees • “Parse tree from A” = root labeled with A • “Complete parse tree” = all leaves labeled with tokens • Parse tree: a tree labeled with grammar symbols, such that if a node is labeled A and its children are labeled x1...xn, then there is a production A → x1...xn
Parse trees and sentences [Figure: a parse tree from L with its frontier marked] • Frontier of tree = labels on leaves (in left-to-right order) • Frontier of a tree from S is a sentential form • Definition. A sentence is the frontier of a complete tree from S.
Example G: L → L ; E | E, E → a | b • Syntax trees from start symbol (L): [Figure: three parse trees from L] • Sentential forms: a, a;E, a;b;b
Derivations • Alternate definition of sentential form: • Given α, β in (V ∪ Σ)*, say α ⇒ β is a derivation step if α = α'Aα'' and β = α'γα'', where A → γ is a production • α is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ α (alternatively, we say that S ⇒* α) • The two definitions are equivalent, but note that there are many derivations corresponding to each parse tree
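The derivation-step definition can be turned into a small search, sketched here under the assumption that the grammar is L → L ; E | E, E → a | b. The function names are hypothetical; the search is only guaranteed to stop because no production of this grammar shortens a sentential form.

```python
# Productions of the example grammar: L -> L ; E | E,  E -> a | b
PRODUCTIONS = {"L": [["L", ";", "E"], ["E"]], "E": [["a"], ["b"]]}

def derivation_steps(form):
    """Yield every sentential form reachable from `form` in one step."""
    for i, symbol in enumerate(form):
        for rhs in PRODUCTIONS.get(symbol, []):
            yield form[:i] + rhs + form[i + 1:]

def is_sentential_form(target, start="L"):
    """Breadth-first search for a derivation start =>* target."""
    frontier, seen = [[start]], set()
    while frontier:
        form = frontier.pop(0)
        if form == target:
            return True
        # Forms never shrink in this grammar, so longer forms are dead ends.
        if len(form) > len(target) or tuple(form) in seen:
            continue
        seen.add(tuple(form))
        frontier.extend(derivation_steps(form))
    return False
```

For instance, `is_sentential_form(list("a;E"))` explores L, L;E, E, E;E and then reaches a;E.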
Another example H: L → E ; L | E, E → a | b [Figure: parse trees from L under H]
Ambiguity • A sentence can have more than one parse tree • A grammar is ambiguous if there is a sentence with more than one parse tree • Example 1: E → E + E | E * E | id [Figure: two parse trees for the same sentence, one with + at the root and the other with * at the root]
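A rough way to see the ambiguity of Example 1 is to count parse trees. The sketch below (the function name is an assumption) tries every operator position as the root of the tree and multiplies the counts of the two sides.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of `tokens` under E -> E + E | E * E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(lo, hi):                 # trees deriving tokens[lo:hi] from E
        if hi - lo == 1:
            return 1 if tokens[lo] == "id" else 0
        total = 0
        for k in range(lo + 1, hi - 1):
            if tokens[k] in ("+", "*"):    # tokens[k] is the root operator
                total += count(lo, k) * count(k + 1, hi)
        return total

    return count(0, len(tokens))
```

Under these assumptions, `count_parse_trees("id + id * id".split())` returns 2: one tree groups the product first, the other groups the sum first.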
Notes • if e then if b then d else f • { int x; y = 0; } • a.b.c = d; • Id → s | s.id • E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id
Ambiguity • Ambiguity is a feature of the grammar rather than the language • Certain ambiguous grammars may have equivalent unambiguous ones
Grammar Transformations • Grammars can be transformed without affecting the language generated • Three transformations are discussed next: • Eliminating Ambiguity • Eliminating Left Recursion (i.e. productions of the form A → A α) • Left Factoring
Eliminating Ambiguity • Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity • For example, the grammar of Example 1 can be written as follows: • E → E + T | T • T → T * id | id • The language generated by this grammar is the same as that generated by the previous grammar. Both generate id(+id|*id)* • However, this grammar is not ambiguous
Eliminating Ambiguity (Cont.) • One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions [Figure: parse tree for id + id * id, in which the subtree for id * id hangs below the + node]
Eliminating Ambiguity (Cont.) • An example of ambiguity in a programming language is the dangling else • Consider • S → if b then S else S | if b then S | a
Eliminating Ambiguity (Cont.) • When there are two nested ifs and only one else, the sentence has two parse trees [Figure: two parse trees for if b then if b then a else a, one attaching the else to the inner if and one attaching it to the outer if]
Eliminating Ambiguity (Cont.) • In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. • This association is expressed in the following (unambiguous) grammar:
    S → Matched | Unmatched
    Matched → if b then Matched else Matched | a
    Unmatched → if b then S | if b then Matched else Unmatched
Eliminating Ambiguity (Cont.) • Ambiguity is a property of the grammar • It is undecidable whether a context-free grammar is ambiguous • The proof is done by reduction from Post’s correspondence problem • Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars
Eliminating Ambiguity (Cont.) • For example, a grammar containing the productions A → A A | a would be ambiguous, because the string aaa has two parses [Figure: the two parse trees for aaa, one grouping the first two a's and one grouping the last two] • This ambiguity disappears if we use the productions • A → A B | B and B → a or the productions • A → B A | B and B → a.
Eliminating Ambiguity (Cont.) • Examples of ambiguous productions:
    A → A α A
    A → α A | A β
    A → α A | α A β A
• A CF language is inherently ambiguous if it has no unambiguous CFG • An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar:
    S → A B | D C
    A → a A | ε        C → c C | ε
    B → b B c | ε      D → a D b | ε
Elimination of Left Recursion • A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ A α for some string α. • Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed • Immediate left recursion (productions of the form A → A α) can be easily eliminated:
1. Group the A-productions as
    A → A α1 | A α2 | … | A αm | β1 | β2 | … | βn
   where no βi begins with A
2. Replace the A-productions by
    A → β1 A' | β2 A' | … | βn A'
    A' → α1 A' | α2 A' | … | αm A' | ε
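The two steps can be sketched in a few lines. This is an illustration, not a reference implementation: the function name is hypothetical, right-hand sides are lists of symbols with the empty list standing for ε, and the new nonterminal is assumed to be free to be named by appending a prime.

```python
def eliminate_immediate_left_recursion(a, alternatives):
    """Rewrite the A-productions so that none of them starts with A.

    `alternatives` is a list of right-hand sides (lists of symbols);
    the empty list stands for ε. Returns a dict of the new productions.
    """
    alphas = [rhs[1:] for rhs in alternatives if rhs[:1] == [a]]   # A -> A α
    betas = [rhs for rhs in alternatives if rhs[:1] != [a]]        # A -> β
    if not alphas:
        return {a: alternatives}           # nothing to eliminate
    a_prime = a + "'"                      # assumed not to clash with existing names
    return {
        a: [beta + [a_prime] for beta in betas],
        a_prime: [alpha + [a_prime] for alpha in alphas] + [[]],
    }
```

Applied to E → E + T | T, the sketch yields E → T E' and E' → + T E' | ε.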
Elimination of Left Recursion (Cont.) • The previous transformation, however, does not eliminate left recursion involving two or more steps • For example, consider the grammar • S → A a | b • A → A c | S d | ε • S is left-recursive because S ⇒ A a ⇒ S d a, but it is not immediately left recursive
Elimination of Left Recursion (Cont.)
Algorithm. Left recursion elimination.
Arrange the nonterminals in some order A1, A2, …, An
for i = 1 to n {
    for j = 1 to i - 1 {
        replace each production of the form Ai → Aj γ
        by Ai → δ1 γ | … | δk γ,
        where Aj → δ1 | … | δk are the current productions for Aj
    }
    eliminate the immediate left recursion among the Ai-productions
}
Elimination of Left Recursion (Cont.) • Notice that iteration i only changes productions with Ai on the left-hand side, and that afterwards every production of the form Ai → Am α has m > i • Correctness proof by induction: • Clearly true for i = 1 • If true for all i < k, then when the outer loop is executed for i = k, the inner loop removes all productions Ai → Aj α with j < i • Finally, after the elimination of immediate left recursion, m in any production Ai → Am α is > i • At the end of the algorithm, all derivations of the form Ai ⇒+ Am α have m > i, and therefore left recursion is not present
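The general algorithm can be sketched as follows, with the grammar held in a dict from nonterminal to a list of right-hand sides (lists of symbols, empty list for ε). The function name, the representation, and the primed names for new nonterminals are assumptions made for this illustration.

```python
def eliminate_left_recursion(grammar, order):
    """Return a copy of `grammar` without left recursion.

    `order` lists the nonterminals A1, ..., An in the chosen order.
    """
    g = {a: [list(rhs) for rhs in rhss] for a, rhss in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs[:1] == [aj]:    # Ai -> Aj γ: substitute Aj's alternatives
                    expanded += [delta + rhs[1:] for delta in g[aj]]
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        # Eliminate immediate left recursion among the Ai-productions.
        alphas = [rhs[1:] for rhs in g[ai] if rhs[:1] == [ai]]
        if alphas:
            betas = [rhs for rhs in g[ai] if rhs[:1] != [ai]]
            new = ai + "'"
            g[ai] = [beta + [new] for beta in betas]
            g[new] = [alpha + [new] for alpha in alphas] + [[]]
    return g

# S -> A a | b,  A -> A c | S d | ε, with the order S, A
example = {"S": [["A", "a"], ["b"]], "A": [["A", "c"], ["S", "d"], []]}
result = eliminate_left_recursion(example, ["S", "A"])
```

For this input the sketch leaves S unchanged and produces A → b d A' | A' and A' → c A' | a d A' | ε.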
Left Factoring • Left factoring helps transform a grammar for predictive parsing • For example, if we have the two productions • S → if b then S else S • | if b then S • then on seeing the input token if, we cannot immediately tell which production to choose to expand S • In general, if we have A → α β1 | α β2 and the input begins with a nonempty string derived from α, we do not know (without looking further) which production to use to expand A
Left Factoring (Cont.) • However, we may defer the decision by expanding A to α A' • Then, after seeing the input derived from α, we may expand A' to β1 or to β2 • Left-factored, the original productions become • A → α A' • A' → β1 | β2
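A simplified sketch of the transformation: it factors out only a prefix shared by all alternatives of one nonterminal, whereas the general transformation also handles prefixes shared by a subset. The function name and the list-of-symbols representation (empty list for ε) are assumptions.

```python
def left_factor(a, alternatives):
    """Factor out the longest prefix common to ALL alternatives of `a`."""
    prefix = []
    for column in zip(*alternatives):      # walk the alternatives in lockstep
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    if len(alternatives) < 2 or not prefix:
        return {a: alternatives}           # nothing to factor
    a_prime = a + "'"                      # assumed not to clash with existing names
    return {
        a: [prefix + [a_prime]],
        a_prime: [rhs[len(prefix):] for rhs in alternatives],
    }

factored = left_factor("S", [
    ["if", "b", "then", "S", "else", "S"],
    ["if", "b", "then", "S"],
])
```

Here `factored` corresponds to S → if b then S S' and S' → else S | ε, so the choice between the two original productions is deferred until after the inner S has been read.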
Non-Context-Free Language Constructs • Examples of non-context-free languages are: • L1 = {wcw | w is of the form (a|b)*} • L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1} • L3 = {a^n b^n c^n | n ≥ 0} • Languages similar to these that are context free: • L'1 = {w c w^R | w is of the form (a|b)*} (w^R stands for w reversed) • This language is generated by the grammar S → a S a | b S b | c • L'2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1} • This language is generated by the grammar S → a S d | a A d, A → b A c | b c
Non-Context-Free Language Constructs (Cont.) • L''2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1} • is generated by the grammar S → A B, A → a A b | a b, B → c B d | c d • L'3 = {a^n b^n | n ≥ 1} • is generated by the grammar S → a S b | a b • This language is not definable by any regular expression
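CFG vs DFSA • L'4 = {a^n b^m | n > 0, m > 0} [Figure: state diagram of a finite automaton with start state 0 and state 1, with transitions labeled a and b]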
Non-Context-Free Language Constructs (Cont.) • Suppose we could construct a DFSM D accepting L'3. • D must have a finite number of states, say k. • Consider the sequence of states s0, s1, s2, …, sk entered by D after reading ε, a, aa, …, a^k. • Since D has only k states, two of the states in the sequence must be equal, say si = sj with i ≠ j. Because i and j differ, we may name them so that i ≥ 1. • From si, a sequence of i b's leads to an accepting (final) state, because a^i b^i is in L'3. Therefore the same sequence of i b's also leads to an accepting state from sj. • Hence D would accept a^j b^i with j ≠ i, which is not in L'3, so the language accepted by D is not identical to L'3. A contradiction.
Parsing • The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w from S.) • A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise) • Two classes of algorithms for parsing: • Top-down • Bottom-up
Parser generators • A parser generator is a program that reads a grammar and produces a parser • The best known parser generator is yacc, which produces bottom-up parsers • Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator
Top-down (predictive) parsing • Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal.
Top-down parsing algorithm
Let input = a1a2...an
current sentential form (csf) = S
loop {
    suppose csf = a1…ak A α
    based on ak+1…, choose a production A → β
    csf becomes a1…ak β α
}
Top-down parsing example
Grammar H: L → E ; L | E, E → a | b
Input: a;b
    Sentential form    Input
    L                  a;b
    E;L                a;b
    a;L                a;b
[Figure: the partial parse tree after each step]
Top-down parsing example (cont.)
    Sentential form    Input
    a;E                a;b
    a;b                a;b
[Figure: the partial parse tree after each step, ending with the complete tree for a;b]
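The sequence of sentential forms in this example can be reproduced with a small backtracking sketch. Note the assumption: instead of choosing the production from the next input symbol, this version simply tries each production in turn and backs up on failure, which is enough for a grammar as small as H. The function name is hypothetical.

```python
# Grammar H: L -> E ; L | E,  E -> a | b
H = {"L": [["E", ";", "L"], ["E"]], "E": [["a"], ["b"]]}

def leftmost_derivation(form, tokens):
    """Return the sentential forms of a leftmost derivation form =>* tokens, or None."""
    k = next((i for i, s in enumerate(form) if s in H), None)
    if k is None:                          # only tokens are left
        return [form] if form == tokens else None
    if form[:k] != tokens[:k] or len(form) > len(tokens):
        return None                        # wrong prefix, or too long to match
    for rhs in H[form[k]]:                 # expand the left-most non-terminal
        rest = leftmost_derivation(form[:k] + rhs + form[k + 1:], tokens)
        if rest is not None:
            return [form] + rest
    return None

steps = leftmost_derivation(["L"], list("a;b"))
```

Joining each form in `steps` gives L, E;L, a;L, a;E, a;b, the same sequence as in the example. The length test relies on H having no ε-productions.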
LL(1) parsing • Efficient form of top-down parsing • Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: N × Σ → P in the “choose production” step of the algorithm. • When this is possible, the grammar is called LL(1)
LL(1) examples • Example 1: H: L → E ; L | E, E → a | b • Given input a;b, the next symbol is a. Which production should be used for L? We can’t tell from a alone, so H is not LL(1)
LL(1) examples • Example 2: Exp → Term Exp', Exp' → $ | + Exp, Term → id (use $ for the “end-of-input” symbol) • The grammar is LL(1): Exp and Term have only one production each; Exp' has two productions, but only one is applicable at any time.
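Because each choice in Example 2 depends only on the next token, the grammar maps directly onto one function per nonterminal. The following recursive-descent recogniser is a sketch; the function names and the use of `SyntaxError` for error reporting are assumptions.

```python
def parse_exp(tokens):
    """Recognise `tokens` (a list ending in "$") with one token of lookahead."""
    pos = 0

    def expect(tok):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def exp():            # Exp -> Term Exp'
        term()
        exp_prime()

    def exp_prime():      # Exp' -> $ | + Exp, chosen by the next token alone
        if pos < len(tokens) and tokens[pos] == "+":
            expect("+")
            exp()
        else:
            expect("$")

    def term():           # Term -> id
        expect("id")

    exp()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True
```

For example, `parse_exp("id + id + id $".split())` succeeds, while `parse_exp(["id", "+", "$"])` raises `SyntaxError` because Term cannot start with $.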
Nonrecursive predictive parsing • Maintain a stack explicitly, rather than implicitly via recursive calls • Key problem during predictive parsing: determining the production to be applied for a non-terminal
Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Set ip to point to the first symbol of w$.
Push $ and then S onto the stack, so that S is on top.
repeat
    let X be the symbol on top of the stack and a the symbol pointed to by ip
    if X is a terminal or $ then
        if X == a then
            pop X from the stack and advance ip
        else error()
    else // X is a nonterminal
        if M[X, a] == X → Y1 Y2 … Yk then
            pop X from the stack
            push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top
            (push nothing if Y1 Y2 … Yk is ε)
            output the production X → Y1 Y2 … Yk
        else error()
until X == $
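A table-driven sketch of this algorithm follows. Assumptions: the table M is a dict keyed by (nonterminal, lookahead); the demo grammar is Exp → Term Exp', Exp' → ε | + Exp, Term → id, a variant in which $ is handled by the stack bottom rather than by a production; and the input tokens do not themselves contain $.

```python
def predictive_parse(tokens, table, nonterminals, start):
    """Return the productions used in a leftmost derivation of `tokens`."""
    w = list(tokens) + ["$"]
    ip = 0
    stack = ["$", start]                   # top of the stack is the end of the list
    output = []
    while True:
        x, a = stack[-1], w[ip]
        if x not in nonterminals:          # X is a terminal or $
            if x != a:
                raise SyntaxError(f"expected {x!r}, saw {a!r}")
            stack.pop()
            ip += 1
            if x == "$":                   # matched the end marker: done
                return output
        else:
            if (x, a) not in table:
                raise SyntaxError(f"no entry M[{x}, {a}]")
            rhs = table[(x, a)]
            stack.pop()
            stack.extend(reversed(rhs))    # Y1 ends up on top; ε pushes nothing
            output.append((x, rhs))

M = {
    ("Exp", "id"): ["Term", "Exp'"],
    ("Exp'", "+"): ["+", "Exp"],
    ("Exp'", "$"): [],                     # Exp' -> ε
    ("Term", "id"): ["id"],
}
used = predictive_parse(["id", "+", "id"], M, {"Exp", "Exp'", "Term"}, "Exp")
```

The list `used` records the productions in the order a leftmost derivation applies them, which is the "output" step of the algorithm.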
LL(1) grammars • No left recursion: A → A α. If this production is chosen, the parse makes no progress. • No common prefixes: A → α β | α γ. Can fix by “left factoring”: A → α A', A' → β | γ
LL(1) grammars (cont.) • No ambiguity Precise definition requires that production to choose be unique (“choose” function M very hard to calculate otherwise)
LL(1) definition • Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations
    S ⇒* w A γ ⇒ w α γ ⇒* w t x
    S ⇒* w A γ ⇒ w β γ ⇒* w t y
it follows that α = β • Leftmost derivation: one where the leftmost non-terminal is always expanded first • In other words, given • 1. a string w A γ in (N ∪ Σ)* and • 2. t, the first terminal symbol to be derived from A γ • there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with w t
Checking LL(1)-ness • For any sequence of grammar symbols α, define the set FIRST(α) ⊆ Σ to be FIRST(α) = { a | α ⇒* a β for some β } • FIRST sets can often be calculated by inspection
FIRST Sets • Exp → Term Exp', Exp' → $ | + Exp, Term → id (use $ for the “end-of-input” symbol) • FIRST($) = {$} • FIRST(+ Exp) = {+} • FIRST($) ∩ FIRST(+ Exp) = {} ⇒ grammar is LL(1)
FIRST Sets • L → E ; L | E, E → a | b • FIRST(E ; L) = {a, b} = FIRST(E) • FIRST(E ; L) ∩ FIRST(E) ≠ {} ⇒ grammar is not LL(1).
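The two checks above can be mechanised with a fixed-point computation. This sketch assumes grammars without ε-productions (true of both examples when $ is treated as an ordinary terminal), so FIRST of a right-hand side depends only on its first symbol; the function names are illustrative.

```python
def first_of(rhs, grammar, first):
    """FIRST of a right-hand side; without ε-productions only rhs[0] matters."""
    x = rhs[0]
    return set(first[x]) if x in grammar else {x}

def first_sets(grammar):
    """FIRST set of every nonterminal, iterated until nothing changes."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of(rhs, grammar, first) - first[a]
                if new:
                    first[a] |= new
                    changed = True
    return first

def alternatives_are_disjoint(grammar):
    """True iff, for every nonterminal, its alternatives have disjoint FIRST sets."""
    first = first_sets(grammar)
    for alternatives in grammar.values():
        seen = set()
        for rhs in alternatives:
            f = first_of(rhs, grammar, first)
            if seen & f:
                return False
            seen |= f
    return True

EXP = {"Exp": [["Term", "Exp'"]], "Exp'": [["$"], ["+", "Exp"]], "Term": [["id"]]}
H = {"L": [["E", ";", "L"], ["E"]], "E": [["a"], ["b"]]}
```

Under the stated assumption, `alternatives_are_disjoint(EXP)` holds while `alternatives_are_disjoint(H)` does not, matching the two slides.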