
Parsing

Parsing. Giuseppe Attardi, Università di Pisa. Parsing computes the grammatical structure of a program, much like diagramming sentences, where tokens = “words” and programs = “sentences”.


Presentation Transcript


  1. Parsing Giuseppe Attardi Università di Pisa

  2. Parsing Calculate the grammatical structure of a program, like diagramming sentences, where: Tokens = “words”, Programs = “sentences”. For further information: Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)

  3. Outline of coverage • Context-free grammars • Parsing • Tabular Parsing Methods • One pass • Top-down • Bottom-up • Yacc

  4. Parser: extracts the grammatical structure of a program [figure: parse tree for cout << “hello, world\n” with nodes function-def, name, arguments, stmt-list, stmt, expression, operator, variable, string]

  5. Context-free languages Grammatical structure defined by a context-free grammar:
     statement → labeled-statement | expression-statement | compound-statement
     labeled-statement → ident : statement | case constant-expression : statement
     compound-statement → { declaration-list statement-list }
     “Context-free” = only one non-terminal in the left part of each production

  6. Parse trees Parse tree = tree labeled with grammar symbols, such that: • If a node is labeled A, and its children are labeled x1...xn, then there is a production A → x1...xn • “Parse tree from A” = root labeled with A • “Complete parse tree” = all leaves labeled with tokens

  7. Parse trees and sentences [figure: parse tree for L with frontier a ; E] • Frontier of tree = labels on leaves (in left-to-right order) • Frontier of a tree from S is a sentential form • Frontier of a complete tree from S is a sentence

  8. Example G: L → L ; E | E  E → a | b  Syntax trees from start symbol (L): [figure: three parse trees]  Sentential forms: a  a;E  a;b;b

  9. Derivations Alternate definition of sentence: • Given φ, ψ in V*, say φ ⇒ ψ is a derivation step if φ = φ′Aφ″ and ψ = φ′αφ″, where A → α is a production • ψ is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ ψ (alternatively, we say that S ⇒* ψ) The two definitions are equivalent, but note that there are many derivations corresponding to each parse tree

  10. Another example H: L → E ; L | E  E → a | b [figure: parse trees for grammar H]

  11. Ambiguity [figure: two parse trees for id + id * id] • For some purposes, it is important to know whether a sentence can have more than one parse tree • A grammar is ambiguous if there is a sentence with more than one parse tree • Example: E → E + E | E * E | id
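The two parse trees for id + id * id can be confirmed mechanically. The sketch below (not from the slides) counts the distinct parse trees that the ambiguous grammar E → E + E | E * E | id assigns to a token string, by trying every operator position as the root of the tree:

```python
from functools import lru_cache

tokens = ("id", "+", "id", "*", "id")

@lru_cache(maxsize=None)
def count_parses(i, j):
    """Number of distinct parse trees deriving tokens[i:j] from E."""
    n = 0
    if j - i == 1 and tokens[i] == "id":
        n += 1                       # E -> id
    for k in range(i + 1, j - 1):    # try each operator as the tree root
        if tokens[k] in ("+", "*"):  # E -> E + E  or  E -> E * E
            n += count_parses(i, k) * count_parses(k + 1, j)
    return n

print(count_parses(0, len(tokens)))  # 2: id + (id * id) and (id + id) * id
```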

  12. Notes • if e then if b then d else f • { int x; y = 0; } • A.b.c = d; • Id → s | s.id • E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id

  13. Ambiguity • Ambiguity is a function of the grammar rather than the language • Certain ambiguous grammars may have equivalent unambiguous ones

  14. Grammar Transformations • Grammars can be transformed without affecting the language generated • Three transformations are discussed next: • Eliminating Ambiguity • Eliminating Left Recursion (i.e., productions of the form A → Aα) • Left Factoring

  15. Eliminating Ambiguity • Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity • For example, expressions involving additions and products can be written as follows: • E → E + T | T • T → T * id | id • The language generated by this grammar is the same as that generated by the grammar in slide “Ambiguity”. Both generate id(+id|*id)* • However, this grammar is not ambiguous

  16. Eliminating Ambiguity (Cont.) [figure: parse tree for id + id * id, with the product nested under the sum] • One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions

  17. Eliminating Ambiguity (Cont.) • An example of ambiguity in a programming language is the dangling else • Consider • S → if b then S else S | if b then S | a

  18. Eliminating Ambiguity (Cont.) • When there are two nested ifs and only one else… [figure: two parse trees for if b then if b then a else a, one attaching the else to the inner if, the other to the outer if]

  19. Eliminating Ambiguity (Cont.) • In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar: • S → Matched | Unmatched • Matched → if b then Matched else Matched | a • Unmatched → if b then S | if b then Matched else Unmatched

  20. Eliminating Ambiguity (Cont.) • Ambiguity is a property of the grammar • It is undecidable whether a context-free grammar is ambiguous • The proof is done by reduction from Post’s correspondence problem • Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars

  21. Eliminating Ambiguity (Cont.) • For example, a grammar containing the production A → AA | a would be ambiguous, because the substring aaa has two parses: [figure: the two parse trees for aaa] • This ambiguity disappears if we use the productions • A → AB | B and B → a, or the productions • A → BA | B and B → a.

  22. Eliminating Ambiguity (Cont.) • Examples of ambiguous productions: A → AaA  A → aA | Ab  A → aA | aAbA • A CF language is inherently ambiguous if it has no unambiguous CFG • An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar: • S → AB | DC • A → aA | ε  C → cC | ε • B → bBc | ε  D → aDb | ε

  23. Elimination of Left Recursion • A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α • Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed • Immediate left recursion (productions of the form A → Aα) can be easily eliminated: 1. Group the A-productions as • A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn • where no βi begins with A 2. Replace the A-productions by • A → β1A′ | β2A′ | … | βnA′ • A′ → α1A′ | α2A′ | … | αmA′ | ε
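Steps 1 and 2 can be sketched mechanically. The representation below is my own (not from the slides): productions are tuples of symbols and () stands for ε.

```python
def eliminate_immediate_left_recursion(nt, prods):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn as
    A -> b1 A' | ... | bn A'  and  A' -> a1 A' | ... | am A' | epsilon."""
    alphas = [p[1:] for p in prods if p and p[0] == nt]  # left-recursive tails
    betas = [p for p in prods if not p or p[0] != nt]    # the other productions
    if not alphas:
        return {nt: prods}                               # nothing to do
    new_nt = nt + "'"
    return {nt: [b + (new_nt,) for b in betas],
            new_nt: [a + (new_nt,) for a in alphas] + [()]}

# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]))
```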

  24. Elimination of Left Recursion (Cont.) • The previous transformation, however, does not eliminate left recursion involving two or more steps • For example, consider the grammar • S → Aa | b • A → Ac | Sd | ε • S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive

  25. Elimination of Left Recursion (Cont.) Algorithm. Eliminate left recursion
      Arrange the nonterminals in some order A1, A2, …, An
      for i = 1 to n {
        for j = 1 to i - 1 {
          replace each production of the form Ai → Ajγ
          by the productions Ai → δ1γ | δ2γ | … | δnγ
          where Aj → δ1 | δ2 | … | δn are all the current Aj-productions
        }
        eliminate the immediate left recursion among the Ai-productions
      }
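The algorithm above can be sketched directly. This is a hypothetical implementation with my own representation (each nonterminal maps to a list of right-hand-side tuples, () meaning ε), applied to the grammar of the previous slide:

```python
def eliminate_left_recursion(grammar, order):
    """General left-recursion elimination, following the slide's algorithm."""
    def immediate(nt, prods):
        # A -> A a | b   becomes   A -> b A'  and  A' -> a A' | epsilon
        alphas = [p[1:] for p in prods if p and p[0] == nt]
        betas = [p for p in prods if not p or p[0] != nt]
        if not alphas:
            return {nt: prods}
        new_nt = nt + "'"
        return {nt: [b + (new_nt,) for b in betas],
                new_nt: [a + (new_nt,) for a in alphas] + [()]}

    g = {a: list(ps) for a, ps in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            # replace Ai -> Aj gamma by expanding every current Aj-production
            new_prods = []
            for p in g[ai]:
                if p and p[0] == aj:
                    new_prods.extend(d + p[1:] for d in g[aj])
                else:
                    new_prods.append(p)
            g[ai] = new_prods
        g.update(immediate(ai, g.pop(ai)))
    return g

# The grammar  S -> A a | b ;  A -> A c | S d | epsilon
g = eliminate_left_recursion(
    {"S": [("A", "a"), ("b",)], "A": [("A", "c"), ("S", "d"), ()]},
    ["S", "A"])
print(g)  # A -> b d A' | A'  and  A' -> c A' | a d A' | epsilon
```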

  26. Elimination of Left Recursion (Cont.) • To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side, and afterwards m > i in all productions of the form Ai → Amα • Induction proof: • Clearly true for i = 1 • If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Amα with m < i • Finally, with the elimination of self recursion, m in the Ai → Amα productions is forced to be > i • At the end of the algorithm, all derivations of the form Ai ⇒+ Amα will have m > i and therefore left recursion is not possible

  27. Left Factoring • Left factoring helps transform a grammar for predictive parsing • For example, if we have the two productions • S → if b then S else S • | if b then S on seeing the input token if, we cannot immediately tell which production to choose to expand S • In general, if we have A → αβ1 | αβ2 and the input begins with a nonempty string derived from α, we do not know (without looking further) which production to use to expand A

  28. Left Factoring (Cont.) • However, we may defer the decision by expanding A to αA′ • Then, after seeing the input derived from α, we may expand A′ to β1 or to β2 • Left-factored, the original productions become • A → αA′ • A′ → β1 | β2
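A single left-factoring pass can be sketched as follows (helper names are my own, not from the slides; productions are tuples of symbols and () is ε). It groups alternatives that start with the same symbol and factors out their longest common prefix:

```python
def common_prefix(prods):
    """Longest common prefix (as a tuple) of a list of symbol tuples."""
    pref = prods[0]
    for p in prods[1:]:
        i = 0
        while i < len(pref) and i < len(p) and pref[i] == p[i]:
            i += 1
        pref = pref[:i]
    return pref

def left_factor(nt, prods):
    """One left-factoring step: A -> a b1 | a b2  becomes
    A -> a A'  and  A' -> b1 | b2."""
    groups = {}
    for p in prods:
        groups.setdefault(p[:1], []).append(p)  # group by first symbol
    new = {nt: []}
    for group in groups.values():
        alpha = common_prefix(group)
        if len(group) == 1 or not alpha:
            new[nt].extend(group)               # nothing to factor
        else:
            new_nt = nt + "'"
            new[nt].append(alpha + (new_nt,))
            new[new_nt] = [p[len(alpha):] for p in group]
    return new

# The dangling-else productions from the previous slide:
dangling = [("if", "b", "then", "S", "else", "S"), ("if", "b", "then", "S")]
print(left_factor("S", dangling))
# S -> if b then S S'   and   S' -> else S | epsilon
```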

  29. Non-Context-Free Language Constructs • Examples of non-context-free languages are: • L1 = {wcw | w is of the form (a|b)*} • L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1} • L3 = {a^n b^n c^n | n ≥ 0} • Languages similar to these that are context free: • L′1 = {wcw^R | w is of the form (a|b)*} (w^R stands for w reversed) • This language is generated by the grammar S → aSa | bSb | c • L′2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1} • This language is generated by the grammar S → aSd | aAd  A → bAc | bc

  30. Non-Context-Free Language Constructs (Cont.) • L″2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1} • is generated by the grammar S → AB  A → aAb | ab  B → cBd | cd • L′3 = {a^n b^n | n ≥ 1} • is generated by the grammar S → aSb | ab • This language is not definable by any regular expression

  31. Non-Context-Free Language Constructs (Cont.) • Suppose we could construct a DFSM D accepting L′3 • D must have a finite number of states, say k • Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k • Since D only has k states, two of the states in the sequence have to be equal, say si = sj (i ≠ j) • From si, a sequence of i b’s leads to an accepting (final) state. Therefore, the same sequence of i b’s will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L′3. A contradiction.

  32. Parsing The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w (equivalently, find a derivation S ⇒* w) A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise) Two classes of algorithms for parsing: • Top-down • Bottom-up

  33. Parser generators • A parser generator is a program that reads a grammar and produces a parser • The best known parser generator is yacc; it produces bottom-up parsers • Most parser generators - including yacc - do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator

  34. Top-down parsing • Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal. • Algorithm: (next slide)

  35. Top-down parsing (cont.) Let input = a1a2...an current sentential form (csf) = S loop { suppose csf = a1…ak Aγ based on ak+1…, choose production A → β csf becomes a1…ak βγ }

  36. Top-down parsing example Grammar H: L → E ; L | E  E → a | b  Input: a;b [figure: partial parse trees]
      Sentential form   Input
      L                 a;b
      E;L               a;b
      a;L               a;b

  37. Top-down parsing example (cont.) [figure: completed parse trees]
      Sentential form   Input
      a;E               a;b
      a;b               a;b

  38. LL(1) parsing • Efficient form of top-down parsing • Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: N × Σ → P in the “choose production” step of the algorithm • When this is possible, the grammar is called LL(1)

  39. LL(1) examples • Example 1: H: L → E ; L | E  E → a | b Given input a;b, the next symbol is a. Which production to use? Can’t tell ⇒ H is not LL(1)

  40. LL(1) examples • Example 2: Exp → Term Exp′  Exp′ → $ | + Exp  Term → id (Use $ for the “end-of-input” symbol.) • Grammar is LL(1): Exp and Term have only one production; Exp′ has two productions but only one is applicable at any time
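Because one lookahead token determines each production, this grammar maps directly onto a recursive-descent parser with one function per nonterminal. The sketch below is not from the slides; it returns True on success and raises SyntaxError otherwise:

```python
def parse(tokens):
    """Recursive-descent parser for  Exp -> Term Exp' ;
    Exp' -> $ | + Exp ;  Term -> id  ($ is appended as end-of-input)."""
    toks = list(tokens) + ["$"]
    pos = 0

    def peek():
        return toks[pos]

    def eat(t):
        nonlocal pos
        if toks[pos] != t:
            raise SyntaxError(f"expected {t}, got {toks[pos]}")
        pos += 1

    def exp():                 # Exp -> Term Exp'
        term()
        exp_prime()

    def exp_prime():
        if peek() == "+":      # FIRST(+ Exp) = {+}
            eat("+")
            exp()
        else:                  # FIRST($) = {$}
            eat("$")

    def term():                # Term -> id
        eat("id")

    exp()
    return True

print(parse(["id", "+", "id"]))  # True
```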

  41. Nonrecursive predictive parsing • Maintain a stack explicitly, rather than implicitly via recursive calls • Key problem during predictive parsing: determining the production to be applied for a non-terminal

  42. Nonrecursive predictive parsing • Algorithm. Nonrecursive predictive parsing
      Set ip to point to the first symbol of w$
      repeat
        Let X be the top-of-stack symbol and a the symbol pointed to by ip
        if X is a terminal or $ then
          if X == a then
            pop X from the stack and advance ip
          else error()
        else // X is a nonterminal
          if M[X, a] == X → Y1 Y2 … Yk then
            pop X from the stack
            push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top
            (push nothing if Y1 Y2 … Yk is ε)
            output the production X → Y1 Y2 … Yk
          else error()
      until X == $
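A sketch of this algorithm in Python, with my own representation: the table M maps (nonterminal, lookahead) to a production body. For the example grammar I give Exp′ an ε-production on $ rather than having it consume an explicit $ terminal, a small deviation from the earlier slide:

```python
def predictive_parse(table, start, tokens):
    """Nonrecursive predictive (LL(1)) parsing driven by a table.
    Returns the productions applied, in leftmost-derivation order."""
    nonterminals = {nt for nt, _ in table}
    stack = ["$", start]
    toks = list(tokens) + ["$"]
    pos = 0
    output = []
    while True:
        X, a = stack[-1], toks[pos]
        if X not in nonterminals:          # X is a terminal or $
            if X != a:
                raise SyntaxError(f"expected {X}, got {a}")
            stack.pop()
            if X == "$":
                return output
            pos += 1
        elif (X, a) in table:
            body = table[(X, a)]
            output.append((X, body))
            stack.pop()
            stack.extend(reversed(body))   # push Yk ... Y1, with Y1 on top
        else:
            raise SyntaxError(f"no table entry for {X} on lookahead {a}")

# LL(1) table for  Exp -> Term Exp' ;  Exp' -> + Exp | eps ;  Term -> id
table = {("Exp", "id"): ("Term", "Exp'"),
         ("Exp'", "+"): ("+", "Exp"),
         ("Exp'", "$"): (),                # epsilon on end-of-input
         ("Term", "id"): ("id",)}
print(predictive_parse(table, "Exp", ["id", "+", "id"]))
```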

  43. LL(1) grammars • No left recursion A → Aα: if this production is chosen, the parse makes no progress • No common prefixes A → αβ | αγ Can fix by “left factoring”: A → αA′  A′ → β | γ

  44. LL(1) grammars (cont.) • No ambiguity Precise definition requires that production to choose be unique (“choose” function M very hard to calculate otherwise)

  45. Top-down Parsing [figure: parse tree with root L (the start symbol) and children E0 … En, grown downwards as the input tokens <t0, t1, …, ti, ...> are consumed] From left to right, “grow” the parse tree downwards

  46. Checking LL(1)-ness • For any sequence of grammar symbols α, define the set FIRST(α) ⊆ Σ to be FIRST(α) = { a | α ⇒* aβ for some β }

  47. LL(1) definition • Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations (in which the leftmost non-terminal is always expanded first) S ⇒* wAα ⇒ wβα ⇒* wtx S ⇒* wAα ⇒ wγα ⇒* wty • it follows that β = γ • In other words, given • 1. a string wAα in V* and • 2. t, the first terminal symbol to be derived from A • there is at most one production that can be applied to A to • yield a derivation of any terminal string beginning with wt • FIRST sets can often be calculated by inspection

  48. FIRST Sets • Exp → Term Exp′ • Exp′ → $ | + Exp • Term → id • (Use $ for the “end-of-input” symbol) FIRST($) = {$} FIRST(+Exp) = {+} FIRST($) ∩ FIRST(+Exp) = {} ⇒ grammar is LL(1)

  49. FIRST Sets • L → E ; L | E  E → a | b FIRST(E ; L) = {a, b} = FIRST(E) FIRST(E ; L) ∩ FIRST(E) ≠ {} ⇒ grammar is not LL(1)

  50. Computing FIRST Sets • Algorithm. Compute FIRST(X) for all grammar symbols X
      forall X ∈ V do FIRST(X) = {}
      forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
      forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
      repeat
        c: forall productions X → Y1 Y2 … Yk do
          forall i ∈ [1, k] do
            FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
            if ε ∉ FIRST(Yi) then continue c
          FIRST(X) = FIRST(X) ∪ {ε}
      until no more terminals or ε are added to any FIRST set
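The fixed-point computation above can be sketched as follows (my own representation, not from the slides: each nonterminal maps to a list of right-hand-side tuples, and the empty string "" stands for ε), applied to the Exp grammar of slide 48:

```python
def first_sets(grammar, terminals):
    """Compute FIRST for every grammar symbol by iterating to a fixed point."""
    first = {t: {t} for t in terminals}          # FIRST(a) = {a} for terminals
    for nt in grammar:
        first.setdefault(nt, set())              # FIRST(X) starts empty
    changed = True
    while changed:
        changed = False
        for X, prods in grammar.items():
            for body in prods:
                nullable = True                  # does the whole body derive eps?
                for Y in body:
                    add = first[Y] - {""}
                    if not add <= first[X]:
                        first[X] |= add
                        changed = True
                    if "" not in first[Y]:       # Y not nullable: stop here
                        nullable = False
                        break
                if nullable and "" not in first[X]:
                    first[X].add("")             # every Yi nullable: add eps
                    changed = True
    return first

g = {"Exp": [("Term", "Exp'")],
     "Exp'": [("$",), ("+", "Exp")],
     "Term": [("id",)]}
fs = first_sets(g, {"id", "+", "$"})
print(fs["Exp'"])  # {'$', '+'} -- the two alternatives are disjoint, so LL(1)
```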
