270 likes | 280 Views
Fall 2011. The Chinese University of Hong Kong. CSCI 3130: Automata theory and formal languages. Ambiguity Parsing algorithms. Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130. Ambiguity. E E + E | E * E | ( E ) | N N 1 N | 2 N | 1 | 2. . E. 1+2*2. E. E. E.
E N D
Fall 2011 The Chinese University of Hong Kong CSCI 3130: Automata theory and formal languages AmbiguityParsing algorithms Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130
Ambiguity E E + E | E * E | (E)| N N 1N | 2N | 1 | 2 E 1+2*2 E E E * E + E + E E N = 5 = 6 N E * E N N 2 N N 1 A CFG is ambiguous if some string has more than one parse tree 1 2 2 2
Example Is S SS | x ambiguous? Yes, because S S S S S S S S S S xxx x x x x x x
Disambiguation S SS | x S Sx | x S S S x x x Sometimes we can rewrite the grammar to remove ambiguity
Disambiguation E E + E | E * E | (E)| N N 1N | 2N | 1 | 2 same precedence! F F T T Divide expression into terms and factors F F 2 * (1 + 2 * 2)
Disambiguation E E + E | E * E | (E)| N N 1N | 2N | 1 | 2 An expression is a sum of one or more terms E T | E + T Each term is a product of one or more factors T F | T * F Each factor is a parenthesized expression or a number F (E) |1 | 2
Parsing example E E T | E + T T F | T * F F (E) |1 | 2 E T + T F F T * E ( ) F E + T E T T F + * T F F F 2 * (1 + 1 + 2 * 2) + 1
Disambiguation • Disambiguation is not always possible because • There exist inherently ambiguous languages • There is no general procedure for disambiguation • In programming languages, ambiguity comes from precedence rules, and we can do like in example • In English, ambiguity is sometimes a problem: He ate the cookies on the floor
Ambiguity in English He ate the cookies on the floor
Parsing S → 0S1 | 1S0S | T T → S | e input: 0011 How would we program the computer to build a parse tree for us?
Parsing S → 0S1 | 1S0S | T T → S | e input: 0011 ... S 0S1 00S11 000S111 ⇒ ⇒ ⇒ ⇒ 01S0S1 ⇒ ... ⇒ 0T1 00T11 ⇒ 00S11 ✔ ⇒ 0011 1S0S 10S10S ⇒ ⇒ ... ... T S ⇒ ⇒ ⇒ First idea: Try all derivations
Problems Trying all derivations may take a very long time If input is not in the language, parsing will never stop
When to stop Idea 2: Stop when S → 0S1 | 1S0S | T T → S | e |derived string| > |input| Problems: S T S T … S 0S1 0T1 01 Derived strings may shrink because of “e-productions” 1 3 3 Derivation may loop because of “unit productions” 2 Task: remove e and unit productions
Removal of -productions • A variable N is nullable if it derives the empty string * N • Identify all nullable variables N • Remove nullable variables carefully • If start variable S is nullable: Add a new start variable S’Add special productionsS’ → S |
Example grammar nullable variables B C D S ACD A a B C ED | D BC | b E b • Repeat the following: • If X , mark X as nullable • If X YZ…W, all marked nullable, • mark X as nullable also. • Identify all nullable variables
Eliminating e-productions D C S AD D B D e S AC S A C E S ACD A a B C ED | D BC | b E b • For every nullable N: • If you see X → N, add X → nullable:B, C, D • If you see N → , remove it. • Remove nullable variables carefully
Eliminating unit productions • A unit production is a production of the form A → B grammar: unit productions graph: S → 0S1 | 1S0S | T T → S | R | R → 0SR S T R
Removal of unit productions • If there is a cycle of unit productionsdelete it and replace everything with A A → B→ ... → C→ A S T S → 0S1 | 1S0S | T T → S | R | R → 0SR S → 0S1 | 1S0S S → R | R → 0SR R replace T by S
Removal of unit productions • Replace every chainby A → , B→ ,... , C → A → B→ ... → C→ S S → 0S1 | 1S0S | R | R → 0SR S → 0S1 | 1S0S | 0SR | R → 0SR R S → R → 0SR is replaced by S → 0SR, R → 0SR
Recap If input is not in the language, parsing will never stop Problem: Solution: Eliminate productions Eliminate unit productions important to do in this order Try all possible derivations but stop parsing when |derived string| > |input|
Example S → 0S1 | 0S0S | T T → S | 0 S → 0S1 | 1S0S | 0 input:0011 conclusion: 0011∉ L S 0 ⇒ ✘ 001 0S1 ✘ ⇒ ⇒ too long 00S11 too long 00S0S1 0S0S 000S 0000 ⇒ ✘ ⇒ ⇒ too long too long 1000S1 00S10S too long too long 1000S0S 00S0S0S
Problems Trying all derivations may take a very long time If input is not in the language, parsing will never stop
Preparations • To use it we must prepare the CFG: A faster way to parse:the Cocke-Younger-Kasami algorithm Eliminate productions Eliminate unit productions Convert CFG to Chomsky Normal Form
Chomsky Normal Form • A CFG is in Chomsky Normal Formif every production* has the form A → a A → BC or Noam Chomsky Convert to Chomsky Normal Form: A → BcDE A → BX X → CY Y → DE A → BCDE C → c break up sequences with new variables replace terminals with new variables C → c * Exception: We allow S→ e for start variable only
Cocke-Younger-Kasami algorithm SAC S AB | BC A BA | a B CC | b C AB | a – SAC – B B SA B SC SA B AC AC B AC x = baaba b a a b a Idea: We generate each substring of x bottom up
SAC – SAC – B B SA B SC SA B AC AC B AC b a a b a Parse tree reconstruction S AB | BC A BA | a B CC | b C AB | a x = baaba Tracing back the derivations, we obtain the parse tree
Cocke-Younger-Kasami algorithm Grammar without e and unit productions in Chomsky Normal Form table cells Input stringx = x1…xk 1k … … • For all cells in last rowIf there is a production A xiPutA in table cell ii • For cells st in other rows If there is a production A BC whereB is in cell sj and C is in cell (j+1)tPutA in cell st 23 12 22 kk 11 x1 x2 … xk s j t k 1 Cell ij remembers all possible derivations of substring xi…xj