Parsing

Parsing Chapter 15

The Job of a Parser
Given a context-free grammar G:
• Examine a string and decide whether or not it is a syntactically well-formed member of L(G), and
• If it is, assign to it a parse tree that describes its structure and thus can be used as the basis for further interpretation.

Problems with Solutions So Far
• We want to use a natural grammar that will produce a natural parse tree. But:
    • decideCFLusingGrammar requires a grammar that is in Chomsky normal form.
    • decideCFLusingPDA requires a grammar that is in Greibach normal form.
• We want an efficient parser. But both procedures require search and take time that grows exponentially in the length of the input string.
• All either procedure does is determine membership in L(G). Neither produces parse trees.

Easy Issues
• Actually building parse trees: Augment the parser with a function that builds a chunk of tree every time a rule is applied.
• Using lookahead to reduce nondeterminism: It is often possible to reduce (or even eliminate) nondeterminism by allowing the parser to look ahead at the next one or more input symbols before it makes a decision about what to do.

Dividing the Process
• Lexical analysis: done in linear time with a DFSM.
• Parsing: done in, at worst, O(n^3) time.

Lexical Analysis
level = observation - 17.5;
Lexical analysis produces a stream of tokens:
id = id - id

Specifying id with a Grammar
id → identifier | integer | float
identifier → letter alphanum
alphanum → letter alphanum | digit alphanum | ε
integer → - unsignedint | unsignedint
unsignedint → digit | digit unsignedint
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
…

Using Regular Expressions to Specify an FSM
There exist simple tools for building lexical analyzers. The first important such tool: Lex

Lex Rules
Get rid of blanks and tabs:
    [ \t]+ ;
Find identifiers:
    [A-Za-z][A-Za-z0-9]* {return(ID); }
Return INTEGER and save a value:
    [0-9]+ {sscanf(yytext, "%d", &yylval); return (INTEGER); }

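Dealing with Rule Conflicts
• A longer match is preferred over a shorter one.
• When lengths are equal, choose the rule that appears first.
• Suppose that Lex has been given the following two rules:
    integer {action 1}
    [a-z]+ {action 2}
• Example 1: integers (only the second rule matches all eight characters, so action 2 runs)
• Example 2: integer (both rules match all seven characters, so the first rule wins and action 1 runs)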
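The two conflict rules can be sketched in a few lines of Python. This is only an illustration of the matching policy, not how Lex itself is implemented (Lex compiles the rules into a single DFSM). The token names KEYWORD_INTEGER, WORD and SKIP and the function tokenize are hypothetical names chosen for this sketch.

```python
import re

# Hypothetical rule table, listed in the order the rules would appear in a Lex file.
RULES = [
    ("KEYWORD_INTEGER", re.compile(r"integer")),  # rule 1: integer {action 1}
    ("WORD", re.compile(r"[a-z]+")),              # rule 2: [a-z]+  {action 2}
    ("SKIP", re.compile(r"[ \t]+")),              # blanks and tabs are discarded
]

def tokenize(text):
    """Split text into (token name, lexeme) pairs using longest match, first rule on ties."""
    pos, tokens = 0, []
    while pos < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so on a tie the rule listed first is kept.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        if best_name is None:
            raise ValueError(f"no rule matches at position {pos}")
        if best_name != "SKIP":
            tokens.append((best_name, text[pos:pos + best_len]))
        pos += best_len
    return tokens
```

With these rules, tokenize("integers") yields a single WORD token, while tokenize("integer") yields a KEYWORD_INTEGER token.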

Parsing
• Top-down parsers:
    • A simple but inefficient recursive descent parser.
    • Modifying a grammar for top-down parsing.
    • LL parsing.
• Bottom-up parsers:
    • The simple but not efficient enough Cocke-Kasami-Younger (CKY) algorithm.
    • LR parsing.
• Parsers for English and other natural languages.

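Left-Recursive Rules
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
On input id + id + id, a top-down parser first expands E to E + T. Then it expands the new leftmost E to E + T again. And so forth: the parser can keep applying E → E + T without ever consuming an input symbol.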

Removing Left-Recursive Rules

Modifying the Expression Grammar
E → E + T
E → T
T → T * F
T → F
F → (E)
F → id
becomes
E → T E'
E' → + T E'
E' → ε
T → F T'
T' → * F T'
T' → ε
F → (E)
F → id
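The rewriting used here replaces A → Aα | β by A → βA', A' → αA' | ε. A minimal Python sketch of that one step follows, assuming rules are given as tuples of symbols and the empty tuple stands for ε. The function name remove_immediate_left_recursion and the convention of naming the new nonterminal by appending a prime are choices made for this sketch. It handles only immediate left recursion.

```python
EPSILON = ()  # the empty right-hand side stands for ε

def remove_immediate_left_recursion(nonterminal, alternatives):
    """Rewrite A → A α | β as A → β A', A' → α A' | ε.

    alternatives is a list of right-hand sides, each a tuple of symbols.
    Returns a dict mapping each nonterminal to its new list of right-hand sides.
    """
    new_nt = nonterminal + "'"
    # The α parts: what follows the leading A in each left-recursive rule.
    tails = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # The β parts: every rule that does not start with A.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not tails:
        return {nonterminal: list(alternatives)}  # nothing to change
    return {
        nonterminal: [beta + (new_nt,) for beta in others],
        new_nt: [alpha + (new_nt,) for alpha in tails] + [EPSILON],
    }
```

Applied to the E rules, remove_immediate_left_recursion("E", [("E", "+", "T"), ("T",)]) produces E → T E' and E' → + T E' | ε, which agrees with the modified grammar on this slide.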

Indirect Left Recursion
S → Ya
Y → Sa
Y → ε
This form too can be eliminated.

But There is a Price

Using Lookahead and Left Factoring
Goal: Procrastinate branching as long as possible. To do that, we will:
• Change the parsing algorithm so that it exploits the ability to look one symbol ahead in the input before it makes a decision about what to do next, and
• Change the grammar to help the parser procrastinate decisions.

Exploiting Lookahead
(1) F → (E)
(2) F → id
(3) F → id(E)
Looking ahead one character makes it possible to choose between rule (1) and rules (2)/(3). But how is it possible to choose between (2) and (3)?

Left Factoring
(1) F → (E)
(2) F → id
(3) F → id(E)
becomes
(1) F → (E)
(1.5) F → id X
(2) X → ε
(3) X → (E)
More generally:
A → αβ1
A → αβ2
…
A → αβn
becomes
A → αA'
A' → β1
A' → β2
…
A' → βn
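The general pattern can be sketched as a single pass over the rules of one nonterminal. This is an illustrative sketch only: the function name left_factor is hypothetical, rules are tuples of symbols with the empty tuple standing for ε, and the new nonterminal is named by appending primes (the slide calls it X). It groups rules by their first symbol and factors out the longest prefix each group shares. A full implementation would repeat the pass until no two rules share a first symbol.

```python
def left_factor(nonterminal, alternatives):
    """One pass of left factoring: A → α β1 | ... | α βn becomes A → α A', A' → β1 | ... | βn."""
    # Group the right-hand sides by their first symbol (None for ε).
    groups = {}
    for alt in alternatives:
        key = alt[0] if alt else None
        groups.setdefault(key, []).append(alt)

    rules = {nonterminal: []}
    primes = 0
    for key, group in groups.items():
        if key is None or len(group) == 1:
            rules[nonterminal].extend(group)  # nothing to factor
            continue
        # Longest common prefix α of all right-hand sides in this group.
        prefix = group[0]
        for alt in group[1:]:
            n = 0
            while n < min(len(prefix), len(alt)) and prefix[n] == alt[n]:
                n += 1
            prefix = prefix[:n]
        primes += 1
        new_nt = nonterminal + "'" * primes
        rules[nonterminal].append(prefix + (new_nt,))
        rules[new_nt] = [alt[len(prefix):] for alt in group]
    return rules
```

For the three F rules above, the result is F → (E) | id F' and F' → ε | (E), which is the slide's factored grammar with F' playing the role of X.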

Predictive Parsing
It will be possible to build a predictive top-down parser for a grammar G iff:
• Every string that is generated by G has a unique left-most derivation, and
• It is possible to determine each step in that derivation by looking ahead some fixed number k of characters.
In this case, we say that G is LL(k).

LL(k) Grammars
• An LL(k) grammar allows a predictive parser:
    • that scans its input Left to right
    • to build a Left-most derivation
    • if it is allowed k lookahead symbols.
• Every LL(k) grammar is unambiguous (because every string it generates has a unique left-most derivation).
• But not every unambiguous grammar is LL(k).

Two Important Functions
• first(α) is the set of terminal symbols that can occur as the first symbol in any string derived from α using R_G. If α derives ε, then ε ∈ first(α).
• follow(A) is the set of all terminal symbols that can immediately follow whatever A produces in some string in L(G).

Computing First and Follow
S → AXB$
A → aA | ε
X → c | ε
B → bB | ε
first(S) = {a, c, b, $}.
first(A) = {a, ε}.
first(AX) = {a, c, ε}.
first(AXB) = {a, c, b, ε}.
follow(S) = ∅.
follow(A) = {c, b, $}.
follow(X) = {b, $}.
follow(B) = {$}.
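These sets can be computed mechanically by iterating to a fixed point. The sketch below is one common way to do it, written for the grammar on this slide. The names GRAMMAR, first_of_string, compute_first and compute_follow are chosen for this illustration, and the empty tuple stands for an ε rule.

```python
EPS = "ε"

# The slide's grammar: S → AXB$, A → aA | ε, X → c | ε, B → bB | ε.
GRAMMAR = {
    "S": [("A", "X", "B", "$")],
    "A": [("a", "A"), ()],
    "X": [("c",), ()],
    "B": [("b", "B"), ()],
}

def first_of_string(symbols, first):
    """first() of a sequence of symbols, given the first sets known so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}  # a terminal starts with itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)  # every symbol can vanish (or the sequence is empty)
    return result

def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:  # repeat until no set grows
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue  # follow is only defined for nonterminals
                    rest = first_of_string(alt[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:  # everything after sym can vanish
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow
```

Running compute_first() and then compute_follow(first) on this grammar reproduces the sets listed on the slide, including the empty follow(S).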

When is a Grammar LL(1)?
Whenever G contains two competing rules A → α and A → β, all of the following are true:
• No terminal symbol is an element of both first(α) and first(β).
• ε cannot be derived from both α and β.
• If ε can be derived from one of α or β, assume it is α. Then there may be two competing derivations:
    S ⇒* γ1 A γ2 ⇒ γ1 α γ2 ⇒* γ1 γ2
  and
    S ⇒* γ1 A γ2 ⇒ γ1 β γ2
  So there must be no terminal symbol that is an element of both follow(A) and first(β).

Not Every CF Language is LL(k)
• No inherently ambiguous language is LL(k).
• Some others aren't either:
    • {a^n b^n c^m d : n, m ≥ 0} ∪ {a^n b^m c^m e : n, m ≥ 0}
    • {a^n b^n : n ≥ 0} ∪ {a^n c^n : n ≥ 0} (deterministic CF)

Recursive Descent Parsing
A → BA | a
B → bB | b
A(n: parse tree node labeled A) =
    case (lookahead = b: /* Use A → BA. */
              Invoke B on a new daughter node labeled B.
              Invoke A on a new daughter node labeled A.
          lookahead = a: /* Use A → a. */
              Create a new daughter node labeled a.)
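The procedure for A translates almost directly into code. The Python sketch below follows it and adds a procedure for B, which the slide does not show. As an assumption of this sketch, B chooses between B → bB and B → b by checking whether another b follows, so it consumes b's greedily. Trees are represented as (label, children) tuples, and the function names are illustrative.

```python
def parse_A(tokens, pos):
    """Parse an A starting at tokens[pos]; return (tree, next position)."""
    lookahead = tokens[pos] if pos < len(tokens) else None
    if lookahead == "b":  # use A → B A
        b_node, pos = parse_B(tokens, pos)
        a_node, pos = parse_A(tokens, pos)
        return ("A", [b_node, a_node]), pos
    if lookahead == "a":  # use A → a
        return ("A", ["a"]), pos + 1
    raise SyntaxError(f"expected a or b at position {pos}")

def parse_B(tokens, pos):
    """Parse a B starting at tokens[pos]; return (tree, next position)."""
    if pos >= len(tokens) or tokens[pos] != "b":
        raise SyntaxError(f"expected b at position {pos}")
    pos += 1
    # Assumption (not from the slide): continue with B → b B while another b follows.
    if pos < len(tokens) and tokens[pos] == "b":
        child, pos = parse_B(tokens, pos)
        return ("B", ["b", child]), pos
    return ("B", ["b"]), pos  # use B → b

def parse(text):
    """Parse a whole string of a's and b's as an A, or raise SyntaxError."""
    tree, pos = parse_A(list(text), 0)
    if pos != len(text):
        raise SyntaxError(f"unexpected input at position {pos}")
    return tree
```

For example, parse("bba") builds an A node whose daughters are a B covering bb and an A covering a, while parse("bb") raises SyntaxError because no a terminates the string.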

Table-Driven LL(1) Parsing
S → AB$ | AC$
A → aA | a
B → bB | b
C → c

Bottom-Up Parsing
• Cocke-Kasami-Younger (CKY)
• Shift-reduce parsing
• LR(1) parsing

CKY
• Bottom-up
• Chart parser
• Dynamic programming
• Grammar in Chomsky normal form
[Table T: rows 1 (bottom) through 5 (top), built over the input string id + id * id]

Exploiting Chomsky Normal Form
• All rules have one of the following two forms:
    • X → a, where a ∈ Σ, or
    • X → BC, where B and C are elements of V - Σ.
• So we need two techniques for filling in T:
    • To fill in row 1, use rules of the form X → a.
    • To fill in rows 2 through n, use rules of the form X → BC.

The CKY Algorithm
/* Fill in the first (bottom-most) row of T. */
For j = 1 to n do:
    If G contains the rule X → a_j, then add X to T[1, j].
/* Fill in the remaining rows, starting with row 2. */
For i = 2 to n do:
    For j = 1 to n-i+1 do:
        For k = 1 to i-1 do:
            For each rule X → YZ do:
                If Y ∈ T[k, j] and Z ∈ T[i-k, j+k], then:
                    Insert X into T[i, j].
If S_G ∈ T[n, 1] then accept else reject.

A CKY Example
Consider parsing the string aab with the grammar:
S → A B
A → A A
A → a
B → a
B → b
CKY begins by filling in the bottom row of T as follows:
Row 3:
Row 2:
Row 1:         A, B    A, B    B
Input string:  a       a       b

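A CKY Example
S → A B
A → A A
A → a
B → a
B → b
Filling in the remaining rows with the rules S → A B and A → A A gives:
Row 3:         S
Row 2:         S, A    S
Row 1:         A, B    A, B    B
Input string:  a       a       b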
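The CKY algorithm is short enough to run directly. The Python sketch below keeps the slides' indexing, where T[i, j] holds the nonterminals that derive the substring of length i starting at position j. The grammar S → A B, A → A A, A → a, B → a, B → b is hard-coded, and the names UNARY, BINARY and cky are chosen for this illustration.

```python
from collections import defaultdict

# The example grammar in Chomsky normal form.
UNARY = {"a": {"A", "B"}, "b": {"B"}}        # A → a, B → a, B → b
BINARY = [("S", "A", "B"), ("A", "A", "A")]  # S → A B, A → A A

def cky(word, start="S"):
    """Return (accepted, T) for the given input string."""
    n = len(word)
    T = defaultdict(set)
    # Row 1: rules of the form X → a.
    for j in range(1, n + 1):
        T[1, j] |= UNARY.get(word[j - 1], set())
    # Rows 2..n: rules of the form X → Y Z.
    for i in range(2, n + 1):              # length of the substring
        for j in range(1, n - i + 2):      # start position
            for k in range(1, i):          # length of the left part
                for x, y, z in BINARY:
                    if y in T[k, j] and z in T[i - k, j + k]:
                        T[i, j].add(x)
    return start in T[n, 1], T
```

On the input aab this accepts, with S in T[3, 1]. On ba it rejects, because no rule combines B with A or B.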

The Complexity of CKY
/* Fill in the first (bottom-most) row of T. */
For j = 1 to n do:                                          /* O(n) */
    If G contains the rule X → a_j, then add X to T[1, j].
/* Fill in the remaining rows, starting with row 2. */
For i = 2 to n do:                                          /* n - 1 iterations */
    For j = 1 to n-i+1 do:                                  /* n/2 iterations (on average) */
        For k = 1 to i-1 do:                                /* n/2 iterations (on average) */
            For each rule X → YZ do:                        /* O(|G|) */
                If Y ∈ T[k, j] and Z ∈ T[i-k, j+k], then:   /* O(1) */
                    Insert X into T[i, j].
If S_G ∈ T[n, 1] then accept else reject.
Total: O(n^3)
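The cubic bound can be checked by counting exactly how often the innermost test runs for each rule. The derivation below uses only the loop bounds of the algorithm, with the substitution m = i - 1 in the middle step.

```latex
\sum_{i=2}^{n} \sum_{j=1}^{n-i+1} \sum_{k=1}^{i-1} 1
  = \sum_{i=2}^{n} (n-i+1)(i-1)
  = \sum_{m=1}^{n-1} (n-m)\,m
  = \frac{n(n-1)(n+1)}{6}
  = \frac{n^{3}-n}{6}
```

So the test runs (n^3 - n)/6 times per rule. That is O(n^3) for a fixed grammar, and O(n^3 |G|) when the grammar size is counted as well. For n = 3 the formula gives 4, which matches the four (i, j, k) combinations visited when parsing a three-symbol string.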

Context-Free Parsing and Matrix Multiplication
• CF parsing can be described as Boolean matrix multiplication.
    • Strassen's algorithm: O(n^2.807)
    • Coppersmith-Winograd algorithm: O(n^2.376)
• Boolean matrix multiplication can be described as CF parsing.
    • If P is an O(g n^(3-ε)) CF parser, then P can be efficiently converted into an O(n^(3-ε/3)) matrix multiplier.

Shift-Reduce Parsing
A bottom-up left-to-right parser that can do two things:
• Shift an input symbol onto the parser's stack and build, in the parse tree, a terminal node labeled with that input symbol.
• Reduce a string of symbols from the top of the stack to a nonterminal symbol, using one of the rules of the grammar. Each time it does this, it also builds the corresponding piece of the parse tree.

A Shift-Reduce Example
Parse: id + id * id
Using:
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id
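The hard part of shift-reduce parsing is deciding, at each step, whether to shift or to reduce, and by which rule. LR parsers read that decision from a table. The Python sketch below instead hand-codes the decisions for this one grammar, which is an assumption of the sketch and not a general method. It records the numbers of the rules it reduces by rather than building tree nodes. The names choose_reduction and shift_reduce are illustrative.

```python
RULES = {
    1: ("E", ("E", "+", "T")),
    2: ("E", ("T",)),
    3: ("T", ("T", "*", "F")),
    4: ("T", ("F",)),
    5: ("F", ("(", "E", ")")),
    6: ("F", ("id",)),
}

def choose_reduction(stack, lookahead):
    """Return the number of the rule to reduce by, or None if the parser should shift."""
    def ends_with(*symbols):
        return tuple(stack[-len(symbols):]) == symbols

    if ends_with("id"):
        return 6
    if ends_with("(", "E", ")"):
        return 5
    if ends_with("T", "*", "F"):
        return 3
    if ends_with("F"):
        return 4
    # With T on top, a following * must be shifted first so that * binds tighter than +.
    if ends_with("T") and lookahead != "*":
        return 1 if ends_with("E", "+", "T") else 2
    return None

def shift_reduce(tokens):
    """Return (accepted, list of rule numbers in the order they were applied)."""
    stack, pos, reductions = [], 0, []
    while True:
        lookahead = tokens[pos] if pos < len(tokens) else None
        rule = choose_reduction(stack, lookahead)
        if rule is not None:            # reduce: replace the right-hand side by the left-hand side
            lhs, rhs = RULES[rule]
            del stack[-len(rhs):]
            stack.append(lhs)
            reductions.append(rule)
        elif lookahead is not None:     # shift the next input symbol
            stack.append(lookahead)
            pos += 1
        else:                           # no input left and nothing to reduce
            return stack == ["E"], reductions
```

On the tokens id + id * id the sketch accepts and applies rules 6, 4, 2, 6, 4, 6, 3, 1 in that order, so the multiplication is reduced before the addition.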

