Parsing

Goals of Parsing • Check the input for syntactic accuracy • Return appropriate error messages • Recover if possible • Produce, or at least traverse, a complete parse tree • Parse tree (or trace) is basis for translation

Top-down Parsers • Parse tree is built from the root down to the leaves • Builds parse tree in preorder • Corresponds to a leftmost derivation • Parsing decision problem: choosing correct rule • Two most common algorithms: • Recursive Descent – implemented in code • Table driven implementation • Both are LL algorithms (left-to-right scan, left-most derivation)

Bottom-up Parsers • Parse tree is built from the leaves up to the root • Builds the parse in the reverse order of a rightmost derivation • Requires finding a handle, that is, the substring of the current sentential form that matches the RHS of the rule to reduce next • Most common algorithms are LR (left-to-right scan, rightmost derivation in reverse)

Complexity • Parsing algorithms that work for any unambiguous grammar are complicated and inefficient • O(n^3) in the length n of the input • Trade generality for efficiency • Parsers in commercial compilers handle only a subset of grammars, but run in O(n)

Recursive Descent • Parser is made up of a collection of subprograms • One for each non-terminal • Each subprogram is responsible for generating the parse tree rooted at its non-terminal • It pulls tokens from the tokenizer, and leaves the first token that is not part of its rule in nextToken • If multiple rules are associated with the current non-terminal, the correct rule must be determined first

Function Factor

    // <factor> -> id | ( <expr> )
    void factor() {
        if (nextToken == ID_CODE)
            lex();
        else if (nextToken == LEFT_PAREN_CODE) {
            lex();
            expr();
            if (nextToken == RIGHT_PAREN_CODE)
                lex();
            else
                error();
        }
        else
            error(); /* Neither RHS matches */
    }

<ifstmt> ::= if ( <boolexpr> ) <stmt> [else <stmt>]

    void ifstmt() {
        if (nextToken != IF_CODE)
            error();
        else {
            lex();
            if (nextToken != LEFT_PAREN_CODE)
                error();
            else {
                lex();
                boolexpr();
                if (nextToken != RIGHT_PAREN_CODE)
                    error();
                else {
                    lex();
                    statement();
                    if (nextToken == ELSE_CODE) { /* optional else part */
                        lex();
                        statement();
                    }
                }
            }
        }
    }

Grammar Restrictions • Left-recursion is a problem • A ::= A + B • The subprogram for A would call itself before consuming any token, so parsing would never terminate! • In some cases, left-recursion can be eliminated by refactoring the grammar • E ::= E + T | T can be rewritten as: • E ::= T E’ • E’ ::= + T E’ | ε
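The refactored grammar E ::= T E’, E’ ::= + T E’ | ε can be parsed by recursive descent without looping. The following is a minimal sketch, not the slides' own code. It assumes T ::= id (the slide leaves T unspecified) and hypothetical single-character tokens, where 'i' stands for an id and the end of the string marks end of input. The names parse_expr, expr_prime and term are illustrative.

```c
#include <stdbool.h>

/* Hypothetical token stream: each character is one token.
   'i' stands for an id, '+' is plus, '\0' marks end of input. */
static const char *input;   /* remaining input                 */
static char nextToken;      /* current lookahead token         */
static bool ok;             /* cleared on the first syntax error */

/* Load the next token into nextToken. */
static void lex(void) {
    nextToken = *input;
    if (*input != '\0')
        input++;
}

static void error(void) { ok = false; }

/* T ::= id   (assumed for this sketch) */
static void term(void) {
    if (nextToken == 'i')
        lex();
    else
        error();
}

/* E' ::= + T E' | ε */
static void expr_prime(void) {
    if (nextToken == '+') {
        lex();              /* a token is consumed before recursing */
        term();
        if (ok)
            expr_prime();
    }
    /* otherwise choose the ε alternative and consume nothing */
}

/* E ::= T E' */
static void expr(void) {
    term();
    if (ok)
        expr_prime();
}

/* Returns true when the whole string is a valid E. */
bool parse_expr(const char *text) {
    input = text;
    ok = true;
    lex();                  /* prime the lookahead */
    expr();
    return ok && nextToken == '\0';
}
```

Under these assumptions, parse_expr("i+i+i") should succeed and parse_expr("i+") should fail. Every recursive call in expr_prime happens only after lex() has consumed a '+', which is why the recursion terminates where the left-recursive form would not.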

Grammar restrictions continued • The parser must be able to choose the correct production based on a single next token • The pairwise disjointness test indicates whether or not this choice can be made • The test passes if, for each non-terminal, the FIRST sets of its alternative RHSs (the terminals that can begin a string generated from each RHS) are pairwise disjoint
• Example 1: A ::= aB | bAb | Bb, B ::= cB | d • FIRST sets of the A alternatives: {a} {b} {c, d} • Disjoint, recursive descent parsable
• Example 2: A ::= aB | Bab, B ::= aB | b • FIRST sets of the A alternatives: {a} {a, b} • Not disjoint, not recursive descent parsable
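The pairwise disjointness test on the two example grammars can be sketched in code. This is an illustrative sketch under stated assumptions: symbols are single characters (uppercase for non-terminals, lowercase for terminals), no RHS is empty, and FIRST sets are stored as bit masks over 'a' to 'z'. The Rule type and the names pairwise_disjoint, grammar1 and grammar2 are hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    char lhs;           /* non-terminal, 'A'..'Z'               */
    const char *rhs;    /* non-empty string of grammar symbols  */
} Rule;

/* FIRST set of one RHS: the first symbol itself if it is a terminal,
   otherwise the FIRST set computed so far for that non-terminal. */
static uint32_t first_of_rhs(const char *rhs, const uint32_t first[26]) {
    unsigned char s = (unsigned char)rhs[0];
    return islower(s) ? (uint32_t)1 << (s - 'a') : first[s - 'A'];
}

bool pairwise_disjoint(const Rule *rules, size_t n) {
    uint32_t first[26] = {0};
    bool changed = true;

    /* Fixed-point iteration: keep merging until no FIRST set grows. */
    while (changed) {
        changed = false;
        for (size_t i = 0; i < n; i++) {
            uint32_t *f = &first[rules[i].lhs - 'A'];
            uint32_t merged = *f | first_of_rhs(rules[i].rhs, first);
            if (merged != *f) {
                *f = merged;
                changed = true;
            }
        }
    }

    /* Compare every pair of alternatives of the same non-terminal. */
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (rules[i].lhs == rules[j].lhs &&
                (first_of_rhs(rules[i].rhs, first) &
                 first_of_rhs(rules[j].rhs, first)) != 0)
                return false;
    return true;
}

/* A ::= aB | bAb | Bb    B ::= cB | d */
static const Rule grammar1[] = {
    {'A', "aB"}, {'A', "bAb"}, {'A', "Bb"}, {'B', "cB"}, {'B', "d"}
};

/* A ::= aB | Bab    B ::= aB | b */
static const Rule grammar2[] = {
    {'A', "aB"}, {'A', "Bab"}, {'B', "aB"}, {'B', "b"}
};
```

With these definitions, pairwise_disjoint(grammar1, 5) should report true and pairwise_disjoint(grammar2, 4) should report false, matching the two examples. A grammar with ε rules would need FOLLOW sets as well, which this sketch does not attempt.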

Table driven parsers • Encode production choice in a table • Rows indicate current top of the stack • Columns for each input token • Entry in matrix gives production number • Preferred for large grammars • Algorithm is fixed • Only table size grows
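A table-driven top-down parser along these lines might look as follows. This is a sketch under assumptions, not a reproduction of any table from the slides: it uses the small grammar 1. E ::= T E’, 2. E’ ::= + T E’, 3. E’ ::= ε, 4. T ::= id (rule 4 is assumed), writes E’ as 'P', and uses hypothetical single-character tokens where 'i' stands for an id. The name ll1_parse is illustrative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* RHS of each numbered production; index 0 is unused.
   'P' stands for E', and "" is the ε production. */
static const char *const rhs[] = { "", "TP", "+TP", "", "i" };

/* Row index for a non-terminal, or -1 for any other stack symbol. */
static int row(char x) {
    return x == 'E' ? 0 : x == 'P' ? 1 : x == 'T' ? 2 : -1;
}

/* Column index for an input token; '\0' plays the role of $. */
static int col(char a) {
    return a == 'i' ? 0 : a == '+' ? 1 : a == '\0' ? 2 : -1;
}

/* Entry = production number to apply, 0 = syntax error. */
static const int table[3][3] = {
    /*         i  +  $ */
    /* E  */ { 1, 0, 0 },
    /* E' */ { 0, 2, 3 },
    /* T  */ { 4, 0, 0 },
};

bool ll1_parse(const char *tokens) {
    char stack[128];
    size_t top = 0;
    size_t pos = 0;

    stack[top++] = '$';         /* bottom-of-stack marker */
    stack[top++] = 'E';         /* start symbol           */

    while (top > 0) {
        char x = stack[--top];  /* current top of the stack */
        char a = tokens[pos];   /* next input token         */
        int r = row(x);

        if (r < 0) {            /* terminal or end marker on the stack */
            if (x == '$')
                return a == '\0';
            if (x != a)
                return false;
            pos++;              /* match: consume the token */
            continue;
        }

        int c = col(a);
        int prod = c < 0 ? 0 : table[r][c];
        if (prod == 0)
            return false;

        /* Replace the non-terminal by its RHS, pushed right to left. */
        size_t len = strlen(rhs[prod]);
        if (top + len > sizeof stack)
            return false;
        for (size_t k = len; k > 0; k--)
            stack[top++] = rhs[prod][k - 1];
    }
    return false;
}
```

The loop in ll1_parse never mentions the grammar directly. A larger grammar would change only rhs, row, col and table, which is the sense in which the algorithm is fixed and only the table grows.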

Bottom-up Parsing • Often called shift-reduce algorithms • Integral piece of every bottom-up parser is a stack • Shift moves the next input token onto the stack • Reduce replaces a RHS on the top of the stack with the corresponding LHS • Most bottom-up parsing algorithms are variations of the LR process • Originally designed by Donald Knuth • Relatively small program and a parsing table

Advantages of LR Parsers • Work for nearly all grammars that describe programming languages • Work on a larger class of grammars than other bottom-up algorithms, but are as efficient as any other bottom-up parser • Detect syntax errors as soon as it is possible to do so in a left-to-right scan • The LR class of grammars is a superset of the class parsable by LL parsers

Disadvantage • For anything but very small grammars, it is difficult to produce by hand the parsing table • But this is exactly what tools like yacc and bison can do for us automatically! • Original version was computationally intensive (both in terms of time and memory) • Variations developed: • Less computer resources required • Not as general

Key Insight • A bottom-up parser can use the entire history of the parse, up to the current point, to make parsing decisions • There are only a finite and relatively small number of different parse situations that could have occurred, so the history can be stored in a parser state, on the parse stack

Parser Configuration • Made up of both the stack, and the input • For each state on the stack, there is an associated grammar symbol • E.g. (S_0 X_1 S_1 X_2 S_2 … X_m S_m, a_i a_{i+1} … a_n $) where S_i indicates a state, and X_i indicates a grammar symbol • Initial configuration: (S_0, a_0 … a_n $)

Table driven bottom up parsing • Table has two components: • ACTION table • Specifies the action of the parser, given the parser state and the next token • Rows are state names • Columns are terminals • GOTO table • Specifies state to put in the stack after a reduce operation • Rows are state names • Columns are non-terminals

Structure of an LR parser

Parser actions • If ACTION[S_m, a_i] = Shift S, the next configuration is: (S_0 X_1 S_1 X_2 S_2 … X_m S_m a_i S, a_{i+1} … a_n $) • If ACTION[S_m, a_i] = Reduce A → β and S = GOTO[S_{m-r}, A], where r = the length of β, the next configuration is (S_0 X_1 S_1 X_2 S_2 … X_{m-r} S_{m-r} A S, a_i a_{i+1} … a_n $) • If ACTION[S_m, a_i] = Accept, the parse is complete and no errors were found. • If ACTION[S_m, a_i] = Error, the parser calls an error-handling routine.

Example LR Parsing Table • 1. E ::= E + T • 2. E ::= T • 3. T ::= T * F • 4. T ::= F • 5. F ::= ( E ) • 6. F ::= id

Trace of parse of id + id * id
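The table and trace figures for these slides did not survive in the transcript. The sketch below shows how such a trace can be regenerated for the six-rule grammar above. It assumes the SLR table that compiler textbooks commonly give for this grammar (states 0 to 11), which may not be numbered the same way as the table on the original slide. The encoding (positive = shift to that state, negative = reduce by that rule, ACC = accept, 0 = error), the single-character tokens with 'i' standing for id, and the name lr_parse are all illustrative choices. Only states are kept on the stack, since each state implies its grammar symbol.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

enum { ACC = 100 };   /* accept marker; 0 means error */

/* ACTION table (assumed textbook SLR table for the six rules). */
static const int action[12][6] = {
    /*          id   +   *   (   )   $   */
    /*  0 */ {   5,  0,  0,  4,  0,  0  },
    /*  1 */ {   0,  6,  0,  0,  0, ACC },
    /*  2 */ {   0, -2,  7,  0, -2, -2  },
    /*  3 */ {   0, -4, -4,  0, -4, -4  },
    /*  4 */ {   5,  0,  0,  4,  0,  0  },
    /*  5 */ {   0, -6, -6,  0, -6, -6  },
    /*  6 */ {   5,  0,  0,  4,  0,  0  },
    /*  7 */ {   5,  0,  0,  4,  0,  0  },
    /*  8 */ {   0,  6,  0,  0, 11,  0  },
    /*  9 */ {   0, -1,  7,  0, -1, -1  },
    /* 10 */ {   0, -3, -3,  0, -3, -3  },
    /* 11 */ {   0, -5, -5,  0, -5, -5  },
};

/* GOTO table; columns are E, T, F. Unused entries are 0. */
static const int go_to[12][3] = {
    {1, 2, 3}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {8, 2, 3}, {0, 0, 0},
    {0, 9, 3}, {0, 0, 10}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0}, {0, 0, 0},
};

/* Per rule 1..6: length of the RHS and column of the LHS in go_to. */
static const int rhs_len[7] = {0, 3, 1, 3, 1, 3, 1};
static const int lhs_col[7] = {0, 0, 0, 1, 1, 2, 2};

static int column(char t) {
    switch (t) {
    case 'i':  return 0;
    case '+':  return 1;
    case '*':  return 2;
    case '(':  return 3;
    case ')':  return 4;
    case '\0': return 5;    /* end of input plays the role of $ */
    default:   return -1;
    }
}

bool lr_parse(const char *tokens, bool trace) {
    int stack[128];
    int top = 0;
    size_t pos = 0;

    stack[0] = 0;           /* initial configuration: state 0 only */
    for (;;) {
        int c = column(tokens[pos]);
        int act = c < 0 ? 0 : action[stack[top]][c];

        if (act == ACC) {
            if (trace) printf("accept\n");
            return true;
        } else if (act > 0) {           /* shift */
            if (top + 1 >= 128) return false;
            stack[++top] = act;
            pos++;
            if (trace) printf("shift %d\n", act);
        } else if (act < 0) {           /* reduce by rule -act */
            int rule = -act;
            top -= rhs_len[rule];       /* pop one state per RHS symbol */
            int next = go_to[stack[top]][lhs_col[rule]];
            stack[++top] = next;
            if (trace) printf("reduce %d, goto %d\n", rule, next);
        } else {
            return false;               /* error entry */
        }
    }
}
```

Calling lr_parse("i+i*i", true) prints one line per parser action for id + id * id. In that run the parser shifts * onto the stack before reducing the addition, so the multiplication is grouped first.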
