Table-driven parsing
• Parsing performed by a finite state machine, augmented by a stack.
• FSM driven by table(s) generated automatically from the grammar.
• Diagram: a language generator produces the tables; the parser reads the input, driven by the tables and a stack.
Pushdown Automata
• A context-free grammar can be recognized by a finite state machine with a stack: a PDA.
• The PDA is defined by a set of internal states and a transition table.
• The PDA can read the input and read/write on the stack.
• The actions of the PDA are determined by its current state, the current top of the stack, and the current input symbol.
• The actions consist of switching to a new state and rewriting the top of the stack, i.e. A → B1 … Bk.
• There are three distinguished states:
• start state: nothing seen
• accept state: sentence complete
• error state: current symbol doesn't belong
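A minimal sketch of such a machine, as a table-driven recognizer for balanced parentheses; the state names, tokens and table entries below are my own illustration, not taken from these slides:

    # Hypothetical toy PDA: balanced parentheses.
    # Transition table: (state, input symbol, stack top) -> (new state, rewrite of the top)
    PDA_TABLE = {
        ('start', '(', '$'): ('run', ['$', '(']),     # push the first '('
        ('run',   '(', '('): ('run', ['(', '(']),     # push another '('
        ('run',   ')', '('): ('run', []),             # pop the matching '('
        ('run',   '$', '$'): ('accept', ['$']),       # input exhausted, nothing pending
        ('start', '$', '$'): ('accept', ['$']),       # empty input is balanced
    }

    def run(tokens):
        state, stack = 'start', ['$']                 # start state: nothing seen
        for tok in tokens + ['$']:                    # '$' marks the end of the input
            key = (state, tok, stack[-1])
            if key not in PDA_TABLE:
                return 'error'                        # current symbol doesn't belong
            state, rewrite = PDA_TABLE[key]
            stack[-1:] = rewrite                      # rewrite the top of the stack
        return state

    print(run(['(', '(', ')', ')']))                  # accept
    print(run(['(', ')', ')']))                       # error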
Top-down parsing
• Parse tree is synthesized from the root (sentence symbol).
• Stack contains symbols of rhs of current production, and pending non-terminals.
• Automaton is trivial (no need for explicit states).
• Transition table indexed by grammar symbol G and input symbol a. Entries in table are terminals or productions: P → ABC…
Top-down parsing
• Actions:
• initially, stack contains sentence symbol S.
• At each step, let X be the symbol on top of the stack, and a be the next token on input.
• if X = a, read token, pop X from stack.
• if M (X, a) is production X ::= ABC…, remove X from stack, push the symbols A, B, C on the stack (A on top).
• If stack is empty and a is the end of file, accept.
• Otherwise, signal error.
• Semantic action: when starting a production, build tree node for non-terminal, attach to parent.
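A minimal sketch of this driver loop, assuming a hand-written LL (1) table for the toy grammar S → ( S ) S | ε (grammar, token names and table are my own example; tree building is omitted):

    # Hypothetical LL(1) table: (non-terminal, lookahead) -> right-hand side to push.
    LL_TABLE = {
        ('S', '('): ['(', 'S', ')', 'S'],
        ('S', ')'): [],                        # S -> epsilon
        ('S', '$'): [],                        # S -> epsilon
    }

    def ll_parse(tokens):
        tokens = tokens + ['$']                # end-of-file marker
        stack = ['$', 'S']                     # sentence symbol S on top
        pos = 0
        while stack:
            x, a = stack.pop(), tokens[pos]
            if x == a:                         # X = a: read token, pop X
                pos += 1
            elif (x, a) in LL_TABLE:           # expand: pop X, push rhs, leftmost symbol on top
                stack.extend(reversed(LL_TABLE[(x, a)]))
            else:
                return False                   # error entry
        return pos == len(tokens)              # stack empty and input consumed

    print(ll_parse(['(', '(', ')', ')']))      # True
    print(ll_parse(['(', ')', ')']))           # False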
Table-driven parsing and recursive descent parsing
• Recursive descent: every production is a procedure. Call stack holds active procedures corresponding to pending non-terminals.
• Stack still needed for context-sensitive legality checks, error messages, etc.
• Table-driven parser: recursion simulated with explicit stack.
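For comparison, a sketch of a recursive descent recognizer for the non-left-recursive expression grammar used on the next slides; the procedure and token names are mine, and no tree is built:

    # Hypothetical recursive descent recognizer: one procedure per non-terminal.
    def parse_expression(tokens):
        tokens = tokens + ['$']
        pos = 0

        def expect(t):
            nonlocal pos
            if tokens[pos] != t:
                raise SyntaxError(f'expected {t}, found {tokens[pos]}')
            pos += 1

        def E():                               # E -> T E'
            T(); Eprime()

        def Eprime():                          # E' -> + T E' | epsilon
            if tokens[pos] == '+':
                expect('+'); T(); Eprime()

        def T():                               # T -> F T'
            F(); Tprime()

        def Tprime():                          # T' -> * F T' | epsilon
            if tokens[pos] == '*':
                expect('*'); F(); Tprime()

        def F():                               # F -> id | ( E )
            if tokens[pos] == 'id':
                expect('id')
            else:
                expect('('); E(); expect(')')

        E()
        expect('$')                            # whole input must be consumed

    parse_expression(['id', '+', 'id', '*', 'id'])   # accepts silently; raises on error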
Building the parse table
• Define two functions on the symbols of the grammar: FIRST and FOLLOW.
• For a non-terminal N, FIRST (N) is the set of terminal symbols that can start any derivation from N.
• First (If_Statement) = { if }
• First (Expr) = { id, ( }
• FOLLOW (N) is the set of terminals that can appear after a string derived from N:
• Follow (Expr) = { +, ), $ }
Computing FIRST (N)
• If N → ε, First (N) includes ε
• if N → aABC, First (N) includes a
• if N → AB, First (N) includes First (A)
• if N → AB… and A → ε, First (N) includes First (B)
• Obvious generalization to First (α) where α is AB…
Computing First (N)
• Grammar for expressions, without left-recursion:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → id | (E)
• First (F) = { id, ( }
• First (T') = { *, ε }   First (T) = { id, ( }
• First (E') = { +, ε }   First (E) = { id, ( }
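A sketch of this computation as a fixpoint iteration over the grammar above; the representation ('eps' for ε, symbol lists for right-hand sides) is my own:

    EPS = 'eps'                                # stands for the empty string
    grammar = {
        'E':  [['T', "E'"]],
        "E'": [['+', 'T', "E'"], [EPS]],
        'T':  [['F', "T'"]],
        "T'": [['*', 'F', "T'"], [EPS]],
        'F':  [['id'], ['(', 'E', ')']],
    }
    nonterminals = set(grammar)

    def first_of_seq(seq, first):
        # First set of a string of symbols A B C ...
        out = set()
        for sym in seq:
            f = first[sym] if sym in nonterminals else {sym}
            out |= f - {EPS}
            if EPS not in f:                   # this symbol cannot vanish: stop here
                return out
        out.add(EPS)                           # every symbol can derive epsilon
        return out

    def first_sets():
        first = {n: set() for n in nonterminals}
        changed = True
        while changed:                         # iterate until nothing new is added
            changed = False
            for n, prods in grammar.items():
                for rhs in prods:
                    new = first_of_seq(rhs, first)
                    if not new <= first[n]:
                        first[n] |= new
                        changed = True
        return first

    print(first_sets()["T'"])                  # {'*', 'eps'}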
Computing Follow (N)
• Follow (N) is computed from productions in which N appears on the rhs
• For the sentence symbol S, Follow (S) includes $
• if A → α N β, Follow (N) includes First (β)
• because an expansion of N will be followed by an expansion from β
• if A → α N, Follow (N) includes Follow (A)
• because N will be expanded in the context in which A is expanded
• if A → α N B and B → ε, Follow (N) includes Follow (A)
Computing Follow (N)
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → id | (E)
• Follow (E) = { ), $ }   Follow (E') = { ), $ }
• Follow (T) = First (E') + Follow (E') = { +, ), $ }
• Follow (T') = Follow (T) = { +, ), $ }
• Follow (F) = First (T') + Follow (T') = { *, +, ), $ }
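A sketch of the corresponding Follow computation, reusing grammar, EPS, nonterminals, first_of_seq and first_sets from the sketch above:

    def follow_sets(first):
        follow = {n: set() for n in nonterminals}
        follow['E'].add('$')                   # the sentence symbol gets $
        changed = True
        while changed:                         # fixpoint iteration, as for First
            changed = False
            for a, prods in grammar.items():
                for rhs in prods:
                    for i, sym in enumerate(rhs):
                        if sym not in nonterminals:
                            continue
                        tail = first_of_seq(rhs[i + 1:], first)
                        new = tail - {EPS}     # A -> alpha N beta: add First(beta)
                        if EPS in tail:        # beta can vanish (or is empty): add Follow(A)
                            new |= follow[a]
                        if not new <= follow[sym]:
                            follow[sym] |= new
                            changed = True
        return follow

    print(follow_sets(first_sets())['F'])      # {'*', '+', ')', '$'}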
Building LL (1) parse tables
• Table indexed by non-terminal and token. Table entry is a production:
for each production P: A → α loop
   for each terminal a in First (α) loop
      M (A, a) := P;
   end loop;
   if ε in First (α) then
      for each terminal b in Follow (A) loop
         M (A, b) := P;
      end loop;
   end if;
end loop;
• All other entries are errors.
• If two assignments conflict, the parse table cannot be built.
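A sketch of this table construction, reusing grammar, EPS, first_of_seq, first_sets and follow_sets from the two previous sketches; the conflict check corresponds to the last bullet:

    def build_ll1_table(first, follow):
        table = {}
        for a, prods in grammar.items():
            for rhs in prods:
                f = first_of_seq(rhs, first)
                targets = f - {EPS}            # terminals in First(alpha)
                if EPS in f:
                    targets |= follow[a]       # epsilon in First(alpha): also use Follow(A)
                for t in targets:
                    if (a, t) in table:        # two assignments conflict: not LL(1)
                        raise ValueError(f'not LL(1): conflict at M[{a}, {t}]')
                    table[(a, t)] = rhs
        return table                           # missing entries are errors

    first = first_sets()
    table = build_ll1_table(first, follow_sets(first))
    print(table[("E'", ')')])                  # ['eps'], i.e. E' -> epsilon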
LL (1) grammars
• If table construction is successful, the grammar is LL (1): left-to-right, leftmost derivation with one-token lookahead.
• If construction fails, can conceive of LL (2), etc.
• Ambiguous grammars are never LL (k)
• If a terminal is in First for two different productions of A, the grammar cannot be LL (1).
• Grammars with left-recursion are never LL (k)
• Some useful constructs are not LL (k)
Bottom-up parsing
• Synthesize tree from fragments
• Automaton performs two actions:
• shift: push next symbol on stack
• reduce: replace symbols on stack
• Automaton synthesizes (reduces) when end of a production is recognized
• States of automaton encode synthesis so far, and expectation of pending non-terminals
• Automaton has potentially large set of states
• Technique more general than LL (k)
LR (k) parsing
• Left-to-right, rightmost derivation with k-token lookahead.
• Most general parsing technique for deterministic grammars.
• In general, not practical: tables too large (10^6 states for C++, Ada).
• Common subsets: SLR, LALR (1).
The states of the LR (0) automaton
• An item is a point within a production, indicating that part of the production has been recognized:
A → α . B β
• seen the expansion of α, expect to see the expansion of B
• A state is a set of items
• Transitions between states are determined by terminals and non-terminals
• Parsing tables are built from the automaton:
• action: shift / reduce depending on next symbol
• goto: change state depending on synthesized non-terminal
Building LR (0) states
• If a state includes: A → α . B β
• it also includes every item at the start of a production for B: B → . X Y Z
• Informally: if I expect to see B next, I expect to see anything that B can start with, and so on: X → . G H I
• States are built by closure from individual items.
A grammar of expressions: initial state
• E' → E
• E → E + T | T; -- left-recursion ok here.
• T → T * F | F;
• F → id | (E)
• S0 = { E' → .E, E → .E + T, E → .T, F → .id, F → .( E ), T → .T * F, T → .F }
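A sketch of closure over items for this grammar; an item is represented as (lhs, rhs, dot position), and all names are my own:

    # Augmented expression grammar from this slide (left-recursive form).
    lr_grammar = {
        "E'": [('E',)],
        'E':  [('E', '+', 'T'), ('T',)],
        'T':  [('T', '*', 'F'), ('F',)],
        'F':  [('id',), ('(', 'E', ')')],
    }

    def closure(items):
        items = set(items)
        work = list(items)
        while work:
            lhs, rhs, dot = work.pop()
            if dot < len(rhs) and rhs[dot] in lr_grammar:   # dot in front of a non-terminal B
                for prod in lr_grammar[rhs[dot]]:           # add every item B -> . X Y Z
                    item = (rhs[dot], prod, 0)
                    if item not in items:
                        items.add(item)
                        work.append(item)
        return items

    s0 = closure({("E'", ('E',), 0)})          # closure of { E' -> . E }
    print(len(s0))                             # 7 items, matching S0 above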
Adding states
• If a state has item A → α . a β, and the next symbol in the input is a, we shift a onto the stack and enter a state that contains item
A → α a . β (as well as all other items brought in by closure)
• if a state has an item A → α . , this indicates the end of a production: reduce action.
• If a state has an item A → α . N β, then after a reduction that finds an N, go to a state with A → α N . β
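The transition on a grammar symbol can then be sketched as a goto function: advance the dot over the symbol in every item that expects it, and close the result (reusing lr_grammar and closure from the previous sketch):

    def goto(items, sym):
        moved = {(lhs, rhs, dot + 1)
                 for (lhs, rhs, dot) in items
                 if dot < len(rhs) and rhs[dot] == sym}
        return closure(moved)

    s0 = closure({("E'", ('E',), 0)})
    print(goto(s0, 'E'))                       # { E' -> E. , E -> E. + T }, i.e. S1 on the next slide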
The LR (0) states for expressions
• S1 = { E' → E., E → E. + T }
• S2 = { E → T., T → T. * F }
• S3 = { T → F. }
• S4 = { F → (. E) } + S0 (by closure)
• S5 = { F → id. }
• S6 = { E → E +. T, T → .T * F, T → .F, F → .id, F → .(E) }
• S7 = { T → T *. F, F → .id, F → .(E) }
• S8 = { F → (E.), E → E. + T }
• S9 = { E → E + T., T → T. * F }
• S10 = { T → T * F. }, S11 = { F → (E). }
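Enumerating the whole LR (0) collection is a worklist over goto; this sketch reuses lr_grammar, closure and goto from the previous sketches:

    def build_states():
        start = frozenset(closure({("E'", ('E',), 0)}))
        states, work, transitions = [start], [start], {}
        while work:
            s = work.pop()
            i = states.index(s)
            symbols = {rhs[dot] for (_, rhs, dot) in s if dot < len(rhs)}
            for sym in symbols:                        # one transition per symbol after a dot
                t = frozenset(goto(s, sym))
                if t not in states:
                    states.append(t)
                    work.append(t)
                transitions[(i, sym)] = states.index(t)
        return states, transitions

    states, transitions = build_states()
    print(len(states))                                 # 12 states: S0 .. S11 above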
Building SLR tables
• An arc between two states labeled with a terminal is a shift action.
• An arc between two states labeled with a non-terminal is a goto action.
• if a state contains an item A → α . (a reduce item), the action is to reduce by this production, for all terminals in Follow (A).
• If there are shift-reduce conflicts or reduce-reduce conflicts, more elaborate techniques are needed.
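Filling the SLR action entries from those states follows the rules above; this sketch reuses lr_grammar and the states/transitions just built, with hand-computed Follow sets for the left-recursive grammar:

    FOLLOW = {"E'": {'$'},
              'E':  {'+', ')', '$'},
              'T':  {'+', '*', ')', '$'},
              'F':  {'+', '*', ')', '$'}}

    def build_slr_actions(states, transitions):
        actions = {}
        def add(key, act):
            if key in actions and actions[key] != act:
                raise ValueError(f'shift-reduce or reduce-reduce conflict at {key}')
            actions[key] = act
        for i, s in enumerate(states):
            for (lhs, rhs, dot) in s:
                if dot < len(rhs) and rhs[dot] not in lr_grammar:
                    add((i, rhs[dot]), ('shift', transitions[(i, rhs[dot])]))   # arc on a terminal
                elif dot == len(rhs) and lhs == "E'":
                    add((i, '$'), ('accept',))                                  # E' -> E . at end of input
                elif dot == len(rhs):
                    for t in FOLLOW[lhs]:                                       # reduce on Follow(lhs)
                        add((i, t), ('reduce', lhs, rhs))
        return actions

    actions = build_slr_actions(states, transitions)
    print(len(actions))                        # fills without conflict: the grammar is SLR(1)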
LR (k) parsing
• Canonical LR (1): annotate each item with its own follow set:
(A → α a . β, f)
• f is a subset of the follow set of A, because it is derived from a single specific production for A
• A state that includes a reduce item (A → α ., f) calls for a reduce action only if the next symbol is in f: fewer reduce actions, fewer conflicts, technique is more powerful than SLR (1)
• Generalization: use sequences of k symbols in f
• Disadvantage: state explosion; impractical in general, even for LR (1)
LALR (1)
• Compute follow sets for a small set of items
• Tables no bigger than SLR (1)
• Almost the same power as LR (1), slightly worse error diagnostics
• Incorporated into yacc, bison, etc.