Functional Design and Programming

Lecture 9: Lexical analysis and parsing

Literature
• Paulson, chap. 9:
  • Lexical analysis (9.1)
  • Functional parsing (9.2-9.4)

Exercises
• Paulson, chap. 9:
  • 9.1-9.2
  • 9.3-9.6, 9.8
• Write a parser for XML elements (see home page).

Parsing/Unparsing
• Purpose: Parsing decodes structured data from a flat (string) representation; unparsing encodes it back into one.
• Reasons:
  • Data is read (and written) using operating system routines (“read 25 bytes from file XYZ”).
  • A universal format is needed for all kinds of data; e.g., to allow editing with a text editor.

Language processor architecture
[Figure: processing pipeline]
• character stream: “<H1 > My title</ H1>”
• scanner produces a token stream: [LANGLE, ID “H1”, RANGLE, ID “ My title”, LSLASH, ID “ H1”, RANGLE]
• parser produces an abstract syntax tree: element with stag “H1”, contents “ My title”, etag “H1”
• transformer(s) produce an abstract syntax tree: ... “MY TITLE” ...
• unparser produces a character stream: “<H1> MY TITLE </H1>”

Lexical analysis (Scanning, lexing, tokenizing)
• Purpose: Turning a character stream into a stream of tokens.
• Reasons:
  • Making parsing easier by taking care of ‘low-level’ concerns such as eliminating whitespace.
  • Efficient preprocessing and compression of the input to the parser.
  • Unbounded lookahead into the input stream (in contrast to most parsers).
  • Well-founded theoretical basis and tool support (regular expressions and finite state machines).
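As an illustration of the purpose stated above, here is a minimal hand-written scanner sketch in Standard ML. The token type (Key, Id, Num), the helper span and the function scan are assumptions made for this example; they are not taken from Paulson or from the slides, and the sketch works on lists rather than true streams and does not use regular-expression tools.

```sml
(* Hypothetical token type: symbols/keywords, identifiers, integer constants. *)
datatype token = Key of string | Id of string | Num of int

(* Split a character list into the longest prefix satisfying p and the rest. *)
fun span p [] = ([], [])
  | span p (c :: cs) =
      if p c then let val (ys, zs) = span p cs in (c :: ys, zs) end
      else ([], c :: cs)

(* Turn a character list into a token list, dropping whitespace.
   Any character that is not a letter, digit or space becomes a Key. *)
fun scan [] = []
  | scan (c :: cs) =
      if Char.isSpace c then scan cs
      else if Char.isAlpha c then
        let val (name, rest) = span Char.isAlphaNum (c :: cs)
        in Id (implode name) :: scan rest end
      else if Char.isDigit c then
        let val (digits, rest) = span Char.isDigit (c :: cs)
        in Num (valOf (Int.fromString (implode digits))) :: scan rest end
      else Key (str c) :: scan cs

val tokens = scan (explode "x + y / 15 - x * x")
```

Here tokens is the list [Id "x", Key "+", Id "y", Key "/", Num 15, Key "-", Id "x", Key "*", Id "x"]; the whitespace has disappeared, so a parser never has to deal with it.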

Context-free Grammars (CFGs)
• A context-free grammar G describes a language (set of strings)
• G = (T, N, P, S) where
  • T: set of terminal symbols
  • N: set of nonterminal symbols
  • P: set of productions
  • S: start symbol (a particular nonterminal symbol)

CFGs: Example
T = { +, -, *, /, (, ), Var, Const }
N = { Exp, Term, Factor }
S = Exp
Exp ::= Exp + Term | Exp - Term | Term
Term ::= Term * Factor | Term / Factor | Factor
Factor ::= Var | Const | ( Exp )

CFGs: Example...
[Figure: parse tree, with interior nodes labelled Exp, Term and Factor, for the token stream [Var, +, Var, /, Const, -, Var, *, Var] obtained from “x + y / 15 - x * x”]

Parsing
• Purpose: Turning a stream of tokens into the tree structure described by a grammar
• Reasons:
  • Checking that the input is well-formed (according to the given grammar)
  • Producing a parse tree or abstract syntax tree to recover the tree structure in the input
  • Processing the parse tree according to the grammar

Parsing combinators
• Idea: For each terminal or nonterminal M there is a function:
  • fM : token list -> T * token list (= T phrase)
  • such that fM takes tokens from the front of its argument until it has reduced them to M
  • and then produces a value of type T for M, together with the remaining tokens.

Parsing primitives
• Terminals:
  • Var: string phrase
  • Const: int phrase
  • $: string -> string phrase (for keywords)

Parsing primitives...
• Parsing combinators:
  • empty: ('a list) phrase
  • ||: 'a phrase * 'a phrase -> 'a phrase
  • --: 'a phrase * 'b phrase -> ('a * 'b) phrase
  • >>: 'a phrase * ('a -> 'b) -> 'b phrase
• Derived combinators:
  • repeat: 'a phrase -> 'a list phrase
  • $--: 'a phrase * 'b phrase -> 'b phrase
  • --$: 'a phrase * 'b phrase -> 'a phrase

Parsing precedences
    infix 6 $-- --$
    infix 5 --
    infix 3 >>
    infix 0 ||
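The signatures and precedences above can be realised in a few lines. The following is a sketch only, in the style of Paulson's chapter 9 but not copied from it: the token type, the exception SyntaxErr and all function bodies are assumptions chosen to match the types listed on the slides, with failure signalled by an exception.

```sml
datatype token = Key of string | Id of string | Num of int
exception SyntaxErr of string

(* A phrase consumes a prefix of the token list and returns a value
   together with the remaining tokens. *)
type 'a phrase = token list -> 'a * token list

infix 6 $-- --$
infix 5 --
infix 3 >>
infix 0 ||

(* Terminals *)
fun Var (Id x :: toks) = (x, toks)
  | Var _ = raise SyntaxErr "identifier expected"
fun Const (Num n :: toks) = (n, toks)
  | Const _ = raise SyntaxErr "constant expected"
fun $ k (Key k' :: toks) =
      if k = k' then (k, toks) else raise SyntaxErr ("expected " ^ k)
  | $ k _ = raise SyntaxErr ("expected " ^ k)

(* Primitive combinators *)
fun empty toks = ([], toks)
fun (p || q) toks = p toks handle SyntaxErr _ => q toks   (* alternative *)
fun (p -- q) toks =                                       (* sequence *)
  let val (x, toks2) = p toks
      val (y, toks3) = q toks2
  in ((x, y), toks3) end
fun (p >> f) toks =                                       (* apply f to result *)
  let val (x, toks2) = p toks in (f x, toks2) end

(* Derived combinators *)
fun (p $-- q) toks = (p -- q >> (fn (_, y) => y)) toks    (* keep right result *)
fun (p --$ q) toks = (p -- q >> (fn (x, _) => x)) toks    (* keep left result *)
fun repeat p toks = (p -- repeat p >> (op ::) || empty) toks

(* Usage: zero or more occurrences of "+" followed by an identifier. *)
val (names, rest) =
  repeat ($"+" $-- Var) [Key "+", Id "a", Key "+", Id "b", Key ")"]
```

With these definitions names is ["a", "b"] and rest is [Key ")"]. The derived combinators take an explicit toks argument so that the recursive call in repeat is only evaluated when a token list is actually supplied.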

Problems with combinatory parsers
• Left-recursion:
  • Problem: Left-recursive grammars make parsers go into an infinite loop.
  • Remedy: Transform the grammar to eliminate left-recursion, e.g. replace Exp ::= Exp + Term | Exp - Term | Term by "a Term followed by zero or more (+ or -) Term pairs".
• Mutual recursion:
  • Problem (SML-specific!): Mutually recursive parsers cannot be defined using val-declarations and combinator applications only.
  • Remedy: Use fun-declarations for the mutually recursive parts of a grammar

Data type for abstract syntax trees
    type binop = string
    datatype expAST = EXP of termAST * (binop * termAST) list
    and termAST = TERM of factorAST * (binop * factorAST) list
    and factorAST = VAR of string | CONST of int | PARENEXP of expAST

Parser: example (first try)
    val binop1 = $"+" || $"-"
    val binop2 = $"*" || $"/"
    val factor = Var >> VAR
              || Const >> CONST
              || $"(" $-- exp --$ $")" >> PARENEXP
    val term = factor -- repeat (binop2 -- factor) >> TERM
    val exp = term -- repeat (binop1 -- term) >> EXP
PROBLEM: Doesn’t work! These definitions are intended to be mutually recursive, but are not: val-declarations are not recursive, so exp is not yet in scope where factor uses it.

Parser: example (second try)
    val binop1 = $"+" || $"-"
    val binop2 = $"*" || $"/"
    fun factor toks = (   Var >> VAR
                       || Const >> CONST
                       || $"(" $-- exp --$ $")" >> PARENEXP ) toks
    and term toks = (factor -- repeat (binop2 -- factor) >> TERM) toks
    and exp toks = (term -- repeat (binop1 -- term) >> EXP) toks
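To see the second try run end to end, the sketch below puts it together with one possible implementation of the primitives. The token type, the exception SyntaxErr and the bodies of the combinators are assumptions for this example (the slides give only their types); the data type and the parser definitions follow the slides.

```sml
datatype token = Key of string | Id of string | Num of int
exception SyntaxErr of string

infix 6 $-- --$
infix 5 --
infix 3 >>
infix 0 ||

(* Assumed implementations of the primitives (not from the slides). *)
fun Var (Id x :: toks) = (x, toks)
  | Var _ = raise SyntaxErr "identifier expected"
fun Const (Num n :: toks) = (n, toks)
  | Const _ = raise SyntaxErr "constant expected"
fun $ k (Key k' :: toks) =
      if k = k' then (k, toks) else raise SyntaxErr ("expected " ^ k)
  | $ k _ = raise SyntaxErr ("expected " ^ k)
fun empty toks = ([], toks)
fun (p || q) toks = p toks handle SyntaxErr _ => q toks
fun (p -- q) toks =
  let val (x, toks2) = p toks
      val (y, toks3) = q toks2
  in ((x, y), toks3) end
fun (p >> f) toks = let val (x, toks2) = p toks in (f x, toks2) end
fun (p $-- q) toks = (p -- q >> (fn (_, y) => y)) toks
fun (p --$ q) toks = (p -- q >> (fn (x, _) => x)) toks
fun repeat p toks = (p -- repeat p >> (op ::) || empty) toks

(* Abstract syntax, as on the slides. *)
type binop = string
datatype expAST = EXP of termAST * (binop * termAST) list
and termAST = TERM of factorAST * (binop * factorAST) list
and factorAST = VAR of string | CONST of int | PARENEXP of expAST

val binop1 = $"+" || $"-"
val binop2 = $"*" || $"/"

(* fun ... and ... makes the three parsers mutually recursive. *)
fun factor toks =
      (   Var >> VAR
       || Const >> CONST
       || $"(" $-- exp --$ $")" >> PARENEXP ) toks
and term toks = (factor -- repeat (binop2 -- factor) >> TERM) toks
and exp toks = (term -- repeat (binop1 -- term) >> EXP) toks

(* Tokens for "x + y * 2" *)
val (tree, rest) = exp [Id "x", Key "+", Id "y", Key "*", Num 2]
```

Here tree is EXP (TERM (VAR "x", []), [("+", TERM (VAR "y", [("*", CONST 2)]))]) and rest is empty, so the multiplication ends up nested inside the second term, as the grammar intends.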

Operator precedence parsing (overview)
• When processing operator expressions, a parser has to decide whether to reduce (stop the current phrase parser and return its result) or shift (continue the current phrase parse)
• Operator precedence parsing: Associate a precedence (binding strength) with each operator, remember the precedence of the last operator processed, and determine whether to reduce or shift depending on the precedence of the next operator.
• See Paulson, pp. 364-366
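The shift/reduce decision can be sketched as follows. This is a "precedence climbing" variant written for illustration, not the function given in Paulson; the token type, the tree type, the precedence table prec and the names atom, parseExp and loop are all assumptions. All operators are treated as left-associative.

```sml
datatype token = Key of string | Id of string | Num of int
exception SyntaxErr of string

datatype tree = Leaf of token | Node of string * tree * tree

(* Hypothetical precedence table: larger numbers bind more tightly,
   0 means "not an operator". *)
fun prec "+" = 1
  | prec "-" = 1
  | prec "*" = 2
  | prec "/" = 2
  | prec _ = 0

(* Operands are identifiers and constants only in this sketch. *)
fun atom (Id x :: toks) = (Leaf (Id x), toks)
  | atom (Num n :: toks) = (Leaf (Num n), toks)
  | atom _ = raise SyntaxErr "operand expected"

(* parseExp minPrec parses an operand and then extends it with operators
   whose precedence is at least minPrec; minPrec records how tightly the
   last operator processed binds. *)
fun parseExp minPrec toks =
  let val (left, rest) = atom toks
  in loop minPrec left rest end
and loop minPrec left (Key k :: toks) =
      if prec k > 0 andalso prec k >= minPrec then
        (* shift: the next operator binds tightly enough, so continue *)
        let val (right, rest) = parseExp (prec k + 1) toks
        in loop minPrec (Node (k, left, right)) rest end
      else
        (* reduce: return what has been built and leave k for the caller *)
        (left, Key k :: toks)
  | loop _ left toks = (left, toks)

(* Tokens for "x + y / 15 - x * x" *)
val (tree, rest) =
  parseExp 1 [Id "x", Key "+", Id "y", Key "/", Num 15,
              Key "-", Id "x", Key "*", Id "x"]
```

The resulting tree groups the input as (x + (y / 15)) - (x * x). Passing prec k + 1 for the right operand is what makes operators of equal precedence associate to the left.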

Backtracking parsing (overview)
• There may be more than one way of parsing an expression.
• Backtracking parsing: Construct a lazy list of all possible parses of a token stream. Continue the parse with the first of those and try to find a complete parse for the whole token stream; if that fails, backtrack to the second in the list and repeat.
• See Paulson, pp. 366-367
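A small sketch of the idea, with one simplification stated up front: it uses ordinary (eager) ML lists instead of lazy lists, so all parses are computed even if only the first is needed. Tokens are plain strings here, and the names tok, complete, a and s, as well as the toy grammar, are assumptions for this example.

```sml
(* A backtracking parser returns every way of parsing a prefix of the input. *)
type 'a parser = string list -> ('a * string list) list

infix 5 --
infix 3 >>
infix 0 ||

(* Terminal: succeeds in exactly one way if the next token is k. *)
fun tok k (t :: ts) = if k = t then [(k, ts)] else []
  | tok k [] = []

(* Alternatives: the parses of p followed by the parses of q. *)
fun (p || q) toks = p toks @ q toks

(* Sequencing: continue every parse of p with every parse of q. *)
fun (p -- q) toks =
  List.concat
    (map (fn (x, rest) => map (fn (y, rest') => ((x, y), rest')) (q rest))
         (p toks))

fun (p >> f) toks = map (fn (x, rest) => (f x, rest)) (p toks)

(* Toy grammar:  A ::= a | a a      S ::= A b
   The semantic value of A records which alternative was used. *)
val a = tok "a" >> (fn _ => 1) || tok "a" -- tok "a" >> (fn _ => 2)
val s = a -- tok "b" >> (fn (n, _) => n)

(* Keep only the parses that consume the whole input. *)
fun complete results = List.filter (fn (_, rest) => null rest) results

val parses = complete (s ["a", "a", "b"])
```

On the input a a b the first alternative of A consumes one a and then fails to find b; the second alternative succeeds, so parses is [(2, [])]. An exception-based || that commits to the first successful alternative would reject this input.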

Recursive-descent parsing (overview)
• Write one parser for each grammatical category (as in combinatory parsing)
• Process the token stream as in combinatory parsers, except for alternatives.
• Process alternatives as follows:
  • Look at the next token (first token of the remaining token stream).
  • Choose the phrase parser on the basis of that token.
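The steps above can be sketched for the expression grammar as follows. The token type, the exception SyntaxErr and the helper functions factors and terms are assumptions for this example; the data type is the one from the slides. Every choice between alternatives is made by pattern matching on the next token only.

```sml
datatype token = Key of string | Id of string | Num of int
exception SyntaxErr of string

type binop = string
datatype expAST = EXP of termAST * (binop * termAST) list
and termAST = TERM of factorAST * (binop * factorAST) list
and factorAST = VAR of string | CONST of int | PARENEXP of expAST

(* One function per grammatical category; the first token decides
   which alternative is taken. *)
fun factor (Id x :: toks) = (VAR x, toks)
  | factor (Num n :: toks) = (CONST n, toks)
  | factor (Key "(" :: toks) =
      (case exp toks of
         (e, Key ")" :: rest) => (PARENEXP e, rest)
       | _ => raise SyntaxErr "closing parenthesis expected")
  | factor _ = raise SyntaxErr "factor expected"

(* Zero or more (binop2, factor) pairs. *)
and factors (Key k :: toks) =
      if k = "*" orelse k = "/" then
        let val (f, rest) = factor toks
            val (fs, rest') = factors rest
        in ((k, f) :: fs, rest') end
      else ([], Key k :: toks)
  | factors toks = ([], toks)

and term toks =
  let val (f, rest) = factor toks
      val (fs, rest') = factors rest
  in (TERM (f, fs), rest') end

(* Zero or more (binop1, term) pairs. *)
and terms (Key k :: toks) =
      if k = "+" orelse k = "-" then
        let val (t, rest) = term toks
            val (ts, rest') = terms rest
        in ((k, t) :: ts, rest') end
      else ([], Key k :: toks)
  | terms toks = ([], toks)

and exp toks =
  let val (t, rest) = term toks
      val (ts, rest') = terms rest
  in (EXP (t, ts), rest') end

(* Tokens for "(x + 1) * y" *)
val (tree, rest) =
  exp [Key "(", Id "x", Key "+", Num 1, Key ")", Key "*", Id "y"]
```

Because no alternative is ever retried, this parser does no backtracking; the price is that the grammar must allow the choice to be made from a single token of lookahead.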

LL-parsing and LR-parsing (overview)
• Use tools to generate parsers from grammar specifications.
• The tool produces a table that guides a push-down automaton through parsing actions (“shift”, “reduce”)
• LL-parsing: Predictive (basically recursive-descent parsing in table-driven form)
• LR-parsing (incl. SLR- and LALR-parsing): (Virtual) parallel execution of phrase parsers.
• Problems: Lookahead is bounded in practice; at times unwieldy.
