Introduction to Parsing
Outline • Regular languages revisited • Parser overview • Context-free grammars (CFG’s) • Derivations
Languages and Automata • Formal languages are very important in CS • Especially in programming languages • Regular languages • The weakest formal languages widely used • Many applications • We will also study context-free languages
Limitations of Regular Languages • Intuition: A finite automaton that runs long enough must repeat states • Finite automaton can’t remember # of times it has visited a particular state • Finite automaton has finite memory • Only enough to store in which state it is • Cannot count, except up to a finite limit • E.g., language of balanced parentheses is not regular: $\{\, (^i\, )^i \mid i \ge 0 \,\}$
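A minimal sketch of why this language needs counting. The function name `is_nested_parens` is an assumption made for illustration. The recognizer keeps two unbounded integer counters, which is exactly the kind of memory a finite automaton does not have.

```python
def is_nested_parens(s: str) -> bool:
    """Return True iff s is i '(' followed by i ')' for some i >= 0."""
    i = 0
    opens = 0
    # Count the leading run of '('. This counter has no fixed upper bound.
    while i < len(s) and s[i] == "(":
        opens += 1
        i += 1
    closes = 0
    # Count the run of ')' that follows.
    while i < len(s) and s[i] == ")":
        closes += 1
        i += 1
    # Accept only if the whole input was consumed and the counts agree.
    return i == len(s) and opens == closes
```

Any fixed number of states can only tell apart a fixed number of values of `opens`, so some pair of different counts would have to be treated the same.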
The Functionality of the Parser • Input: sequence of tokens from lexer • Output: parse tree of the program
Example • Java expr x == y ? 1 : 2 • Parser input ID == ID ? INT : INT • Parser output: a parse tree whose root ?: has three children: == (with children ID and ID), INT, and INT
The Role of the Parser • Not all sequences of tokens are programs • The parser must distinguish between valid and invalid sequences of tokens • We need • A language for describing valid sequences of tokens • A method for distinguishing valid from invalid sequences of tokens
Context-Free Grammars • Programming language constructs have recursive structure • An EXPR is if EXPR then EXPR else EXPR fi , or while EXPR loop EXPR pool , or … • Context-free grammars are a natural notation for this recursive structure
CFGs (Cont.) • A CFG consists of • A set of terminals T • A set of non-terminals N • A start symbol S (a non-terminal) • A set of productions: assuming $X \in N$, each has the form $X \to \varepsilon$, or $X \to Y_1 Y_2 \dots Y_n$ where $Y_i \in N \cup T$
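The four components can be written down directly as a data structure. This is a sketch under assumptions: the class name `Grammar`, its field names, and the `check` method are illustrative and do not come from the slides. An empty right-hand side stands for the empty string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # T
    nonterminals: frozenset   # N
    start: str                # S, must be a member of N
    productions: tuple        # pairs (X, (Y1, ..., Yn)); () encodes epsilon

    def check(self) -> bool:
        """Verify the side conditions of the definition."""
        symbols = self.terminals | self.nonterminals
        return (
            self.start in self.nonterminals
            and not (self.terminals & self.nonterminals)
            and all(
                lhs in self.nonterminals and all(y in symbols for y in rhs)
                for lhs, rhs in self.productions
            )
        )


# Assumed example grammar: S -> ( S ) and S -> epsilon.
PARENS = Grammar(
    terminals=frozenset({"(", ")"}),
    nonterminals=frozenset({"S"}),
    start="S",
    productions=(("S", ("(", "S", ")")), ("S", ())),
)
```

`PARENS.check()` holds because every left-hand side is in N and every right-hand-side symbol is in N or T.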
Notational Conventions • In these lecture notes • Non-terminals are written upper-case • Terminals are written lower-case • The start symbol is the left-hand side of the first production
Examples of CFGs • Expr $\to$ if Expr then Expr else Expr | while Expr do Expr | id
Examples of CFGs • Simple arithmetic expressions: E $\to$ E + E | E * E | ( E ) | id
The Language of a CFG • Read productions as replacement rules: • $X \to Y_1 \dots Y_n$ means X can be replaced by $Y_1 \dots Y_n$ • $X \to \varepsilon$ means X can be erased (replaced with the empty string)
Key Idea • 1. Begin with a string consisting of the start symbol “S” • 2. Replace any non-terminal X in the string by the right-hand side of some production for X • 3. Repeat (2) until there are no non-terminals in the string
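The three steps can be sketched as a small rewriting loop. The grammar is the if/while/id example from earlier in the slides; the helper names `replace` and `is_sentence` and the dictionary layout are assumptions made for illustration.

```python
# Non-terminal -> list of right-hand sides. Every other symbol is a terminal.
GRAMMAR = {
    "Expr": [
        ["if", "Expr", "then", "Expr", "else", "Expr"],
        ["while", "Expr", "do", "Expr"],
        ["id"],
    ]
}


def replace(form, pos, k):
    """Step 2: replace the non-terminal at form[pos] by its k-th right-hand side."""
    head = form[pos]
    if head not in GRAMMAR:
        raise ValueError(f"{head!r} is a terminal and cannot be replaced")
    return form[:pos] + GRAMMAR[head][k] + form[pos + 1:]


def is_sentence(form):
    """Step 3 stops once no non-terminals remain."""
    return all(sym not in GRAMMAR for sym in form)


form = ["Expr"]             # step 1: start symbol
form = replace(form, 0, 1)  # while Expr do Expr
form = replace(form, 3, 2)  # while Expr do id
form = replace(form, 1, 2)  # while id do id
```

The second and third replacements pick non-terminals in an arbitrary order, which the Key Idea allows; the final string is the same either way.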
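The Language of a CFG (Cont.) • More formally, write $X_1 \cdots X_i \cdots X_n \Rightarrow X_1 \cdots X_{i-1}\, Y_1 \cdots Y_m\, X_{i+1} \cdots X_n$ if there is a production $X_i \to Y_1 \cdots Y_m$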
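The Language of a CFG (Cont.) • Write $X_1 \cdots X_n \Rightarrow^* Y_1 \cdots Y_m$ if $X_1 \cdots X_n \Rightarrow \cdots \Rightarrow Y_1 \cdots Y_m$ in 0 or more steps, where each $\Rightarrow$ replaces one non-terminal using one production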
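The Language of a CFG • Let G be a context-free grammar with start symbol S. Then the language of G is: $L(G) = \{\, a_1 \cdots a_n \mid S \Rightarrow^* a_1 \cdots a_n \text{ and every } a_i \text{ is a terminal} \,\}$, where $\Rightarrow^*$ means "derives in 0 or more replacement steps"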
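To make the definition concrete, the sketch below enumerates the terminal strings reachable from the start symbol within a bounded number of replacement steps. The function name `sentences`, the step bound, and the example grammar S $\to$ ( S ) | $\varepsilon$ are assumptions for illustration. Only the left-most non-terminal is replaced at each step, which loses no strings because every derivable string also has a left-most derivation.

```python
def sentences(grammar, start, max_steps):
    """Terminal strings derivable from start in at most max_steps steps."""
    frontier = {(start,)}   # sentential forms that still contain a non-terminal
    found = set()
    for _ in range(max_steps):
        next_frontier = set()
        for form in frontier:
            # Position of the left-most non-terminal, if any.
            pos = next((i for i, s in enumerate(form) if s in grammar), None)
            if pos is None:
                continue
            for rhs in grammar[form[pos]]:
                new = form[:pos] + tuple(rhs) + form[pos + 1:]
                if all(s not in grammar for s in new):
                    found.add("".join(new))   # only terminals left: in L(G)
                else:
                    next_frontier.add(new)
        frontier = next_frontier
    return found


# Assumed grammar for nested parentheses; () encodes the empty string.
PARENS = {"S": [("(", "S", ")"), ()]}
print(sorted(sentences(PARENS, "S", 3)))   # ['', '()', '(())']
```

L(G) itself is usually infinite, so the step bound only yields a finite prefix of the language.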
Terminals • Terminals are so called because there are no rules for replacing them • Once generated, terminals are permanent • Terminals ought to be tokens of the language
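Examples • L(G) is the language of CFG G • Strings of balanced parentheses: $\{\, (^i\, )^i \mid i \ge 0 \,\}$ • Two grammars: S $\to$ ( S ) and S $\to$ $\varepsilon$ written as separate productions, OR S $\to$ ( S ) | $\varepsilon$ written with alternatives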
Arithmetic Example • Simple arithmetic expressions: E $\to$ E + E | E * E | ( E ) | id • Some elements of the language: id, id + id, ( id ), id * id, ( id ) * id, id * ( id )
Notes The idea of a CFG is a big step. But: • Membership in a language is “yes” or “no”; also need parse tree of the input • Must handle errors gracefully • Need an implementation of CFG’s (e.g., Cup)
More Notes • Form of the grammar is important • Many grammars generate the same language • Tools are sensitive to the grammar • Note: Tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice
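Derivations and Parse Trees • A derivation is a sequence of production applications $S \Rightarrow \cdots \Rightarrow \cdots$ • A derivation can be drawn as a tree • Start symbol is the tree’s root • For a production $X \to Y_1 \dots Y_n$, add children $Y_1, \dots, Y_n$ to node $X$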
Derivation Example • Grammar: E $\to$ E + E | E * E | ( E ) | id • String: id * id + id
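Derivation Example (Cont.) • Derivation: E $\Rightarrow$ E + E $\Rightarrow$ E * E + E $\Rightarrow$ id * E + E $\Rightarrow$ id * id + E $\Rightarrow$ id * id + id • Parse tree: root E with children E, +, E; the left E has children E, *, E; each remaining E has the single child id
Derivation in Detail (2) • E $\Rightarrow$ E + E • Tree: root E gains children E, +, E
Derivation in Detail (3) • $\Rightarrow$ E * E + E • Tree: the left E gains children E, *, E
Derivation in Detail (4) • $\Rightarrow$ id * E + E • Tree: the first E of E * E gains child id
Derivation in Detail (5) • $\Rightarrow$ id * id + E • Tree: the second E of E * E gains child id
Derivation in Detail (6) • $\Rightarrow$ id * id + id • Tree: the right E of E + E gains child id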
Notes on Derivations • A parse tree has • Terminals at the leaves • Non-terminals at the interior nodes • An in-order traversal of the leaves is the original input • The parse tree shows the association of operations; the input string does not
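A short sketch of these notes, assuming a parse tree is encoded as nested tuples `(label, child, ...)` with plain strings as terminal leaves. The encoding and the names `TREE_MUL_FIRST`, `TREE_ADD_FIRST`, and `leaves` are illustrative only. Two trees with different association of operations have the same leaves.

```python
# (id * id) + id: the multiplication sits lower in the tree.
TREE_MUL_FIRST = (
    "E",
    ("E", ("E", "id"), "*", ("E", "id")),
    "+",
    ("E", "id"),
)

# id * (id + id): the addition sits lower in the tree.
TREE_ADD_FIRST = (
    "E",
    ("E", "id"),
    "*",
    ("E", ("E", "id"), "+", ("E", "id")),
)


def leaves(node):
    """Collect the terminals at the leaves, left to right."""
    if isinstance(node, str):        # a terminal
        return [node]
    _label, *children = node         # an interior node labelled by a non-terminal
    out = []
    for child in children:
        out.extend(leaves(child))
    return out


print(" ".join(leaves(TREE_MUL_FIRST)))   # id * id + id
print(" ".join(leaves(TREE_ADD_FIRST)))   # id * id + id
```

Both traversals give back the same input string, so the grouping information lives only in the tree shape.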
Left-most and Right-most Derivations • The example is a left-most derivation • At each step, replace the left-most non-terminal • There is an equivalent notion of a right-most derivation
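Right-most Derivation in Detail (3) • E $\Rightarrow$ E + E $\Rightarrow$ E + id • Tree: the right E of E + E gains child id
Right-most Derivation in Detail (4) • $\Rightarrow$ E * E + id • Tree: the left E of E + E gains children E, *, E
Right-most Derivation in Detail (5) • $\Rightarrow$ E * id + id • Tree: the second E of E * E gains child id
Right-most Derivation in Detail (6) • $\Rightarrow$ id * id + id • Tree: the first E of E * E gains child id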
Derivations and Parse Trees • Note that right-most and left-most derivations have the same parse tree • The difference is the order in which branches are added
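The point can be checked mechanically: starting from one fixed parse tree, expanding the left-most or the right-most pending non-terminal gives two different derivations of the same string. This is a sketch assuming trees are nested tuples `(label, child, ...)` with strings as terminals; the names `TREE`, `show`, and `derivation` are illustrative.

```python
TREE = (
    "E",
    ("E", ("E", "id"), "*", ("E", "id")),
    "+",
    ("E", "id"),
)


def show(form):
    """Print subtrees by their root label and terminals as themselves."""
    return " ".join(n if isinstance(n, str) else n[0] for n in form)


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding the tree node by node."""
    form = [tree]                    # pending subtrees and finished terminals
    steps = [show(form)]
    while any(isinstance(n, tuple) for n in form):
        pending = [i for i, n in enumerate(form) if isinstance(n, tuple)]
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1:]  # replace the node by its children
        steps.append(show(form))
    return steps


print(" => ".join(derivation(TREE, leftmost=True)))
print(" => ".join(derivation(TREE, leftmost=False)))
```

Both runs visit exactly the nodes of `TREE`; only the order in which the branches are expanded differs.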
Summary of Derivations • We are not just interested in whether $s \in L(G)$ • We need a parse tree for s • A derivation defines a parse tree • But one parse tree may have many derivations • Left-most and right-most derivations are important in parser implementation
Issues • A parser consumes a sequence of tokens s and produces a parse tree • Issues: • How do we recognize that $s \in L(G)$? • A parse tree of s describes how $s \in L(G)$ • Ambiguity: more than one parse tree (interpretation) for some string s • Error: no parse tree for some string s • How do we construct the parse tree?
Ambiguity • Grammar E $\to$ E + E | E * E | ( E ) | int • String int * int + int
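Ambiguity (Cont.) • This string has two parse trees • Tree 1: root E with children E, +, E, where the left E has children E, *, E; this groups the string as (int * int) + int • Tree 2: root E with children E, *, E, where the right E has children E, +, E; this groups the string as int * (int + int)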
Ambiguity (Cont.) • A grammar is ambiguous if it has more than one parse tree for some string • Equivalently, there is more than one right-most or left-most derivation for some string • Ambiguity is bad • Leaves meaning of some programs ill-defined • Ambiguity is common in programming languages • Arithmetic expressions • IF-THEN-ELSE
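Ambiguity can be observed by counting parse trees. The sketch below is specific to the grammar E $\to$ E + E | E * E | ( E ) | int and counts the trees for each token span with memoization; the function name `count_parse_trees` is an assumption for illustration, and this is not how a production parser detects ambiguity.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E + E | E * E | ( E ) | int."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        """Number of parse trees deriving tokens[i:j] from E."""
        n = 0
        if j - i == 1 and tokens[i] == "int":               # E -> int
            n += 1
        if j - i >= 3 and tokens[i] == "(" and tokens[j - 1] == ")":
            n += count(i + 1, j - 1)                         # E -> ( E )
        for k in range(i + 1, j - 1):                        # E -> E op E
            if tokens[k] in ("+", "*"):
                n += count(i, k) * count(k + 1, j)
        return n

    return count(0, len(tokens))


print(count_parse_trees("int * int + int".split()))   # 2
```

A count above 1 for any string shows that the grammar is ambiguous; a count of 0 means the string is not in the language.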
Dealing with Ambiguity • There are several ways to handle ambiguity • Most direct method is to rewrite the grammar unambiguously • E $\to$ T + E | T • T $\to$ int * T | int | ( E ) • Enforces precedence of * over +
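One way to see the precedence at work is a small recursive-descent parser that follows the rewritten grammar, with the common prefixes T and int factored out by hand. This is a sketch: the function name `parse` and the nested-tuple output are assumptions, and the result is a simplified operator tree rather than the full parse tree.

```python
def parse(tokens):
    """Parse tokens under E -> T + E | T and T -> int * T | int | ( E )."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r} at {pos}, found {peek()!r}")
        pos += 1

    def parse_e():                    # E -> T + E | T
        left = parse_t()
        if peek() == "+":
            eat("+")
            return ("+", left, parse_e())
        return left

    def parse_t():                    # T -> int * T | int | ( E )
        if peek() == "(":
            eat("(")
            inner = parse_e()
            eat(")")
            return inner
        eat("int")
        if peek() == "*":
            eat("*")
            return ("*", "int", parse_t())
        return "int"

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {peek()!r} at {pos}")
    return tree


print(parse("int * int + int".split()))   # ('+', ('*', 'int', 'int'), 'int')
```

Because * is only introduced inside T, a product always ends up below any sum that contains it, which is the single grouping the rewritten grammar allows.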
Ambiguity: The Dangling Else • Consider the grammar • E $\to$ if E then E | if E then E else E | OTHER • This grammar is also ambiguous
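The Dangling Else: Example • The expression if E1 then if E2 then E3 else E4 has two parse trees • First form: the outer if has children E1, the inner if (with children E2 and E3), and E4, so else E4 belongs to the outer if • Second form: the outer if has children E1 and the inner if (with children E2, E3, and E4), so else E4 belongs to the inner if • Typically we want the second form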
The Dangling Else: A Fix • else matches the closest unmatched then • We can describe this in the grammar • E $\to$ MIF /* all then are matched */ | UIF /* some then are unmatched */ • MIF $\to$ if E then MIF else MIF | OTHER • UIF $\to$ if E then E | if E then MIF else UIF • Describes the same set of strings
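The rule "else matches the closest unmatched then" can also be sketched in a hand-written parser, which is a different technique from the MIF/UIF grammar: the parser simply attaches an else to the innermost if that is still open. The names `parse_if` and `KEYWORDS` are assumptions, and OTHER is modelled as any single non-keyword token such as E1.

```python
KEYWORDS = {"if", "then", "else"}


def parse_if(tokens):
    """Parse if/then/else so that each else binds to the nearest open if."""
    pos = 0

    def expect(tok):
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != tok:
            raise SyntaxError(f"expected {tok!r} at position {pos}")
        pos += 1

    def parse_e():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        if tokens[pos] != "if":
            if tokens[pos] in KEYWORDS:
                raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
            pos += 1
            return tokens[pos - 1]            # OTHER
        expect("if")
        cond = parse_e()
        expect("then")
        then_branch = parse_e()
        # The innermost call reaches this test first, so it claims the else.
        if pos < len(tokens) and tokens[pos] == "else":
            expect("else")
            return ("if", cond, then_branch, parse_e())
        return ("if", cond, then_branch)

    tree = parse_e()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected {tokens[pos]!r} at position {pos}")
    return tree


print(parse_if("if E1 then if E2 then E3 else E4".split()))
# ('if', 'E1', ('if', 'E2', 'E3', 'E4'))
```

The printed tree is the second form from the example: else E4 belongs to the inner if.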