400 likes | 411 Views
Learn about parsing algorithms, LR(0) grammars, and the process of parsing computer programs using these techniques. Discover how to optimize parsing speed and understand the concept of valid items and parse tree construction.
E N D
Fall 2011 The Chinese University of Hong Kong CSCI 3130: Automata theory and formal languages LR(0) grammars Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130
Parsing computer programs • First the javac compiler does a lexical analysis: if (n == 0) { return x; } if (ID == INT_LIT) { return ID; } ID = identifier (name of variable, procedure, class, ...) INT_LIT = integer literal (value) • The alphabet of java CFG consists of symbols like: S = {if, return, (, ){, }, ;, ==, ID, INT_LIT, ...}
Parsing computer programs if (n == 0) { return x; } Statement if (ID == INT_LIT) { return ID; } ifParExpressionStatement (Expression) Block Expression ExpressionRest { BlockStatements } BlockStatement Primary Infixop Expression Statement Identifier == Primary return Expression ; ID Literal Primary INT_LIT Identifier the parse tree of a java statement ID
CFG of the java programming language Identifier: ID QualifiedIdentifier: Identifier { . Identifier } Literal: IntegerLiteral FloatingPointLiteral CharacterLiteral StringLiteral BooleanLiteral NullLiteral Expression: Expression1 [AssignmentOperator Expression1]] AssignmentOperator: = += -= *= /= &= |= … from http://java.sun.com/docs/books/jls /second_edition/html/syntax.doc.html#52996
Parsing java programs class Point2d { /* The X and Y coordinates of the point--instance variables */ private double x; private double y; private boolean debug; // A trick to help with debugging public Point2d (double px, double py) { // Constructor x = px; y = py; debug = false; // turn off debugging } public Point2d () { // Default constructor this (0.0, 0.0); // Invokes 2 parameter Point2D constructor } // Note that a this() invocation must be the BEGINNING of // statement body of constructor public Point2d (Point2d pt) { // Another consructor x = pt.getX(); y = pt.getY(); } } … Simple java program: about 500 symbols
Parsing algorithms • How long would it take to parse this program? • Can we parse faster? • No! CYK is the fastest known general-purposeparsing algorithm for CFGs try all parse trees about 1080 years CYK algorithm about 1 week!
Another way of thinking Scientist: Find an algorithm thatcan parse any CFG Engineer: Design your CFG so it can be parsedvery quickly
Parsing left to right S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) input: abaabbc S T A T a a b b c a b • • • • • • • • T A A Try to match to the left of •
Items • An item is a production augmented with a • • The item is complete if the • is the last symbol A aTb(4) S Tc(1) T TA(2) T A(3) A ab(5) A •aTb A a•Tb A aT•b A aTb• A •ab A a•b A ab• T •A T A• S •Tc S T•c S Tc• T •TA T T•A T TA•
Meaning of items S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) A a•Tb A a•b a a b b c a b • A aT•b T a a b b c a b • T A A Items represent possibilities at various stages of the parsing process
Meaning of items S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) A a•Ab A a•b a a b b c a b • A aT•b A aTb• A T T a a b b c a b • a a b b c a b • T T A A A A When a complete item occurs, a part of the parse tree is discovered
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree valid items registry A •ab A •aAb a a b b •
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree valid items registry A a•b A a•Ab a a b b • A •ab A •aAb
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree A a•Ab valid items registry A a•b A a•Ab a a b b • A a•b A a•Ab
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree A a•Ab valid items registry A a•b A a•Ab a a b b • A •ab A •aAb
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree A a•Ab A a•Ab valid items registry A ab• A a•Ab a a b b • A •ab A •aAb
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree A A a•Ab valid items registry A aA•b A ab• a a b b •
LR(0) parsing Move from left to right A aAb | ab Keep track of all possible valid items Prune the invalid items When a complete item occurs, build part of parse tree A A valid items registry A aAb• a a b b •
two kinds of actions A a•b A ab• A a•Ab A •aAb A •ab a a b b • • A no complete itemin registry exactly one complete item valid items registry valid items registry a a b b • shift reduce
LR(0) implementation: first take action valid items stack A aAb | ab S A •aAb A •ab A a•Ab A a•b A •aAb A •ab a S a a b b A A a•Ab A a•b A •aAb A •ab S aa aab R A ab• A S aA A aA•b R aAb A aAb• • • • • • A
valid item update rules a, b: terminals A, B, C: variables a, b, d: mixed strings notation initial valid items: S •a read a A aa•b A a•ab shift updates: read b disappear A a•ab read x disappear A a•Bb A aB•b A a•Bb reduce updates: reduce B disappear A a•Bb reduce C common updates: B •d A a•Bb
valid item update rules a, b: terminals A, B, C: variables a, b, d: mixed strings X: terminal or variable notation initial valid items: S •a read a A aa•b A a•ab shift updates: A aB•b A aX•b A a•Bb A a•Xb reduce updates: reduce B X common updates: B •d A a•Bb
LR(0) parsing: NFA representation a, b: terminals A, B, C: variables a, b, d: mixed strings X: terminal or variable notation e q0 S •a For every item S •a X A •X A X• For every item A •X e A •C C •d For every pair of items A •C, C •d
NFA example A •aAb A a•Ab A aA•b A aAb• q0 a b A A •ab A a•b A ab• A aAb | ab NFA alphabet is S = {a, b, A} a start state is q0 b other states are items
NFA to DFA conversion A •aAb A a•Ab A aA•b A aAb• q0 a a b b A A A •ab A a•b A ab• a A a•Ab A a•b A •aAb A •ab A aA•b A aAb• a A •aAb A •ab b b a, A qdie A ab• b, A qdie
Shift states and reduce states 3 2 1 5 4 b a A a A a•Ab A a•b A •aAb A •ab are shift states 1 2 3 A aA•b A aAb• A •aAb A •ab b are reduce states 4 5 A ab•
LR(0) parsing: second take 4 3 2 1 5 b a A a a b b action state stack 1 S A aAb | ab a a 2 S A a•Ab A a•b A •aAb A •ab A aA•b A aAb• How do we know what to reduce to? aa 2 S A •aAb A •ab ? b aab 5 R A ab• • • • •
remember state in stack! 5 3 2 1 4 b a A a a b b action state stack 1 S A aAb | ab a A 1a 2 S A a•Ab A a•b A •aAb A •ab A aA•b A aAb• backtrack two steps 1a2a 2 S A •aAb A •ab b 1a2a2b 5 R A ab• 1a2A • • • •
remember state in stack! 4 5 3 2 1 a b A a a b b A action state stack 1 S A aAb | ab a A 1a 2 S A a•Ab A a•b A •aAb A •ab A aA•b A aAb• 1a2a 2 S A •aAb A •ab b 1a2a2b 5 R A ab• 3 S 1a2A • • 4 R 1a2A3b
PDA for LR(0) parsing 4 5 3 2 1 a b A a a b b A action state stack 1 S A aAb | ab a A 1 2 S A a•Ab A a•b A •aAb A •ab A aA•b A aAb• 12 2 S A •aAb A •ab b 122 5 R A ab• 3 S 12 • • • • • 4 R 123
PDA for LR(0) parsing •A e, e/$ 4 5 3 1 2 a b A A, $/e a A• A a•Ab A a•b A •aAb A •ab A aA•b A aAb• pop stateb, statea pop stateb, stateA, statea A •aAb A •ab b take transition A out of statea take transition A out of statea push statea push statea A ab•
Example 1 L = {w#wR : w ∈ {a, b}*} A aAa | bAb | # a a A e A •aAa A a•Aa A aA•a A aAa• e e e e # q0 A •# A #• e e e b b A e A b•Ab A bA•b A bAb• A •bAb
Example 1 A aAa | bAb | e stack state action e 1 S A •aAa 5 6 1 3 7 4 8 2 1 4 S # A •bAb A #• 14 3 S A •# 143 2 R b # 143 5 S a A # 1435 7 R A a•Aa A b•Ab A 14 6 S a A •aAa A •aAa 146 8 R a b b A •bAb A •bAb A •# A •# A A A aA•a A bA•b A a a 4 4 A aAa• A bAb• # b a a b • • • • • •
LR(0) grammars and deterministic PDAs • The PDA for LR(0) parsing is deterministic • Some CFLs require non-deterministic PDAs, e.g. • What goes wrong when we do LR(0) parsing on L? L = {wwR : w ∈ {a, b}*}
Example 2 L = {wwR : w ∈ {a, b}*} A aAa | bAb | e a a A e A •aAa A a•Aa A aA•a A aAa• e e e e q0 A • e e e b b A e A b•Ab A bA•b A bAb• A •bAb
Example 2 L = {wwR : w ∈ {a, b}*} A aAa | bAb | e A •aAa A •bAb A • b a A a•Aa A b•Ab a A •aAa A •aAa b A •bAb A •bAb A • A • A A A aA•a A bA•b input: abba a a shift-reduce conflict 4 4 A aAa• A bAb•
Parsing computer programs if (n == 0) { return x; } else { return x + 1; } Statement ifParExpressionStatement (Expression) Block Expression ExpressionRest { BlockStatements } BlockStatement Primary Infixop Expression Statement Identifier == Primary return Expression ; ID Literal Primary INT_LIT Identifier ID
Parsing computer programs if (n == 0) { return x; } else { return x + 1; } Statement ifParExpressionStatement elseStatement Block Block (Expression) ... ... ... LR(0) parsers cannot tell apart if ... then from if ... then ... else ID
When you can’t LR(0) parse • LR(0) parser can perform two actions: • What if: no complete itemis valid there is one valid item,and it is complete reduce (R) shift (S) some valid itemscomplete, some not more than one validcomplete item R / R conflict S / R conflict
Hierarchy of context-free grammars context-free grammars LR(∞) grammars … to be continued… java perl python … LR(1) grammars LR(0) grammars parse using LR(0) algorithm