290 likes | 444 Views
Fall 2009. The Chinese University of Hong Kong. CSC 3130: Automata theory and formal languages. Parsers for programming languages. Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130. CFG of the java programming language. Identifier: IDENTIFIER QualifiedIdentifier:
E N D
Fall 2009 The Chinese University of Hong Kong CSC 3130: Automata theory and formal languages Parsers for programming languages Andrej Bogdanov http://www.cse.cuhk.edu.hk/~andrejb/csc3130
CFG of the java programming language Identifier: IDENTIFIER QualifiedIdentifier: Identifier { . Identifier } Literal: IntegerLiteral FloatingPointLiteral CharacterLiteral StringLiteral BooleanLiteral NullLiteral Expression: Expression1 [AssignmentOperator Expression1]] AssignmentOperator: = += -= *= /= &= |= … from http://java.sun.com/docs/books/jls /second_edition/html/syntax.doc.html#52996
Parsing java programs class Point2d { /* The X and Y coordinates of the point--instance variables */ private double x; private double y; private boolean debug; // A trick to help with debugging public Point2d (double px, double py) { // Constructor x = px; y = py; debug = false; // turn off debugging } public Point2d () { // Default constructor this (0.0, 0.0); // Invokes 2 parameter Point2D constructor } // Note that a this() invocation must be the BEGINNING of // statement body of constructor public Point2d (Point2d pt) { // Another consructor x = pt.getX(); y = pt.getY(); } } … Simple java program: about 1000 symbols
Parsing algorithms • How long would it take to parse this? • Can we parse faster? • No! CYK is the fastest known general-purposeparsing algorithm exhaustive algorithm about 1080 years (longer than life of universe) CYK algorithm about 1 week!
Another way of thinking Scientist: Find an algorithm thatcan parse strings inany grammar Engineer: Design your grammar so it has a very fastparsing algorithm
An example Stack Input S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) Action a ab A T Ta Taa Taab TaA TaT TaTb TA T Tc S abaabbc baabbc aabbc aabbc aabbc abbc bbc bc bc bc c c c shift shift reduce (5) reduce (3) shift shift shift reduce (5) reduce (3) shift reduce (4) reduce (2) shift reduce (1) input: abaabbc S T A T T A A a a b b c a b
Items A aTb(4) S Tc(1) T TA(2) T A(3) A ab(5) A •aTb A a•Tb A aT•b A aTb• A •ab A a•b A ab• T •A T A• S •Tc S T•c S Tc• T •TA T T•A T TA• Stack Input Action a ab A T Ta abaabbc baabbc aabbc aabbc aabbc abbc shift shift reduce (5) reduce (3) shift shift • • • • • • Idea of parsing algorithm: Try to match complete items to top of stack
Some terminology Stack Input S Tc(1) T TA(2) | A(3) A aTb(4) | ab(5) Action a ab A T Ta Taa Taab TaA TaT TaTb TA T Tc S abaabbc baabbc aabbc aabbc aabbc abbc bbc bc bc bc c c c shift shift reduce (5) reduce (3) shift shift shift reduce (5) reduce (3) shift reduce (4) reduce (2) shift reduce (1) input: abaabbc handle valid items: a•Tb, a•b valid items: T•a, T•c, aT•b
Outline of LR(0) parsing algorithm • As the string is being read, it is pushed on a stack • Algorithm keeps track of all valid items • Algorithm can perform two actions: no complete itemis valid there is one valid item,and it is complete reduce shift
Running the algorithm Valid Items Input A Stack a aa aab aA aAb A aabb abb bb b b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A ab• A aA•b A aAb• S S S R S R A aAb | ab A aAb aabb
How to update valid items • Initial set of valid items • Updating valid items on “shift b” • After these updates, for every valid item A a•Cb andproductionC •d, we also addas a valid item S •a for every production S a is updated to A ab•b A a•bb disappears if X ≠ b A a•Xb a, b: terminals A, B: variables X, Y: mixed symbols a, b: mixed strings notation C •d
How to update valid items • Updating valid items on “reduce b to B” • First, we backtrack to valid items before reduce • Then, we apply same rules as for “shift B” (as if B were a terminal) is updated to A aB•b A a•Bb disappears if X ≠ B A a•Xb C •d is added for every valid item A a•Cband productionC •d
Viable item updates by NFA • States of NFA will be items (plus a start state q0) • For every item S •a we have a transition • For every item A •X we have a transition • For every item A a•Cb and production C •d e q0 S •a X A •X A X• e A •C C •d
Example A aAb | ab a A A aA•b A a•Ab A •aAb b q0 A aAb• A ab• A •ab A a•b a b
Convert NFA to DFA a 2 A a•Ab A a•b A •aAb A •ab 1 4 A a A •aAb A •ab A aA•b b b 5 3 A ab• A aAb• die states correspond to sets of valid items transitions are labeled by variables / terminals
Shift states and reduce states a 2 A a•Ab A a•b A •aAb A •ab 1 4 A a A •aAb A •ab A aA•b b b 5 3 A ab• A aAb• are shift states 1 2 4 are reduce states 3 5
Attempt at parsing with DFA DFA state Input A Stack a aa aab aA aabb abb bb b b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A ab• A aA•b 1 2 2 3 ? S S S R A aAb | ab A aAb aabb
Remember the state in stack! DFA state Input A Stack 1 1a2 1a2a2 1a2a2b3 1a2A4 1a2A4b5 1A aabb abb bb b b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A ab• A aA•b A aAb• 1 2 2 3 4 5 S S S R S R A aAb | ab A aAb aabb
Reconstructing the parse tree DFA state Input A Stack 1 12 122 1223 124 1245 1 aabb abb bb b b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A a•Ab A a•b A •aAb A •ab A ab• A aA•b A aAb• 1 2 2 3 4 5 S S S R S R A A a a b b • • • • • A aAb | ab A aAb aabb
LR(0) grammars and deterministic PDAs • The parsing procedure can be implemented by adeterministic pushdown automaton • A PDA is deterministic if in every state there is atmost one possible transition • for every input symbol and pop symbol, including e • Example: PDA for w#wR is deterministic, but PDA forwwR is not
LR(0) grammars and deterministic PDAs • Not every PDA can be made deterministic • Since PDAs are equivalent to CFGs, LR(0) parsing algorithm must fail for some CFG, e.g. • Why does LR(0) parsing algorithm fail? L = {wwR : w ∈ {a, b}*}
Example 1 L = {wwR : w ∈ {a, b}*} A aAa | bAb | e a A a e A •aAa A a•Aa A aA•a A aAa• e e e e q0 A • e e e b A b e A •bAb A b•Ab A bA•b A bAb•
Example 1 L = {wwR : w ∈ {a, b}*} A aAa | bAb | e a, b a A aAa• A a•Aa A b•Ab A •aAa A aA•a a, b A A •bAb A •aAa A bA•b A • A •bAb A • b A bAb• shift-reduce conflict
Example 1 L = {wwR : w ∈ {a, b}*} A aAa | bAb | e a, b a A aAa• A a•Aa A b•Ab A •aAa A aA•a a, b A A •bAb A •aAa A bA•b A • A •bAb A • b A bAb• shift or reduce? shift or reduce? input: abba
When you can’t LR(0) parse • Algorithm can perform two actions: • What if: no complete itemis valid there is one valid item,and it is complete reduce (R) shift (S) some valid itemscomplete, some not more than one validcomplete item R / R conflict S / R conflict
Example 2 L = {w#wR : w ∈ {a, b}*} A aAa | bAb | # a A a e A •aAa A a•Aa A aA•a A aAa• e e e e # q0 A •# A #• e e e b A b e A •bAb A b•Ab A bA•b A bAb•
Example 2 L = {wwR : w ∈ {a, b}*} A aAa | bAb | e a, b 4 2 a A aAa• 1 A a•Aa 3 A b•Ab A •aAa A aA•a a, b A A •bAb A •aAa A bA•b A •# A •bAb 5 A •# b A bAb• # # 6 A #• No S/R or R/R conflicts!
Example 2: parsing State A Stack 1 12 122 1226 1223 12234 123 1236 1 1 2 2 6 3 4 3 5 S S S R S R S R A A A # b a a b • • • • • •
Hierarchy of context-free grammars context-free grammars parse using CYK algorithm (slow) LR(∞) grammars … to be continued… java perl python … LR(1) grammars LR(0) grammars parse using LR(0) algorithm