530 likes | 559 Views
CYK Parser. Von Carla und Cornelia Kempa. Overview. C ocke Y ounger K asami -method. Recognition phase. Example grammar. Number(s) Integer | Real Integer Digit | Integer Digit Real Integer Fraction Scale Fraction . Integer Scale e Sign Integer | Empty
E N D
CYK Parser Von Carla und Cornelia Kempa
Example grammar • Number(s) Integer | Real • Integer Digit | Integer Digit • Real Integer Fraction Scale • Fraction .Integer • Scale e Sign Integer | Empty • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Empty ɛ • Sign + | -
Example Sentence: 32.5e+1 • 1. concentrate on the substrings of the input sentence
32.5e +1 is in the language • What problems can we already see in this example?
Another complication: Ɛ- rules Input : 43.1
The ɛ- Problem Shortest substrings of any input sentence : ɛ-substrings We must compute Rɛ the set of non-terminals that derive ɛ Rɛ = { Empty, Scale }
Non- empty substrings of the input sentence • Input : z = z1 z2 z3 z4 ….zn • Compute the set of Non-Terminals that derive the substring of z starting at position i, of length l.
Terminology (also on the handout) • i index we are starting at • l length of this substring • R s i,l set of Non-Terminals deriving the substring s i, l • S i, 0 = ɛ • Set of Non- Terminals that derive ɛ : R s i,0 = R ɛ
The set of Non- Terminals deriving the substring s i, l : R si, l 1.) substrings of length 0 S i, 0 = ɛ and R si, l = Rɛ 2.) short substrings 3.) longer substrings (say l = j ) All the information on substrings with l < j is available
Check each RH-side (Right-Hand -side) in the grammar to see if it derives s i, l • L A1 ….Am S i, l ( divided into m segments (= possibly empty)) A1 first segment of s i, l A2 second segment of s i, l …. ….
A 1 ….Am s i,l • So A1 first part of s i,l (let´s say A1 has to derive a first part of s i, l of length k) A1 s i, k A1 is in the set R s i,k
A 1 ….Am s i,l • Assuming this A2…Am has to derive the rest: A2 … Am Si+k, l-k This is attempted for every k
Problems with this Approach 1) Consider A2…Am m could be 1 and A1 a Non-terminal We are Dealing with a unit- rule A1 must derive the whole substring s i, l and thus be a member of R s i, l But that´s the set we are computing right now …
Solution to this problem • A1 s i, l • Somewhere along the derivation there must be a first step not using a unit rule A1 B … C * s i, l C is the first Non-Terminal using a non-unit-rule in the derivation
Solution cont. At some stage C is added to Rs i, l If we repeat the process again and again At some point B will be added and in the next step A1 will be added We have to repeat the process again and again until no new Non-Terminals are added to R s i,l
Problem 2 Ɛ-rules Consider all but one of the At derive Ɛ B A1 A2 A3 A4 A5 …. At B and A1 - t are Non-Terminals A2 – At derive Ɛ So what stays is : B A1 A unit-rule
We have computed all the Rs i,l • If S is a member of Rs 1, n the start symbol derives z (=s 1, n) (the input string)
CYK recognition with a grammar in ****- form: • What are the Restrictions we want to have on our grammar ?
Useful Restrictions • No ɛ- rules • No unit-rules • Limit the length of the right- hand side of each rule, say to two • What we get out of this: • A a • A BC • Where a is a terminal and ABC are Non- Terminals
Chomsky-Normal-Form… (… not only to annoy students ) • Perfect grammar for CYK
How CYK works for a grammar in CNF • Rɛ is empty • R s i, 1 can be read directly from the rules (A a) A rule A BC can never derive a single terminal
Procedure • Iteratively (as before) : • 1) Fill the sets R s, 1 directly • 2) Process all substrings of length 1 • 3) Process all substrings of length 2 • 4) Process all substrings of length l • For the first step we use the rules of the form A a • For all the following steps we have to use the rules of the form: A BC
CYK and CNF Question the CYK-Parser has to answear is: Does such a k exist?
Answearing this question is easy: • Just try all possibilities • no problem since you are a computer ;-) • Range : from 1 to (l-1) • All the sets R s i,k and R s i+k , l-1 have already been computed at this point
Transform our sample CF-grammar into Chomsky Normal Form • Overview • 1) eliminate ɛ-rules • 2) eliminate unit-rules • 3) remove non-productive non-terminals • 4) remove non –reachable non-terminals • 5) modify the rest until all grammar rules are of the form A a , A BC
Our number grammar in CNF • Number(s) 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Number(s) Integer Digit • Number(s) N1 Scale´ | Integer Fraction • N1 Integer Fraction • Integer 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Integer Integer Digit • Fraction T1 Integer • T1 . • Scale ´ N2 Integer • N2 T2 Sign • T2 e • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Sign + | -
Building the recognition table • Input : Our example grammar in CNF input sentence: 32.5 e + 1
Building the recognition table • 1) bottom-row : read directly from the grammar (rules of the form A a ) • 2) Check each RHS in the grammar
Check each RHS of the grammar • Two Ways: Example: 2.5 e ( = s 2, 4) • 1) check each RHS e.g N1 Scale´ • 2) compute possible RH-Sides from the recognition table
How this is done 1) N1 not in R s 2, 1 or R s 2, 2 N1 is a member of R s 2, 3 But Scale´ is not a member of R s 5, 1 2) R s 2, 4 is the set of Non- Terminals that have a RHS AB where either: A in R s 2, 1 and B in R s 3, 3 A in R s 2, 2 and B in R s 4, 2 A in R s 2, 3 and B in R s 5, 1 Possible combinations: N1 T2 or Number T2 In our grammar we do not have such a RHS, so nothing is added to R s 2, 4.
Computing R s i, l:follow the arrows V and W simultaneously A BC , B a member of a set on the V arrow , C a member of a set on the W arrow
Comparison • This process is much less complicated than the one we saw before • Why?
Conclusion • This process is much less complicated • Reasons: 1) We do not have to repeat the process again and again until no new Non-Terminals are added to R s i,l (The substrings we are dealing with are really substrings and cannot be equal to the string we start with)
Reasons cont. 2) We only have to find one place where the substring must be split into two A B C Here !