CYK Parser

CYK Parser Von Carla und Cornelia Kempa

Overview

Cocke Younger Kasami -method

Recognition phase

Example grammar • Number(s)  Integer | Real • Integer  Digit | Integer Digit • Real  Integer Fraction Scale • Fraction  .Integer • Scale  e Sign Integer | Empty • Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Empty  ɛ • Sign  + | -

Example Sentence: 32.5e+1 • 1. concentrate on the substrings of the input sentence

Building the recognition table

32.5e +1 is in the language • What problems can we already see in this example?

Another complication: Ɛ- rules Input : 43.1

The ɛ- Problem Shortest substrings of any input sentence : ɛ-substrings We must compute Rɛ the set of non-terminals that derive ɛ Rɛ = { Empty, Scale }

Non- empty substrings of the input sentence • Input : z = z1 z2 z3 z4 ….zn • Compute the set of Non-Terminals that derive the substring of z starting at position i, of length l.

Terminology (also on the handout) • i index we are starting at • l length of this substring • R s i,l set of Non-Terminals deriving the substring s i, l • S i, 0 = ɛ • Set of Non- Terminals that derive ɛ : R s i,0 = R ɛ

S i, l =z i z i+1 …… z i+ l-1

The set of Non- Terminals deriving the substring s i, l : R si, l 1.) substrings of length 0 S i, 0 = ɛ and R si, l = Rɛ 2.) short substrings 3.) longer substrings (say l = j ) All the information on substrings with l < j is available

Check each RH-side (Right-Hand -side) in the grammar to see if it derives s i, l • L  A1 ….Am S i, l ( divided into m segments (= possibly empty)) A1  first segment of s i, l A2  second segment of s i, l …. ….

A 1 ….Am  s i,l • So A1  first part of s i,l (let´s say A1 has to derive a first part of s i, l of length k) A1  s i, k A1 is in the set R s i,k

A 1 ….Am  s i,l • Assuming this A2…Am has to derive the rest: A2 … Am  Si+k, l-k This is attempted for every k

Problems with this Approach 1) Consider A2…Am m could be 1 and A1 a Non-terminal  We are Dealing with a unit- rule A1 must derive the whole substring s i, l and thus be a member of R s i, l But that´s the set we are computing right now …

Solution to this problem • A1  s i, l • Somewhere along the derivation there must be a first step not using a unit rule A1  B … C * s i, l C is the first Non-Terminal using a non-unit-rule in the derivation

Solution cont. At some stage C is added to Rs i, l If we repeat the process again and again At some point B will be added and in the next step A1 will be added  We have to repeat the process again and again until no new Non-Terminals are added to R s i,l

Problem 2 Ɛ-rules Consider all but one of the At derive Ɛ B  A1 A2 A3 A4 A5 …. At B and A1 - t are Non-Terminals A2 – At derive Ɛ So what stays is : B  A1 A unit-rule

We have computed all the Rs i,l • If S is a member of Rs 1, n the start symbol derives z (=s 1, n) (the input string)

CYK recognition with a grammar in ****- form: • What are the Restrictions we want to have on our grammar ?

Useful Restrictions • No ɛ- rules • No unit-rules • Limit the length of the right- hand side of each rule, say to two • What we get out of this: • A  a • A  BC • Where a is a terminal and ABC are Non- Terminals

Chomsky-Normal-Form… (… not only to annoy students ) • Perfect grammar for CYK

How CYK works for a grammar in CNF • Rɛ is empty • R s i, 1 can be read directly from the rules (A  a) A rule A  BC can never derive a single terminal

Procedure • Iteratively (as before) : • 1) Fill the sets R s, 1 directly • 2) Process all substrings of length 1 • 3) Process all substrings of length 2 • 4) Process all substrings of length l • For the first step we use the rules of the form A  a • For all the following steps we have to use the rules of the form: A  BC

CYK and CNF Question the CYK-Parser has to answear is: Does such a k exist?

Answearing this question is easy: • Just try all possibilities • no problem since you are a computer ;-) • Range : from 1 to (l-1) • All the sets R s i,k and R s i+k , l-1 have already been computed at this point

Transform our sample CF-grammar into Chomsky Normal Form • Overview • 1) eliminate ɛ-rules • 2) eliminate unit-rules • 3) remove non-productive non-terminals • 4) remove non –reachable non-terminals • 5) modify the rest until all grammar rules are of the form A  a , A  BC

Our number grammar in CNF • Number(s)  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Number(s)  Integer Digit • Number(s)  N1 Scale´ | Integer Fraction • N1  Integer Fraction • Integer 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Integer  Integer Digit • Fraction  T1 Integer • T1  . • Scale ´  N2 Integer • N2  T2 Sign • T2  e • Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Sign  + | -

Building the recognition table • Input : Our example grammar in CNF input sentence: 32.5 e + 1

Building the recognition table • 1) bottom-row : read directly from the grammar (rules of the form A a ) • 2) Check each RHS in the grammar

Check each RHS of the grammar • Two Ways: Example: 2.5 e ( = s 2, 4) • 1) check each RHS e.g N1 Scale´ • 2) compute possible RH-Sides from the recognition table

How this is done 1) N1 not in R s 2, 1 or R s 2, 2 N1 is a member of R s 2, 3 But Scale´ is not a member of R s 5, 1 2) R s 2, 4 is the set of Non- Terminals that have a RHS AB where either: A in R s 2, 1 and B in R s 3, 3 A in R s 2, 2 and B in R s 4, 2 A in R s 2, 3 and B in R s 5, 1 Possible combinations: N1 T2 or Number T2 In our grammar we do not have such a RHS, so nothing is added to R s 2, 4.

Recognition table

Recognition table (well-formed substring table)

Computing R s i, l:follow the arrows V and W simultaneously A  BC , B a member of a set on the V arrow , C a member of a set on the W arrow

Comparison • This process is much less complicated than the one we saw before • Why?

Conclusion • This process is much less complicated • Reasons: 1) We do not have to repeat the process again and again until no new Non-Terminals are added to R s i,l (The substrings we are dealing with are really substrings and cannot be equal to the string we start with)

Reasons cont. 2) We only have to find one place where the substring must be split into two A  B C Here !

CYK Parser

CYK Parser

Presentation Transcript

The CYK Algorithm

Parser

LL PARSER

SLR Parser

LALR PARSER

Parser Generators

Predictive Parser

The Parser

The CYK Algorithm

Parsing using CYK Algorithm

Stanford Parser

Lecture 9 Bottom-up parsers Datalog , CYK parser, Earley parser

FCA Parser

CYK Algorithm

Java Parser

LR parser

CYK Parser

The CYK Parsing Method

The Parser

XML Parser