230 likes | 395 Views
Transformation schemes for context-free grammars structural, algorithmic, linguistic applications. Eli Shamir Hebrew university of Jerusalem, Israel ISCOL Haifa university September 2014. O verview. CFG- Devices producing strings & their derivation trees (with weights)
E N D
Transformation schemes for context-free grammarsstructural, algorithmic, linguistic applications Eli Shamir Hebrew university of Jerusalem, Israel ISCOL Haifauniversity September 2014
Overview • CFG- Devices producing strings & their derivation trees (with weights) • Top down schemes transforming the grammars • Driven by rotations operations-tree (BOT) • Preserving derivation trees, semi-ring weights Enhancing: property tests , parsing & optimal tree algorithms: time down to O(n ), space to O(n). • Decomposition of bounded ambiguity grammars (Sam Eilenberg’squestion [SE]) • Non-expansive [NE] (quasi-rational) grammars • Implications to NLP, sequence alignment, … 2
Schemes - simple to subtle • Chomsky’s normal form (CNF) • Elimination of redundant symbols, ε rules • Greibach’s normal form (GNF) (subtle) all rules are ATx. T terminal (lexicalization) GNF destroys derivation trees, however has many applications (structural…) Schemes for sub-classes of CFG (in parsing technology) deterministic, LR(k)…
Context Free Basics 1 Such a grammar G = (V,T,P,S = root) is a well known model to derive/generate a set of terminal strings in TG defines a derivation relation between strings overVUT: One stepxy: y is obtained from x by rewriting a single occurrence of some A by B1..Bkwhen A B1..Bk is production rule in P. Several stepsx y if x x1 …y LA(G) = {wεT | Aw}, L(G)=LS(G), the language generated by G. A derivation is best described by a labeled tree in which the k sons of a node labeled A are labeled B1, .., Bk.
Context Free Basics 2 Ambiguity-deg (Aw) = {number of distinct trees for (Aw), deg (GA)= max deg of (Aw). A-B- defines a partial order on V U T, denoted A>B. it induces a complete order on any branch of a derivation tree. B in G is pumping if B>B'>B. Then B' is also pumping; both belong to the pumping equivalence class [B].
Node Type and Spread Lemma • B Pumping, (ii) C pre-terminal – if NOT {C > B, B pumping} (iii) D spread – D is not pumping but D>B, B pumping. SPREAD LEMMA 1. Pre-terminal C derives a bounded number of bounded terminal strings. 2. In each derivation tree a spread node D derives a bounded sub-tree the leaves of which are terminals or pump nodes. 3. In G, each spread symbol D derives the bounded number of sub-trees, as mentioned in 2.
Non Expansive Grammars G is non-expansive (NE) if no production rule has the form B -B'-B''- where the B's are from the same pumping class Equivalently, no derivation B—B—B— is possible (sideway pumping is forbidden!). NE is the quasi- rational class, the substitution closure of linear grammars[1]. Our BOT scheme simplifies proofs of its known properties and new ones (parsing speed).
Bounded Operation Tree (BOT) BOT Tree-nodes are labeled by: • Current grammar as a product Π=P1…Pk • Current operation SPREAD / CYC / TTR (Depending on the type of the root of P1 or Pk) Determines the children nodes and their labels Root of BOT= #G, Leaves of BOT - linear G(i) Main Claim:each derivation tree for ww.r. to #G is mapped onto derivation tree for ƍww.r. to some G(i), (with the same weight) and V.V.
SPREAD / CYC / TTR Operations Type=SPREAD: Pk is split to U Q(j), the current grammar at j’th child is P1…Pk-1 * Q(j) Type=CYC: Pk is terminal, the (effective) current grammar at the single child is PkP1…Pk-1 Type=TTR, if the root of Pk is pumping: let M=P1…Pk-1 , N=Pk, the top trunk of N is rotated by 180° and mounted on M, so MN M*N^
Top Trunk Rotation of MN to (M*N^) N M* for trees: M M x1 180 y1 x2 x2 N^ y2 y2 x1 y1 EXIT N^ for strings: m x1x2 … n^ …y2y1 …y2y1 m x1x2 … n^ • Figure 1.1
Figure 1.2: TTR For grammars: N grammar (top trunk) M* grammar BB’C B’CB BDB’ B’BD BB^, B^α B root(M), root (M) α All other productions carry over from N to M*; those of M unchanged. The TTR rotation is invertible, one-one onto for the derivation trees, preserving ambiguity in ‘cyclic rotated’ sense.
Termination and Correctness TTR operations dominate the BOT scheme for NE grammars. The E-depth of N^ and of the two sides of the mounted trunk must decrease. The M* factors become taller and thinner until they become linear G(i). [without spread symbols] Claim: each derivation tree for ww.r. to #G is mapped to derivation tree for ƍww.r. to some G(i), (with the same weight) and V.V. ƍw= CYCLIC rotation of w. Holds for each SPREAD/CYC/TTR step!
Tabular Dynamic Prog. For parsing G (CYK/ Earley algorithm for terminal w of length n the table extends to items of rotated intervals [i+1, i+k (mod n), A BC], at the same cost. For linearG(i) total time cost is only O(n ) Space cost is O(n): one or few diagonals of width near k are kept in memory with pointers to few neighbors, enabling table reconstruction. • Just membership, or total weight algorithm, is in the parallel class NC(1), as for finite-state transductions. 2
Example (from [4]) R R • (M)(N)= (uIu )(vJv), u, vε {0.1}* = I = J u= reversal of u, • It has unbounded "direct (product) ambiguity" which increases the time in Earleyalgorithm. But after one TTR step MN is rotated to • (M*)(N^)= (v uIuv )(J) , which has a linear grammar,(of unbounded ambiguity degree) And all product ambiguity trees are rotated to union of trees for the linear M*N^. R R R
Decomposing Bounded Ambiguity SE Claim:Ambiguity-deg(G)= l < ∞. Then L (G) is a bounded-size union of languages of deg 1-grammars. This provides a positive answer to a question Sam Eilenberg posed, c. 1970. "Bounded size" means polynomial in |G|, the size of the grammar G, and l.
Expansive G and Ambiguity G expansive each pump symbol has ambiguity - degree=1or unbounded (exponential in length) B==> --B—B—B--… B--… (k times) If degB ≥ 2 then degB≥ 2 This is a corner stone in the proof of SE • Extending ambiguity to cyclic-closed strings is helpful (cf last slides) k
Proof of SE We briefly sketch the scheme for proving the claim. Starting with # G, and using the SPREAD LEMMA, the claim is reduces to: LEMMA Let Π = MN(1)…N(k), deg M=1 deg Π=l < ∞, N(i) are terminals or with pump roots then L(M) = U L(M(j)), jεJ and deg M(j)=1, J bounded. It suffices to prove it for a pair, starting with MN(1), after which M(j)N(2) are decomposed, and so on.
Proof of SE (2) For a pair MN the operation TTR is used transforming it to M*N^. Now deg M* <l and its ambiguity must be concentrated along the top trunk which it got from N. An easy direct argument shows it decomposes into a bounded union of M(j) of deg 1. As for N^ its E-depth is smaller than that of N. so for M(j)N^ we can use induction on the E-depth of the second factor or, more explicitly, continue the recursive descent on N^ until it is consumed.
Approximate G by NE G’ • Easy to achieve by duplicating symbols of the pumping classes. • Makes linguistic sense • Advantages of NE G’ using the BOT scheme view the linear G’i as finite-state transactions: powerful tool in several linguistic fields • Applications to Bio-informatic (stringology)? • Extension of NE condition to mildly context-sensitive models (LIG, TAG…)?
The Hardest Context Free Grammar The concept is due to S. Greibach. The simplest reduction is based on Shamir's homomorphism theorem([1]), mapping each b in T into a finite set φ(b) of strings over the vocabulary of the Dyck language and claiming that w is in L(G) if and only if φ(w) contains a string in the Dyck language (see the description in [1]). In fact, the categorical grammar model in the 1960 article ([2]) provides another homomorphism which makes it a hardest CFG.
However, those hardest CFG languages are inherently expansive. Indeed, an NE candidate grammar for Dyck will be negated by its BOT scheme, upon using local pump-shrinks, which for linear grammars can operate near any point of the (sufficiently long) main branch of non- terminals. We conjecture that any hardest CFG must be expansive. Note that finding a non-expansive one would entail O(n ) complexity of membership test for any context free grammar. 2
Ambiguity and Cyclic Rotation Ambiguity in natural languages can be resolved (or created) by cyclic rotation. Consider the bible verse in book of Job chapter 6 verse 14 (six Hebrew words). Translated to English: "a friend should extend mercy to the sufferer , even if he abandons God's fear." • The ambiguity here is anaphoric, does the pronoun "he" refer to the sufferer or to the friend? The poetic beautiful answer is: to both. • The rotated sentences, starting at the symbols # and $, resolve the ambiguity one way or the other. • Politically loaded example: the policeman shot the boy with the gun # $ # $
References 1. J. Autebert, J. Berstel and L. Boasson, Context-free language and pushdown automata. Chap. 3 In: handbook of formal languagesVol 1. G. Rozenberg and A. Salomaa (eds.), Springer-Verlag 1997. 2. Y. Bar-Hillel, H. Gaifman and E. Shamir, On categorical and phrase structure grammars. Bulletin research council of Israel, vol. 9f (1960), 1-16. 3. S. Greibach. The hardest context-free language. SIAM J. on computing 3 (1973), 304-310. 4. E. Shamir. Some inherently ambiguous context-free languages. Inf. and Control 18 (1971), 355-363.