Tutorial - I 2nd September 2005
Problem 1: N-grams
Let C be a natural language corpus consisting of N tokens and V types $w_1, w_2, \ldots, w_V$. Let $p_i$ be the unigram probability of $w_i$ estimated from C. Also, it is given that $\forall i, j:\ i < j \Rightarrow p_i \ge p_j$.
• (a) Give an estimate for $p_i$ in terms of N, V, and i.
• (b) An artificial corpus $C_1$ was generated stochastically on the basis of the unigram probabilities $p_i$. Estimate the bigram probabilities $p_{ij} = P(w_i w_j)$ for $C_1$ in terms of N, V, i and j. [Hint: Use the expression for $p_i$ derived above.]
Problem 1: N-grams (contd.)
• (c) Show that the bigram distribution of $C_1$ does not follow Zipf’s law perfectly. For this, use the estimated expression for $p_{ij}$ derived in (b).
• (d) It is known that natural languages exhibit a Zipfian distribution over n-grams for all n. Can you use this fact to show that the bigram characteristics of $C_1$ are different from those of C?
• (e) Prove the generalization of (d), i.e. “for any finite n, a stochastically generated corpus $C_n$ based on the n-gram estimates of C has different (n+1)-gram characteristics from C”. What can you infer from this about n-gram models for natural languages?
Problem 2: Problematic AND! • Given below is a toy grammar G for English.
Problem 2: Problematic AND! (contd.)
• (a) Show that the sentence “John liked Mary and Mary liked John” is ambiguous for G. Point out the parse(s) that you think is/are semantically correct.
• (b) The sentence “John said John and Mary liked John” has the same structure as that of (a). Is the semantically valid parse for (a) also meaningful for (b)? Why or why not?
Problem 2: Problematic AND! (contd.)
• (c) The ambiguity arises because “and” can connect noun and verb phrases as well as clauses. Can you suggest a method to resolve this (at least partially) by
  • verb sub-categorization, and
  • introducing new POS categories (not for verbs) and augmenting G accordingly?
  [Assume that POS tagging is a step before parsing and that the process is perfect.]
Problem 3: Geo-Morph • Consider the following pairs of names of geographical locations and the corresponding terms for their dwellers. Let us call this system of morphology Geo-Morph.
Problem 3: Geo-Morph (contd.)
• (a) Classify Geo-Morph as a derivational/inflectional and linear/non-linear system of morphology.
• (b) Identify the set of affixes. Classify the examples as regular and irregular cases. Classify the regular cases further by the affixes.
• (c) Identify the different morphological paradigms. Can you classify the Geo-roots based on their graphemic/phonemic structure into these paradigms?
• (d) Design rewrite rules to capture orthographic changes for these paradigms.
Problem 3: Geo-Morph (contd.)
• (e) Predict the dweller terms for the following Geo-roots based on the morphological system developed with the help of the paradigms and the rewrite rules (c-d). Which of them do you think are used in standard English?
  • Sweden
  • Oman
  • Libya
  • Vienna
  • Europe
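Solution 1(a): N-grams
a) $\forall i, j:\ i < j \Rightarrow p_i \ge p_j$ implies that the $w_i$ are sorted in descending order of unigram probability, i.e. of frequency. In other words, the rank (according to frequency) of $w_i$ is i. According to Zipf’s law, frequency × rank = constant, so $p_i = k/i$ for some constant k. Since the probabilities sum to 1, $k \sum_{i=1}^{V} 1/i = 1$, and because $\sum_{i=1}^{V} 1/i \approx \ln V$ we get $p_i \approx 1/(i \ln V)$, the form used in (b).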
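The closed form that Zipf’s law and the normalization $\sum_i p_i = 1$ lead to, $p_i \approx 1/(i \ln V)$ (the form used in Solution 1(b)), can be checked numerically. The following is a minimal Python sketch; the function names and the choice V = 10000 are illustrative assumptions, not part of the tutorial.

```python
import math


def zipf_unigram_probs(V):
    """Exactly normalized Zipfian probabilities p_i = 1 / (i * H_V), i = 1..V."""
    harmonic = sum(1.0 / i for i in range(1, V + 1))  # H_V, the V-th harmonic number
    return [1.0 / (i * harmonic) for i in range(1, V + 1)]


def zipf_unigram_approx(i, V):
    """Closed-form approximation p_i ~ 1 / (i * ln V)."""
    return 1.0 / (i * math.log(V))


V = 10000  # illustrative vocabulary size (assumption)
exact = zipf_unigram_probs(V)
print("sum of exact probabilities:", sum(exact))
for rank in (1, 10, 100):
    print(rank, exact[rank - 1], zipf_unigram_approx(rank, V))
```

Because $H_V \approx \ln V + 0.577$, the approximation slightly overestimates every $p_i$, and the gap shrinks only slowly as V grows.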
Solution 1(b): N-grams
b) Since $C_1$ was generated stochastically based on the unigram probabilities only, any two consecutive tokens $t_s$ and $t_{s+1}$ in $C_1$ were generated independently of each other. In other words, the events $t_s = w_i$ and $t_{s+1} = w_j$ are independent. Therefore,
$p_{ij} = P(t_s = w_i \wedge t_{s+1} = w_j) = P(t_s = w_i)\,P(t_{s+1} = w_j) = p_i p_j \approx 1/(ij \ln^2 V)$
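The independence argument can also be checked empirically: sample a corpus token by token from a Zipfian unigram distribution and compare the observed bigram relative frequencies with the products $p_i p_j$. This is a rough sketch under assumed settings (V = 50, N = 200000, a fixed random seed, and hypothetical function names); types are represented by their ranks 1..V.

```python
import random
from collections import Counter


def zipf_probs(V):
    """Normalized Zipfian unigram probabilities for ranks 1..V."""
    harmonic = sum(1.0 / i for i in range(1, V + 1))
    return [1.0 / (i * harmonic) for i in range(1, V + 1)]


def generate_unigram_corpus(probs, n_tokens, seed=0):
    """Draw every token independently from the unigram distribution."""
    rng = random.Random(seed)
    types = list(range(1, len(probs) + 1))
    return rng.choices(types, weights=probs, k=n_tokens)


def bigram_relative_freqs(corpus):
    """Relative frequency of each adjacent pair (t_s, t_{s+1})."""
    counts = Counter(zip(corpus, corpus[1:]))
    total = len(corpus) - 1
    return {pair: c / total for pair, c in counts.items()}


V, N = 50, 200_000  # illustrative sizes (assumptions)
probs = zipf_probs(V)
freqs = bigram_relative_freqs(generate_unigram_corpus(probs, N))
for i, j in [(1, 1), (1, 2), (2, 3)]:
    observed = freqs.get((i, j), 0.0)
    predicted = probs[i - 1] * probs[j - 1]
    print((i, j), round(observed, 5), round(predicted, 5))
```

The observed and predicted values should agree up to sampling noise, which is what the independence of $t_s$ and $t_{s+1}$ predicts.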
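Solution 1(c): N-grams
c) If the bigram distribution of $C_1$ has to follow Zipf’s law, then bigram-probability × bigram-rank = constant (say $k'$). We know that $p_{ij} \approx 1/(ij \ln^2 V)$. Therefore, the first few bigram probabilities in order of rank are $p_{1,1}, p_{1,2}, p_{2,1}, p_{3,1}, p_{1,3}, p_{4,1}, \ldots$, and
$k' = p_{1,1} \times 1 = 1/\ln^2 V$
But then Zipf’s law would require the value $k'/r$ at rank r, whereas
$p_{2,1} = 1/(2 \ln^2 V) \ne 1/(3 \ln^2 V)$ (rank 3)
$p_{3,1} = 1/(3 \ln^2 V) \ne 1/(4 \ln^2 V)$ (rank 4)
$p_{1,3} = 1/(3 \ln^2 V) \ne 1/(5 \ln^2 V)$ (rank 5)
Thus, it does not follow Zipf’s law (nor even Mandelbrot’s law).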
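The same check can be carried out mechanically: build all $V^2$ values $1/(ij \ln^2 V)$, sort them by decreasing probability, and see whether probability × rank stays constant. The sketch below assumes V = 100 and breaks ties between equal probabilities by (i, j), which is an arbitrary choice; the function name is hypothetical.

```python
import math


def ranked_bigram_products(V):
    """Return (rank, i, j, p_ij, rank * p_ij) sorted by decreasing p_ij."""
    norm = math.log(V) ** 2
    bigrams = [(1.0 / (i * j * norm), i, j)
               for i in range(1, V + 1) for j in range(1, V + 1)]
    # Decreasing probability; equal probabilities are ordered by (i, j).
    bigrams.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, i, j, p, r * p) for r, (p, i, j) in enumerate(bigrams, start=1)]


rows = ranked_bigram_products(V=100)
for rank, i, j, p, product in rows[:6]:
    print(rank, (i, j), round(p, 5), round(product, 5))
```

Under Zipf’s law the last column would be constant. Here the product at rank 3 is already 1.5 times the product at rank 1, whatever the value of V.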
Solution 1(d): N-grams
d) It follows from (c) that the bigram distribution of $C_1$ does not follow Zipf’s law, whereas that of C does. Therefore, the bigram characteristics of the two distributions must be different. We know that for $C_1$, $p_{ij} \approx 1/(ij \ln^2 V)$. However, just as in (a), we can estimate the bigram distribution of C from the Zipfian assumption. There are $V^2$ possible bigrams, and hence $V^2$ probabilities. Therefore, we can assume that
$b_r = 1/(r \ln V^2) = 1/(2r \ln V)$ [$b_r$ is the probability of the r-th bigram]
But this estimate may be quite erroneous. Why?
Solution 1(e): N-grams
e) Hint: Assume Zipf’s law for n-grams. Estimate the (n+1)-gram probabilities from the n-grams (product of two n-gram probabilities). Now show that the (n+1)-grams do not follow Zipf’s law.
Try to prove the following (more general) results:
• Mandelbrot’s law, a generalization of Zipf’s law, says frequency × $(\text{rank} + \rho)^{\alpha}$ = constant. Prove (c), (d) and (e) when the distribution follows Mandelbrot’s law rather than Zipf’s law.
• For any finite-length corpus (i.e. when N is finite), we cannot have n-gram distributions that follow Mandelbrot’s law perfectly.
Solution 2(a): Problematic AND! PARSE 1
Solution 2(a): Problematic AND! PARSE 2
Solution 2(b): Problematic AND! PARSE 1
Solution 2(b): Problematic AND! PARSE 2
Solution 2(c): Problematic AND!
• Verb sub-categorization: The verbs liked and said belong to subcategories 1 and 2 respectively, where
  • VP → V NP [for V in 1]
  • VP → V S [for V in 2]
• POS category augmentation: Break CNJ into two categories, CNJP and CNJC, for phrasal and clausal conjunctions respectively. The grammar G is augmented as:
Solution 2(c): Problematic AND! • The new G for English.
Solution 2(c): Problematic AND! Parsing using the new grammar
Solution 2(c): Problematic AND! Parsing using the new grammar
Solution 2(b): Problematic AND! Cannot parse otherwise
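The verb sub-categorization idea from Solution 2(c) can be pictured as a lexicon lookup that decides which VP rule a verb licenses. The sketch below is a hypothetical Python illustration: the dictionary names and the function `vp_expansion` are assumptions made for this example, and only the two verbs and the two VP rules named in the solution are covered.

```python
# Hypothetical lexicon; the subcategory numbers follow Solution 2(c):
# 1 = verb takes an NP object, 2 = verb takes a clausal complement S.
VERB_SUBCATEGORY = {"liked": 1, "said": 2}

# Right-hand side of the VP rule licensed by each subcategory.
VP_RULES = {1: ("V", "NP"), 2: ("V", "S")}


def vp_expansion(verb):
    """Return the right-hand side of the VP rule this verb is allowed to use."""
    return VP_RULES[VERB_SUBCATEGORY[verb]]


print("liked:", vp_expansion("liked"))
print("said:", vp_expansion("said"))
```

A parser that consults such a lookup would reject a VP built from said followed by a bare NP, since said is only licensed with a clausal complement.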
Solution (3ab): Geo-Morph • Derivational and Linear • Irregulars are shown in red, affixes: n, ese
Solution (3cd): Geo-Morph
• Based on the endings of the roots we might try to classify them into 4 paradigms [C: consonant − y; V: vowel + y]:
  • CVa and [V/a]CC* take n
  • Ca and aC take ese
• The rewrite rules:
  • n → ian / C ^ _ $ (Egypt^n → Egyptian)
  • a → Φ / C _ ^ ese (China^ese → Chinese, etc.)
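The two rewrite rules can be tried out as regular-expression substitutions over forms written as root^affix. This is only a sketch: the function name, the graphemic consonant class (all consonant letters except y, following the convention stated in the solution), and the extra example India^n are assumptions made for illustration.

```python
import re

# Assumed consonant class: lowercase consonant letters, with y excluded.
CONSONANT = "[b-df-hj-np-tv-xz]"


def apply_rewrite_rules(form):
    """Apply the two orthographic rules to a 'root^affix' string."""
    # Rule 1: n -> ian / C ^ _ $   (affix n right after a consonant-final root)
    form = re.sub(rf"({CONSONANT})\^n$", r"\1^ian", form)
    # Rule 2: a -> (empty) / C _ ^ ese   (root-final a after a consonant is dropped)
    form = re.sub(rf"({CONSONANT})a\^ese", r"\1^ese", form)
    # Finally erase the morpheme boundary marker.
    return form.replace("^", "")


for form in ["Egypt^n", "China^ese", "India^n"]:
    print(form, "->", apply_rewrite_rules(form))
```

Forms that match neither context, such as India^n, pass through unchanged apart from losing the boundary marker.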
A Problem to Ponder • Try to design a complete set of morphological rules for English Geo-Morph • How many affixes, paradigms and exceptions do you expect? • Is it possible to classify the Geo-roots based solely on the graphemic/phonemic forms?