Learning shallow languages • Gold paradigm • The case for universal grammar • An interesting challenge: against UG • Characteristic sample • The universal distribution • Shallowness • Natural languages are shallow • Cluster algorithms (limited treatment) • Conclusion
The concept of shallowness in Grammar Induction • Pieter Adriaans • Universiteit van Amsterdam • Syllogic B.V.
What I am not going to talk about • Cluster algorithms (in depth) • Characteristic context • Characteristic expression • MDL: Incremental construction: topology of compression space • Generalized counting machines • Blue Noise
ML Applications • Captains • Chat • Adaptive System Management • Robosail • Composer tool • Joint Strike Fighter
Game theory • The challenger selects a language • The challenger presents an enumeration of that language • The learner produces an infinite sequence of guesses, one after each example • If there is a winning strategy for the learner, the language is learnable
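A minimal sketch of this game for the class of finite languages, where the learner does have a winning strategy: it conjectures the set of strings seen so far and stabilises once the whole language has appeared in the enumeration (the target language and presentation below are illustrative).

```python
# Sketch of Gold's identification-in-the-limit game for the class of
# *finite* languages: the learner guesses the set of strings seen so far,
# and once the whole (finite) target has appeared its guess never changes.

def learner(presentation):
    """Yield a conjecture (a frozenset of strings) after each example."""
    seen = set()
    for example in presentation:
        seen.add(example)
        yield frozenset(seen)          # current conjecture

if __name__ == "__main__":
    target = {"a", "ab", "abb"}        # challenger picks a finite language
    presentation = ["ab", "a", "ab", "abb", "a", "abb"]  # every string appears
    guesses = list(learner(presentation))
    print(guesses[-1] == frozenset(target))   # True: the learner has converged
```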
The case for Universal Grammar • Children do not get negative information when they learn their first language • Natural language is at least context-free • Context-free languages cannot be learned from positive examples • Ergo: children do not really learn natural language, they are born with it
Interesting challenge: against UG • Develop a learning algorithm that under acceptable bias can learn context-free grammars efficiently from positive examples • Efficiently: complexity of the learning algorithm must be polynomial in the length |G| of the grammar to be learned • Acceptable bias: must be reasonable wrt our understanding of natural language: shallow
Approach • PAC learning • Characteristic sample • Universal Distribution • Shallowness • Clustering
PAC Learning (Valiant) • Probably Approximately Correct learning • For all target concepts f ∈ F and all probability distributions P on Σ*, the algorithm A outputs a concept g ∈ F such that, with probability (1 − δ), P(f △ g) ≤ ε • F = concept class, δ = confidence parameter, ε = error parameter, f △ g = (f − g) ∪ (g − f)
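To make the definition concrete, here is a minimal sketch (not from the slides) of PAC learning the concept class of thresholds f = [0, θ] on [0, 1] under the uniform distribution; with sample size m ≥ (1/ε)·ln(1/δ) the hypothesis g satisfies P(f △ g) ≤ ε with probability at least 1 − δ.

```python
# Minimal PAC-learning sketch (illustrative, not the slides' algorithm):
# the concept class F is the set of intervals [0, theta] on [0, 1].
# With m >= (1/eps) * ln(1/delta) labelled samples from the uniform
# distribution, the hypothesis g = [0, max positive example] satisfies
# P(f delta g) <= eps with probability at least 1 - delta.
import math
import random

def pac_learn_threshold(theta, eps, delta, rng=random.Random(0)):
    m = math.ceil((1 / eps) * math.log(1 / delta))   # PAC sample-size bound
    sample = [rng.random() for _ in range(m)]
    positives = [x for x in sample if x <= theta]    # oracle labels via f
    return max(positives, default=0.0)               # hypothesis g = [0, max]

theta = 0.73
g = pac_learn_threshold(theta, eps=0.05, delta=0.01)
print(f"true theta={theta}, learned={g:.3f}, error={theta - g:.3f}")
# the error region (g, theta] has measure <= eps with high probability
```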
1) Characteristic sample • Let Σ be an alphabet and Σ* the set of all strings over Σ • L(G) = S ⊆ Σ* is the language generated by a grammar G • C_G ⊆ S is a characteristic sample for G
1a) Characteristic sample • Gold: infinite number of examples • Here: the number of examples must be polynomial in |G|, i.e. an initial segment s1, s2, s3, …, sk (k ≤ |G|^c) • The notion of characteristic sample is non-trivial • It seems to be grammar dependent: certain constructions must appear in the sample • Different for CFGs and CGs
2) The universal distribution m • The coding theorem (Levin): −log m(x) = −log PU(x) + O(1) = K(x) + O(1) • A distribution is simple if it is dominated by a recursively enumerable distribution • Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x)
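m itself is not computable, but the flavour of the coding theorem can be illustrated with a crude stand-in: approximate K(x) by a compressed length and weight strings by 2^(−K_hat(x)). This is only a heuristic sketch, not the actual universal distribution.

```python
# Heuristic illustration of the coding theorem (NOT the real universal
# distribution m, which is not computable): approximate K(x) by zlib's
# compressed length and weight strings by 2**(-K_hat(x)).  Simple, highly
# compressible strings then dominate the resulting weights.
import zlib

def k_hat(s: str) -> int:
    return len(zlib.compress(s.encode()))        # crude upper bound on K(s)

strings = [
    "a" * 40,                                     # very regular
    "ab" * 20,                                    # regular
    "abbbaababbaaababbbabababbaabbbaaabaabbab",   # irregular-looking
]
weights = {s: 2.0 ** (-k_hat(s)) for s in strings}
total = sum(weights.values())
for s, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{s[:20]:>20}...  K_hat={k_hat(s):2d}  weight={w / total:.3f}")
```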
3a) Cheap trick • A language S = L(G) is shallow if there exists a characteristic sample C_G for S such that ∀s ∈ C_G: K(s) ≤ log |G|
3b) Cheap trick • A characteristic sample for a shallow language L(G) can be drawn under m with high probability in time polynomial in |G| • If S = L(G) is shallow and there exists an algorithm A that constructs a grammar G' with L(G) = L(G') from a characteristic sample C_G in time polynomial in |G|, then S can be learned efficiently if we sample under m
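A small simulation of why sampling finds a characteristic sample quickly (the sizes are illustrative and the sampler is a stand-in for m, which is not computable): for a shallow language every s ∈ C_G has K(s) ≤ log |G|, so by the coding theorem m(s) ≳ 2^(−log |G|) = 1/|G|, and collecting C_G behaves like a coupon-collector problem with roughly |G|·ln|C_G| draws, i.e. polynomial in |G|.

```python
# Coupon-collector simulation (illustrative numbers, not a proof): each
# characteristic string has probability at least ~1/|G| under m (up to a
# constant), so the expected number of draws to see all of C_G is about
# |G| * H_{|C_G|}, which is polynomial in |G|.
import random

G_size = 64                     # |G|
C_G = list(range(20))           # 20 characteristic strings, ids 0..19
rng = random.Random(1)

def draw():
    """Each characteristic string has probability 1/|G|; the remaining
    probability mass goes to other (irrelevant) strings."""
    x = rng.random()
    return int(x * G_size) if x < len(C_G) / G_size else None

trials = []
for _ in range(200):
    seen, draws = set(), 0
    while len(seen) < len(C_G):
        draws += 1
        s = draw()
        if s is not None:
            seen.add(s)
    trials.append(draws)

expected = G_size * sum(1 / i for i in range(1, len(C_G) + 1))
print(f"mean draws: {sum(trials) / len(trials):.0f}  "
      f"(coupon-collector estimate |G|*H_|C_G| ~ {expected:.0f})")
```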
Extensions of PAC learning • PACS learning = PAC learning under simple distributions • PACSS learning(?!) = PACS learning of shallow languages
Shallowness • Seems to be an independent category • There are finite languages that are not shallow (learnable under Gold, but not shallow) • (Diagram: the Chomsky hierarchy — type 0, context-sensitive, context-free, regular — together with the class of shallow languages, which cuts across it.)
Shallowness is very restrictive • (Diagram: a categorial grammar G, a lexicon of <word, type> pairs of total size |G|, characterised by sample sentences of length < log |G|, far smaller than |G|.)
Cheap trick? • Shallowness seems to be pretty universal for living structures • An abundance of small building blocks of low complexity • Language: large lexicons, few simple rules • Claim: natural languages are shallow • Programming languages not necessarily shallow • Human DNA? 3·10^9 nucleotides, basic building block < 32 nucleotides?
Natural language is shallow (I) • Suppose the language has a categorial grammar (CG) • Grammar = lexicon • Lexicon of a native speaker = 50k words • |G| = c·50k • C_G contains 50k simple sentences with K(s) ≤ log |G| • I.e. |s| ≤ log |G| = log(c·50k) ≈ 16 + log c • Maximum length of the description of a sample sentence ≈ 16 + log c
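A toy calculation of this bound (the example sentences are invented, and sentence length in words is used as a crude stand-in for K(s)):

```python
# Toy check of the shallowness bound for natural language (illustrative
# sentences): with a 50k-word lexicon, log2|G| is about 16 (plus log c),
# so sentences in the characteristic sample should stay around 16 "units".
import math

lexicon_size = 50_000
bound = math.log2(lexicon_size)          # ~15.6, i.e. roughly 16

sample = [
    "the cat sleeps",
    "the dog that the cat chased barked",
    "every linguist who studies a language that nobody here has ever "
    "heard of secretly envies fieldworkers",
]
for s in sample:
    length = len(s.split())              # crude proxy for K(s)
    print(f"{length:2d} words  (bound {bound:.1f})  within bound: {length <= bound}")
```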
Natural language is shallow (II) • If we sample 5·10^6 sentences then we have a characteristic sample with confidence 1 − p/e^p (where p = c·5·10^3)
Falsification of shallowness of NL • Suppose a CG • Lexicon entry: <word, type> • There are few types • There is an abundance of words • In an efficient coding most of the coding length is spent on the words, not on the types • Complexity of a shallow grammar ≈ number of words in the language (5·10^4 ≈ 2^16) • Shallowness: we never need more than ca. 16 words to illustrate a grammatical construction
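A back-of-the-envelope check of the coding claim; the average word length and the number of types below are assumed, illustrative figures, not taken from the slides.

```python
# Back-of-the-envelope check (word length and type count are illustrative
# assumptions): in a lexicon of <word, type> entries almost all coding
# length is spent on spelling the words, not on the few types, so the
# complexity of a shallow grammar is dominated by the number of words.
import math

n_words = 50_000                    # lexicon size (from the slides)
n_types = 30                        # assumed number of distinct types
avg_word_len = 8                    # assumed average word length in letters

bits_per_word = avg_word_len * math.log2(26)   # ~37.6 bits to spell a word
bits_per_type = math.log2(n_types)             # ~4.9 bits to name its type

print(f"bits for words: {n_words * bits_per_word:,.0f}")
print(f"bits for types: {n_words * bits_per_type:,.0f}")
print(f"fraction spent on words: "
      f"{bits_per_word / (bits_per_word + bits_per_type):.0%}")
```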
The learning algorithm • One can prove that, using clustering techniques, shallow CFGs and CGs can be learned efficiently from positive examples • General idea: each sample sentence is split into an expression and the context in which it occurs (sentence = context with an expression plugged in)
Clustering rigid languages • (Diagram: a matrix of contexts × expressions in which the entries fall into blocks, one block per type: Type 1, Type 2, Type 3, Type 4.)
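A toy sketch of this clustering step (an illustration of the general idea only, not the actual algorithm from the slides; the example sentences are invented): expressions are grouped by the set of contexts they occur in, and for a rigid language each group plays the role of a type.

```python
# Toy context/expression clustering (illustrative sketch): split every
# sentence s into (context, expression) pairs s = left + expr + right,
# then cluster expressions that share exactly the same set of contexts.
from collections import defaultdict

sentences = [
    "john likes mary",
    "john likes wine",
    "peter likes mary",
    "peter likes wine",
]

contexts_of = defaultdict(set)            # expression -> set of contexts
for s in sentences:
    words = s.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            expr = " ".join(words[i:j])
            ctx = (" ".join(words[:i]), " ".join(words[j:]))   # (left, right)
            contexts_of[expr].add(ctx)

clusters = defaultdict(list)              # identical context sets -> one cluster
for expr, ctxs in contexts_of.items():
    clusters[frozenset(ctxs)].append(expr)

for n, cluster in enumerate(sorted(clusters.values(), key=len, reverse=True), 1):
    print(f"type {n}: {cluster}")
# e.g. {john, peter} and {mary, wine} come out as clusters: candidate types.
```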
Conclusion • Shallowness is an interesting general notion • Seems to be typical for living systems and their artefacts (Language, writing, DNA). • Natural language seems to be shallow • Shallow CG’s and CFG’s can be learned efficiently sampling under m using clustering techniques • The argument for UG based on Gold’s results is unconvincing
Zipf-Mandelbrot law • Distribution of word frequencies • Zipf's version: ρ ∝ 1/P, where ρ is the rank of a word with probability P • Mandelbrot's version: P = P0(ρ + V)^(−1/D) • D and V are independent parameters; P0 is added to create a probability distribution • When D < 1 it is a fractal dimension • D ≥ 1: the dictionary contains a finite number of words • D is a measure of the richness of the vocabulary
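For concreteness, a minimal numerical sketch of Mandelbrot's formula; the parameter values P0, V and D below are illustrative assumptions, not taken from the slides.

```python
# Numerical illustration of the Zipf-Mandelbrot law (parameter values are
# illustrative).  P(rho) = P0 * (rho + V) ** (-1/D); Zipf's original
# version corresponds to the special case V = 0, D = 1.
def zipf_mandelbrot(rho, P0=0.08, V=2.7, D=0.9):
    return P0 * (rho + V) ** (-1.0 / D)

for rho in (1, 2, 5, 10, 100, 1000):
    print(f"rank {rho:4d}:  P = {zipf_mandelbrot(rho):.6f}")

# With D < 1 the tail decays like rho**(-1/D) with 1/D > 1, so the
# probabilities of an infinite vocabulary can still sum to a finite total.
```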
Mandelbrot’s derivation 1 • Lexicographical tree • N symbols • 1 delimiter (empty word) • Estimate probability of branch of length k:P = (1-Nr)rk for r << 1/N 0 k
Mandelbrot’s derivation 2 0 • Substitute: D= logN/log(1/r) V=1/(N-1) k=log(P/P0)/logr • We get: (P-D P0D)-1 < /V N (P-D P0D)-1 • Which yields: P = P0(+V)-1/D k 1+N+ N2+…+Nk-1 = (Nk-1)/(N-1) < < (Nk+1-1)/(N-1)
Mandelbrot’s derivation • This basically a counting argument • Only works if the lexicographical tree is regular: I.e. almost all slots are filled (or a certain fraction of the slots in a regular way) • Relation with number systems • Lexicographical tree is a Cantor space
Observation: finite shallow languages coincide with regular finite lexicographical trees • ∀s ∈ C_G: K(s) ≤ log |G| • |G| ≈ |C_G| ≈ surface of the tree • (Diagram: two lexicographical trees of depth 0…k, one non-shallow and one shallow, each with a sample string s.)
Principle of least action • Whenever any change occurs in nature, the quantity of action employed for it is always the smallest possible (Maupertuis, 1698-1759) • Application to coding schemes: only introduce new rules if the discriminative power of the old ones is exhausted • Positional number systems • Car license plates, phone numbers • Grammars?