
  1. World class IT in a world-wide market

  2. Learning shallow languages • Gold paradigm • The case for universal grammar • An interesting challenge: against UG • Characteristic sample • The universal distribution • Shallowness • Natural languages are shallow • Cluster algorithms (limited treatment) • Conclusion

  3. The concept of shallowness in Grammar Induction • Pieter Adriaans • Universiteit van Amsterdam / Syllogic B.V.

  4. What I am not going to talk about • Cluster algorithms (in depth) • Characteristic context • Characteristic expression • MDL: Incremental construction: topology of compression space • Generalized counting machines • Blue Noise

  5. ML Applications • Captains • Chat • Adaptive System Management • Robosail • Composer tool • Joint Strike Fighter

  6. The Gold paradigm

  7. Game theory • Challenger selects language • Presents enumeration of language • Learner produces an infinite number of guesses • If there is a winning strategy for the learner then the language is learnable
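To make the game concrete, here is a minimal Python sketch (my own illustration, not part of the talk): the learner that always guesses "the set of strings seen so far" has a winning strategy for the class of finite languages, because its guesses stabilise on the target after finitely many presentations.

    # Minimal sketch of Gold-style identification in the limit (illustrative only).
    def learner(examples_so_far):
        """Guess a grammar: here simply the finite set of observed strings."""
        return frozenset(examples_so_far)

    def present(language, repetitions=3):
        """A finite prefix of the challenger's enumeration of the language."""
        for _ in range(repetitions):
            for s in sorted(language):
                yield s

    target = {"ab", "aabb", "aaabbb"}      # the challenger's secret (finite) language
    seen, guesses = [], []
    for s in present(target):
        seen.append(s)
        guesses.append(learner(seen))

    # After finitely many examples the guess stops changing and equals the target.
    assert guesses[-1] == frozenset(target)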

  8. Learnability results (Gold)

  9. The case for Universal Grammar • Children do not get negative information when they learn their first language • Natural language is at least context-free • Context-free languages cannot be learned from positive examples • Ergo: children do not really learn natural language, they are born with it

  10. Interesting challenge: against UG • Develop a learning algorithm that, under an acceptable bias, can learn context-free grammars efficiently from positive examples • Efficiently: the complexity of the learning algorithm must be polynomial in the length |G| of the grammar to be learned • Acceptable bias: must be reasonable with respect to our understanding of natural language: shallow

  11. Approach • PAC learning • Characteristic sample • Universal Distribution • Shallowness • Clustering

  12. PAC Learning (Valiant) • Probably Approximately Correct learning • For all target concepts f ∈ F and all probability distributions P on Σ*, the algorithm A outputs a concept g ∈ F such that, with probability at least 1-δ, P(f Δ g) ≤ ε • F = concept class, δ = confidence parameter, ε = error parameter, f Δ g = (f-g) ∪ (g-f)
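A small, hedged illustration of the PAC criterion (my own toy concept class, not the speaker's setting): learn a threshold concept f(x) = [x ≤ a] from samples drawn under P and check that the disagreement mass P(f Δ g) stays below ε in most trials.

    import random

    def pac_trial(a=0.7, n_samples=1000, epsilon=0.05):
        """One trial: sample under P (uniform on [0,1]), learn g, test P(f Δ g) <= epsilon."""
        sample = [random.random() for _ in range(n_samples)]
        positives = [x for x in sample if x <= a]        # labelled by the target f
        a_hat = max(positives, default=0.0)              # the learner's guess g
        # f and g disagree exactly on the interval (a_hat, a], whose mass is a - a_hat.
        return (a - a_hat) <= epsilon

    # With 1000 samples nearly every trial meets the bound, i.e. probability >= 1 - delta.
    print(sum(pac_trial() for _ in range(100)), "of 100 trials within epsilon")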

  13. 1) Characteristic sample • Let Σ be an alphabet and Σ* the set of all strings over Σ • L(G) = S ⊆ Σ* is the language generated by a grammar G • C_G ⊆ S is a characteristic sample for G

  14. 1a) Characteristic sample • Gold: infinite number of examples • Here: the number of examples must be polynomial in |G|, i.e. an initial segment s1, s2, s3, …, sk with k ≤ |G|^c • The notion of a characteristic sample is non-trivial • It seems to be grammar dependent: certain constructions must appear in the sample • Different for CFGs and CGs
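A toy example of what a characteristic sample can look like (my own illustration; the notion used in the talk is grammar dependent): for the grammar S → aSb | ab, the two sentences below already exercise every production, and the sample size is clearly polynomial in |G|.

    # Toy grammar S -> a S b | a b and a candidate characteristic sample.
    toy_grammar = {"S": [["a", "S", "b"], ["a", "b"]]}
    characteristic_sample = ["ab", "aabb"]

    def generate(grammar, max_depth=3):
        """Enumerate the short sentences of the grammar, breadth-first."""
        out, frontier = set(), [["S"]]
        for _ in range(max_depth):
            new_frontier = []
            for form in frontier:
                if all(sym not in grammar for sym in form):
                    out.add("".join(form))               # fully terminal: a sentence
                    continue
                i = next(j for j, sym in enumerate(form) if sym in grammar)
                for rhs in grammar[form[i]]:             # expand the leftmost nonterminal
                    new_frontier.append(form[:i] + rhs + form[i + 1:])
            frontier = new_frontier
        return out

    assert set(characteristic_sample) <= generate(toy_grammar)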

  15. 2) The universal distribution m • The coding theorem (Levin): -log m(x) = -log P_U(x) + O(1) = K(x) + O(1) • A distribution is simple if it is dominated by a recursively enumerable distribution • Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x)
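K(x) is not computable, but a real compressor gives an upper bound on it, so a crude, hedged stand-in for m(x) ≈ 2^(-K(x)) can be obtained from compressed lengths. The sketch below is my own illustration; only relative comparisons between strings are meaningful here.

    import random
    import zlib

    def approx_K_bits(x: bytes) -> int:
        """Upper bound on K(x) in bits via zlib (up to an additive constant)."""
        return 8 * len(zlib.compress(x, 9))

    def approx_m(x: bytes) -> float:
        """Crude proxy for the universal distribution m(x) ~ 2**(-K(x))."""
        return 2.0 ** (-approx_K_bits(x))

    regular = b"ab" * 500                                      # highly regular string
    noisy = bytes(random.randrange(256) for _ in range(1000))  # essentially incompressible
    print(approx_K_bits(regular), approx_K_bits(noisy))        # low vs. high complexity
    print(approx_m(regular) > approx_m(noisy))                 # True: simple strings get more mass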

  16. 3a) Cheap trick • A language S = L(G) is shallow if there exists a characteristic sample C_G for S such that ∀s ∈ C_G: K(s) ≤ log |G|
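A naive check in the spirit of this definition (my own sketch): use the plain length of a sentence as a cheap stand-in for K(s) and compare it with log2 |G|; a serious test would need a better complexity estimate than |s|.

    import math

    def looks_shallow(sample, grammar_size):
        """Heuristic check: every sample sentence fits within log2|G| symbols."""
        bound = math.log2(grammar_size)
        return all(len(s) <= bound for s in sample)   # len(s) used as a rough proxy for K(s)

    print(looks_shallow(["ab", "aabb"], grammar_size=2**16))   # True: short sentences
    print(looks_shallow(["a" * 100], grammar_size=2**16))      # False: sentence far too long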

  17. 3b) Cheap trick • A characteristic sample for a shallow language L(G) can be taken under m, with high probability, in time polynomial in |G| • If S = L(G) is shallow and there exists an algorithm A that constructs a grammar G' from a characteristic sample C_G such that L(G) = L(G') in time polynomial in |G|, then S can be learned efficiently if we sample under m

  18. Extensions of PAC learning • PACS learning = PAC learning under simple distributions • PACSS learning(?!) = PACS learning of shallow languages

  19. Shallowness • Seems to be an independent category • There are finite languages that are not shallow (learnable under Gold, but not shallow) • [Diagram: Chomsky hierarchy, type-0 ⊃ context-sensitive ⊃ context-free ⊃ regular, with the shallow languages cutting across it]

  20. Shallowness is very restrictive • [Diagram: a categorial grammar G as a lexicon of <word, type> pairs of size < |G|, with characteristic sentences of length < log |G|]

  21. Cheap trick? • Shallowness seems to be pretty universal for living structures • An abundance of small building blocks of low complexity • Language: large lexicons, few simple rules • Claim: natural languages are shallow • Programming languages are not necessarily shallow • Human DNA? 3·10^9 nucleotides, basic building block < 32 nucleotides?

  22. Natural language is shallow (I) • Suppose the language has a categorial grammar (CG) • Grammar = lexicon • Lexicon of a native speaker ≈ 50k words • |G| = c·50k • C_G contains 50k simple sentences with K(s) ≤ log |G| • I.e. |s| ≤ log |G| = log(c·50k) ≈ 16 + log c • Maximum length of the description of a sample sentence ≈ 16 + log c

  23. Natural language is shallow (II) • If we sample 5·10^6 sentences then we have a characteristic sample with confidence 1 - p/e^p (where p = c·5·10^3)
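A quick worked check of the arithmetic on the last two slides (the constant c is left unspecified in the talk; c = 1 below is my assumption, purely for illustration).

    import math

    c = 1
    lexicon = 50_000                              # words known to a native speaker
    print(f"log2(c * 50k) = {math.log2(c * lexicon):.1f} bits")   # about 16 (+ log2 c)

    p = c * 5_000                                 # the slide's p = c * 5*10^3
    confidence = 1 - p * math.exp(-p)             # 1 - p/e^p, essentially 1 for large p
    print(f"confidence >= {confidence}")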

  24. Falsification of shallowness of NL • Suppose a CG • Lexicon entry: <word, type> • There are few types • There is an abundance of words • In an efficient coding most of the coding length is spent on the words, not on the types • Complexity of the shallow grammar ≈ number of words in the language (≈ 5·10^4 ≈ 2^16) • Shallowness: we never need more than ca. 16 words to illustrate a grammatical construction

  25. The learning algorithm • One can prove that, using clustering techniques, shallow CFGs and CGs can be learned efficiently from positive examples • General idea: a sentence αβγ decomposes into an expression β and a context α_γ (the sentence with a hole in place of the expression)

  26. Clustering rigid languages • [Diagram: a table of expressions against contexts; expressions that share contexts are clustered into types (Type 1 – Type 4)]
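A minimal sketch of the clustering idea (in the spirit of the slides, not the speaker's actual algorithm): split each sentence into an expression and a context with a hole, then group expressions that occur in exactly the same set of contexts into one type.

    from collections import defaultdict

    sentences = ["john walks", "mary walks", "john talks", "mary talks",
                 "john sees mary", "mary sees john", "john sees john", "mary sees mary"]

    # Every split s = left + expression + right yields (context-with-hole, expression).
    contexts_of = defaultdict(set)
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            for j in range(i + 1, len(words) + 1):
                expr = " ".join(words[i:j])
                ctx = " ".join(words[:i] + ["_"] + words[j:])
                contexts_of[expr].add(ctx)

    # Cluster: expressions with identical context sets are assigned the same type.
    types = defaultdict(list)
    for expr, ctxs in contexts_of.items():
        types[frozenset(ctxs)].append(expr)

    for members in sorted(types.values(), key=len, reverse=True):
        if len(members) > 1:
            print(members)    # e.g. ['john', 'mary'] and ['walks', 'talks', ...] come out as types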

  27. Conclusion • Shallowness is an interesting general notion • It seems to be typical for living systems and their artefacts (language, writing, DNA) • Natural language seems to be shallow • Shallow CGs and CFGs can be learned efficiently by sampling under m, using clustering techniques • The argument for UG based on Gold's results is unconvincing

  28. Zipf-Mandelbrot Law • Distribution of word frequencies • Zipf's version: ρ ∝ 1/P, where ρ is the rank of a word with probability P • Mandelbrot's version: P = P0(ρ+V)^(-1/D) • D and V are independent parameters; P0 is added to create a probability distribution • When D < 1 it is a fractal dimension • D ≥ 1: the dictionary contains a finite number of words • D is a measure of the richness of the vocabulary
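A small numeric illustration of Mandelbrot's formula (the parameter values below are mine, chosen only for illustration): choosing P0 by normalisation over a finite vocabulary makes P = P0(ρ+V)^(-1/D) a genuine probability distribution; with D = 1 and V = 0 it reduces to Zipf's 1/rank behaviour.

    D, V, vocab_size = 0.8, 2.0, 50_000      # assumed values, not from the talk

    weights = [(rho + V) ** (-1 / D) for rho in range(1, vocab_size + 1)]
    P0 = 1 / sum(weights)                    # normalisation constant
    P = [P0 * w for w in weights]

    print(f"P0 = {P0:.4g}, P(rank 1) = {P[0]:.4g}, P(rank 1000) = {P[999]:.4g}")
    print(f"total probability = {sum(P):.6f}")   # 1.000000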

  29. Mandelbrot's derivation 1 • Lexicographical tree • N symbols • 1 delimiter (empty word) • Estimate the probability of a branch of length k: P = (1-Nr)·r^k, for r << 1/N

  30. Mandelbrot's derivation 2 • Rank bounds from the tree: 1 + N + N^2 + … + N^(k-1) = (N^k - 1)/(N-1) < ρ < (N^(k+1) - 1)/(N-1) • Substitute: D = log N / log(1/r), V = 1/(N-1), k = log(P/P0)/log r • We get: P^(-D)·P0^D - 1 < ρ/V < N·P^(-D)·P0^D - 1 • Which yields: P = P0(ρ+V)^(-1/D)
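A numeric sanity check of the derivation (my own, with toy parameters): enumerate the branches of a lexicographical tree with N symbols of probability r each, rank them by probability, and verify that every rank satisfies the two-sided bound from the slide, so that the tree model and P0(ρ+V)^(-1/D) agree up to a factor of at most N.

    import math

    N, r, max_len = 4, 0.2, 8                    # toy parameters (mine)
    P0 = 1 - N * r                               # probability of the delimiter
    D = math.log(N) / math.log(1 / r)
    V = 1 / (N - 1)

    # All N**k branches of length k have probability P0 * r**k; list them by rank.
    ranked = []
    for k in range(max_len + 1):
        ranked += [P0 * r ** k] * (N ** k)

    for rho in (1, 10, 100, 1000):
        P = ranked[rho - 1]
        lo = P ** (-D) * P0 ** D - 1             # lower bound on rho/V from the slide
        hi = N * P ** (-D) * P0 ** D - 1         # upper bound on rho/V from the slide
        assert lo - 1e-9 <= rho / V <= hi + 1e-9
        print(f"rank {rho:5d}: tree model {P:.2e}   P0*(rho+V)**(-1/D) = {P0 * (rho + V) ** (-1 / D):.2e}")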

  31. Mandelbrot's derivation • This is basically a counting argument • It only works if the lexicographical tree is regular, i.e. almost all slots are filled (or a certain fraction of the slots, in a regular way) • Relation with number systems • The lexicographical tree is a Cantor space

  32. Observation: finite shallow languages coincide with regular finite lexicographical trees • ∀s ∈ C_G: K(s) ≤ log |G| • |G| = |C_G| ≈ surface of the tree • [Diagram: a non-shallow language as a sparse tree with one long branch s, versus a shallow language as a densely filled tree of depth k]

  33. Principle of least action • Whenever any change occurs in nature, the quantity of action employed for it is always the smallest possible (Maupertuis, 1698-1759) • Application to coding schemes: only introduce new rules if the discriminative power of the old ones is exhausted • Positional notation in number systems • License plates of cars, phone numbers • Grammars?
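A toy illustration of the coding principle named on this slide (my own example): in a positional number system a new position, i.e. a new "rule", is introduced only once every combination of the existing positions has been used up.

    def to_base(n, base=10):
        """Write n in positional notation with the given base."""
        digits = []
        while True:
            n, d = divmod(n, base)
            digits.append(str(d))
            if n == 0:
                break
        return "".join(reversed(digits))

    for n in (9, 10, 99, 100):
        print(n, "->", to_base(n), f"({len(to_base(n))} positions)")
    # A new position appears exactly when the old ones are exhausted: at 10, 100, ...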
