Learning shallow languages • Gold paradigm • The case for universal grammar • An interesting challenge: against UG • Characteristic sample • The universal distribution • Shallowness • Natural languages are shallow • Cluster algorithms (limited treatment) • Conclusion
The concept of shallowness in Grammar Induction • Pieter Adriaans • Universiteit van Amsterdam • Syllogic B.V.
What I am not going to talk about • Cluster algorithms (in depth) • Characteristic context • Characteristic expression • MDL: Incremental construction: topology of compression space • Generalized counting machines • Blue Noise
ML Applications • Captains • Chat • Adaptive System Management • Robosail • Composer tool • Joint Strike Fighter
Game theory • The challenger selects a language • The challenger presents an enumeration of that language • The learner produces an infinite sequence of guesses, one after each example • If there is a winning strategy for the learner, the language is learnable
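A minimal sketch of this game for the class of finite languages, where the learner does have a winning strategy: it conjectures the set of strings seen so far and stabilises once the whole language has appeared in the enumeration (the target language and presentation below are illustrative).

```python
# Sketch of Gold's identification-in-the-limit game for the class of
# *finite* languages: the learner guesses the set of strings seen so far,
# and once the whole (finite) target has appeared its guess never changes.

def learner(presentation):
    """Yield a conjecture (a frozenset of strings) after each example."""
    seen = set()
    for example in presentation:
        seen.add(example)
        yield frozenset(seen)          # current conjecture

if __name__ == "__main__":
    target = {"a", "ab", "abb"}        # challenger picks a finite language
    presentation = ["ab", "a", "ab", "abb", "a", "abb"]  # every string appears
    guesses = list(learner(presentation))
    print(guesses[-1] == frozenset(target))   # True: the learner has converged
```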
The case for Universal Grammar • Children do not get negative information when they learn their first language • Natural language is at least context-free • Context-free languages cannot be learned from positive examples • Ergo: children do not really learn natural language, they are born with it
Interesting challenge: against UG • Develop a learning algorithm that under acceptable bias can learn context-free grammars efficiently from positive examples • Efficiently: complexity of the learning algorithm must be polynomial in the length |G| of the grammar to be learned • Acceptable bias: must be reasonable wrt our understanding of natural language: shallow
Approach • PAC learning • Characteristic sample • Universal Distribution • Shallowness • Clustering
PAC Learning (Valiant) • Probably Approximately Correct learning • For all target concepts f ∈ F and all probability distributions P on Σ*, the algorithm A outputs a concept g ∈ F such that, with probability (1 − δ), P(f △ g) ≤ ε • F = concept class, δ = confidence parameter, ε = error parameter, f △ g = (f − g) ∪ (g − f)
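To make the definition concrete, here is a minimal sketch (not from the slides) of PAC learning the concept class of thresholds f = [0, θ] on [0, 1] under the uniform distribution; with sample size m ≥ (1/ε)·ln(1/δ) the hypothesis g satisfies P(f △ g) ≤ ε with probability at least 1 − δ.

```python
# Minimal PAC-learning sketch (illustrative, not the slides' algorithm):
# the concept class F is the set of intervals [0, theta] on [0, 1].
# With m >= (1/eps) * ln(1/delta) labelled samples from the uniform
# distribution, the hypothesis g = [0, max positive example] satisfies
# P(f delta g) <= eps with probability at least 1 - delta.
import math
import random

def pac_learn_threshold(theta, eps, delta, rng=random.Random(0)):
    m = math.ceil((1 / eps) * math.log(1 / delta))   # PAC sample-size bound
    sample = [rng.random() for _ in range(m)]
    positives = [x for x in sample if x <= theta]    # oracle labels via f
    return max(positives, default=0.0)               # hypothesis g = [0, max]

theta = 0.73
g = pac_learn_threshold(theta, eps=0.05, delta=0.01)
print(f"true theta={theta}, learned={g:.3f}, error={theta - g:.3f}")
# the error region (g, theta] has measure <= eps with high probability
```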
1) Characteristic sample • Let Σ be an alphabet and Σ* the set of all strings over Σ • L(G) = S ⊆ Σ* is the language generated by a grammar G • C_G ⊆ S is a characteristic sample for G
1a) Characteristic sample • Gold: infinite number of examples • Here: the number of examples must be polynomial in |G|, i.e. an initial segment s1, s2, s3, …, sk (k ≤ |G|^c) • The notion of characteristic sample is non-trivial • It seems to be grammar dependent: certain constructions must appear in the sample • Different for CFGs and CGs
2) The universal distribution m • The coding theorem (Levin): −log m(x) = −log PU(x) + O(1) = K(x) + O(1) • A distribution is simple if it is dominated by a recursively enumerable distribution • Li & Vitányi: a concept class C is learnable under m(x) iff C is also learnable under any arbitrary simple distribution P(x), provided the samples are taken according to m(x)
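m itself is not computable, but the flavour of the coding theorem can be illustrated with a crude stand-in: approximate K(x) by a compressed length and weight strings by 2^(−K_hat(x)). This is only a heuristic sketch, not the actual universal distribution.

```python
# Heuristic illustration of the coding theorem (NOT the real universal
# distribution m, which is not computable): approximate K(x) by zlib's
# compressed length and weight strings by 2**(-K_hat(x)).  Simple, highly
# compressible strings then dominate the resulting weights.
import zlib

def k_hat(s: str) -> int:
    return len(zlib.compress(s.encode()))        # crude upper bound on K(s)

strings = [
    "a" * 40,                                     # very regular
    "ab" * 20,                                    # regular
    "abbbaababbaaababbbabababbaabbbaaabaabbab",   # irregular-looking
]
weights = {s: 2.0 ** (-k_hat(s)) for s in strings}
total = sum(weights.values())
for s, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{s[:20]:>20}...  K_hat={k_hat(s):2d}  weight={w / total:.3f}")
```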
3a) Cheap trick • A language S = L(G) is shallow if there exists a characteristic sample C_G for S such that ∀s ∈ C_G: K(s) ≤ log |G|
3b) Cheap trick • A characteristic sample for a shallow language L(G) can be drawn under m with high probability in time polynomial in |G| • If S = L(G) is shallow and there exists an algorithm A that constructs a grammar G' with L(G) = L(G') from a characteristic sample C_G in time polynomial in |G|, then S can be learned efficiently if we sample under m
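A small simulation of why sampling finds a characteristic sample quickly (the sizes are illustrative and the sampler is a stand-in for m, which is not computable): for a shallow language every s ∈ C_G has K(s) ≤ log |G|, so by the coding theorem m(s) ≳ 2^(−log |G|) = 1/|G|, and collecting C_G behaves like a coupon-collector problem with roughly |G|·ln|C_G| draws, i.e. polynomial in |G|.

```python
# Coupon-collector simulation (illustrative numbers, not a proof): each
# characteristic string has probability at least ~1/|G| under m (up to a
# constant), so the expected number of draws to see all of C_G is about
# |G| * H_{|C_G|}, which is polynomial in |G|.
import random

G_size = 64                     # |G|
C_G = list(range(20))           # 20 characteristic strings, ids 0..19
rng = random.Random(1)

def draw():
    """Each characteristic string has probability 1/|G|; the remaining
    probability mass goes to other (irrelevant) strings."""
    x = rng.random()
    return int(x * G_size) if x < len(C_G) / G_size else None

trials = []
for _ in range(200):
    seen, draws = set(), 0
    while len(seen) < len(C_G):
        draws += 1
        s = draw()
        if s is not None:
            seen.add(s)
    trials.append(draws)

expected = G_size * sum(1 / i for i in range(1, len(C_G) + 1))
print(f"mean draws: {sum(trials) / len(trials):.0f}  "
      f"(coupon-collector estimate |G|*H_|C_G| ~ {expected:.0f})")
```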
Extensions of PAC learning • PACS learning = PAC learning under simple distributions • PACSS learning(?!) = PACS learning of shallow languages
Shallowness • Seems to be an independent category • There are finite languages that are not shallow (learnable under Gold, but not shallow) • (Diagram: the Chomsky hierarchy — type 0, context-sensitive, context-free, regular — together with the class of shallow languages, which cuts across it.)
Shallowness is very restrictive • (Diagram: a categorial grammar G, a lexicon of <word, type> pairs of total size |G|, characterised by sample sentences of length < log |G|, far smaller than |G|.)
Cheap trick? • Shallowness seems to be pretty universal for living structures • An abundance of small building blocks of low complexity • Language: large lexicons, few simple rules • Claim: natural languages are shallow • Programming languages not necessarily shallow • Human DNA? 3·10^9 nucleotides, basic building block < 32 nucleotides?
Natural language is shallow (I) • Suppose the language has a categorial grammar (CG) • Grammar = lexicon • Lexicon of a native speaker = 50k words • |G| = c·50k • C_G contains 50k simple sentences with K(s) ≤ log |G| • I.e. |s| ≤ log |G| = log(c·50k) ≈ 16 + log c • Maximum length of the description of a sample sentence ≈ 16 + log c
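A toy calculation of this bound (the example sentences are invented, and sentence length in words is used as a crude stand-in for K(s)):

```python
# Toy check of the shallowness bound for natural language (illustrative
# sentences): with a 50k-word lexicon, log2|G| is about 16 (plus log c),
# so sentences in the characteristic sample should stay around 16 "units".
import math

lexicon_size = 50_000
bound = math.log2(lexicon_size)          # ~15.6, i.e. roughly 16

sample = [
    "the cat sleeps",
    "the dog that the cat chased barked",
    "every linguist who studies a language that nobody here has ever "
    "heard of secretly envies fieldworkers",
]
for s in sample:
    length = len(s.split())              # crude proxy for K(s)
    print(f"{length:2d} words  (bound {bound:.1f})  within bound: {length <= bound}")
```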
Natural language is shallow (II) • If we sample 5·10^6 sentences then we have a characteristic sample with confidence 1 − p/e^p (where p = c·5·10^3)
Falsification of shallowness of NL • Suppose a CG • Lexicon entry: <word, type> • There are few types • There is an abundance of words • In an efficient coding most of the coding length is spent on the words, not on the types • Complexity of a shallow grammar ≈ number of words in the language (5·10^4 ≈ 2^16) • Shallowness: we never need more than ca. 16 words to illustrate a grammatical construction
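A back-of-the-envelope check of the coding claim; the average word length and the number of types below are assumed, illustrative figures, not taken from the slides.

```python
# Back-of-the-envelope check (word length and type count are illustrative
# assumptions): in a lexicon of <word, type> entries almost all coding
# length is spent on spelling the words, not on the few types, so the
# complexity of a shallow grammar is dominated by the number of words.
import math

n_words = 50_000                    # lexicon size (from the slides)
n_types = 30                        # assumed number of distinct types
avg_word_len = 8                    # assumed average word length in letters

bits_per_word = avg_word_len * math.log2(26)   # ~37.6 bits to spell a word
bits_per_type = math.log2(n_types)             # ~4.9 bits to name its type

print(f"bits for words: {n_words * bits_per_word:,.0f}")
print(f"bits for types: {n_words * bits_per_type:,.0f}")
print(f"fraction spent on words: "
      f"{bits_per_word / (bits_per_word + bits_per_type):.0%}")
```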
The learning algorithm • One can prove that, using clustering techniques, shallow CFGs and CGs can be learned efficiently from positive examples • General idea: each sample sentence is split into an expression and the context in which it occurs (sentence = context with an expression plugged in)
Clustering rigid languages • (Diagram: a matrix of contexts × expressions in which the entries fall into blocks, one block per type: Type 1, Type 2, Type 3, Type 4.)
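A toy sketch of this clustering step (an illustration of the general idea only, not the actual algorithm from the slides; the example sentences are invented): expressions are grouped by the set of contexts they occur in, and for a rigid language each group plays the role of a type.

```python
# Toy context/expression clustering (illustrative sketch): split every
# sentence s into (context, expression) pairs s = left + expr + right,
# then cluster expressions that share exactly the same set of contexts.
from collections import defaultdict

sentences = [
    "john likes mary",
    "john likes wine",
    "peter likes mary",
    "peter likes wine",
]

contexts_of = defaultdict(set)            # expression -> set of contexts
for s in sentences:
    words = s.split()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            expr = " ".join(words[i:j])
            ctx = (" ".join(words[:i]), " ".join(words[j:]))   # (left, right)
            contexts_of[expr].add(ctx)

clusters = defaultdict(list)              # identical context sets -> one cluster
for expr, ctxs in contexts_of.items():
    clusters[frozenset(ctxs)].append(expr)

for n, cluster in enumerate(sorted(clusters.values(), key=len, reverse=True), 1):
    print(f"type {n}: {cluster}")
# e.g. {john, peter} and {mary, wine} come out as clusters: candidate types.
```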
Conclusion • Shallowness is an interesting general notion • Seems to be typical for living systems and their artefacts (Language, writing, DNA). • Natural language seems to be shallow • Shallow CG’s and CFG’s can be learned efficiently sampling under m using clustering techniques • The argument for UG based on Gold’s results is unconvincing
Zipf-Mandelbrot law • Distribution of word frequencies • Zipf's version: ρ ∝ 1/P, where ρ is the rank of a word with probability P • Mandelbrot's version: P = P0(ρ + V)^(−1/D) • D and V are independent parameters; P0 is added to create a probability distribution • When D < 1 it is a fractal dimension • D ≥ 1: the dictionary contains a finite number of words • D is a measure of the richness of the vocabulary
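For concreteness, a minimal numerical sketch of Mandelbrot's formula; the parameter values P0, V and D below are illustrative assumptions, not taken from the slides.

```python
# Numerical illustration of the Zipf-Mandelbrot law (parameter values are
# illustrative).  P(rho) = P0 * (rho + V) ** (-1/D); Zipf's original
# version corresponds to the special case V = 0, D = 1.
def zipf_mandelbrot(rho, P0=0.08, V=2.7, D=0.9):
    return P0 * (rho + V) ** (-1.0 / D)

for rho in (1, 2, 5, 10, 100, 1000):
    print(f"rank {rho:4d}:  P = {zipf_mandelbrot(rho):.6f}")

# With D < 1 the tail decays like rho**(-1/D) with 1/D > 1, so the
# probabilities of an infinite vocabulary can still sum to a finite total.
```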
Mandelbrot’s derivation 1 • Lexicographical tree • N symbols • 1 delimiter (empty word) • Estimate probability of branch of length k:P = (1-Nr)rk for r << 1/N 0 k
Mandelbrot’s derivation 2 0 • Substitute: D= logN/log(1/r) V=1/(N-1) k=log(P/P0)/logr • We get: (P-D P0D)-1 < /V N (P-D P0D)-1 • Which yields: P = P0(+V)-1/D k 1+N+ N2+…+Nk-1 = (Nk-1)/(N-1) < < (Nk+1-1)/(N-1)
Mandelbrot’s derivation • This basically a counting argument • Only works if the lexicographical tree is regular: I.e. almost all slots are filled (or a certain fraction of the slots in a regular way) • Relation with number systems • Lexicographical tree is a Cantor space
Observation: finite shallow languages coincide with regular finite lexicographical trees • ∀s ∈ C_G: K(s) ≤ log |G| • |G| ≈ |C_G| ≈ surface of the tree • (Diagram: two lexicographical trees of depth 0…k, one non-shallow and one shallow, each with a sample string s.)
Principle of least action • Whenever any change occurs in nature, the quantity of action employed for it is always the smallest possible (Maupertuis, 1698-1759) • Application to coding schemes: only introduce new rules if the discriminative power of the old ones is exhausted • Positional number systems • Car license plates, phone numbers • Grammars?