Grammatical inference Vs Grammar induction. London 21-22 June 2007. Colin de la Higuera. Summary. Why study the algorithms and not the grammars Learning in the exact setting Learning in a probabilistic setting. 1 Why study the process and not the result?.
Grammatical inference Vs Grammar induction London 21-22 June 2007 Colin de la Higuera
Summary • Why study the algorithms and not the grammars • Learning in the exact setting • Learning in a probabilistic setting
1 Why study the process and not the result? • The usual approach in grammatical inference is to build a grammar (automaton) that is small and adapted in some way to the data from which we are supposed to learn.
Grammatical inference • Is about learning a grammar given information about a language.
Grammar induction • Is about learning a grammar given information about a language.
Grammatical inference, grammar induction: difference? • [Diagram relating Data and G]
Motivating* example #1 • Is 17 a random number? • Is 17 more random than 25? • Suppose I had a random number generator: would I convince you by showing how well it does on an example? On various examples? *(and only slightly provocative)
Motivating example #2 • Is 01101101101101010110001111 a random sequence? • What about aaabaaabababaabbba?
Motivating example #3 • Let X be a sample of strings. Is grammar G the correct grammar for sample X? • Or is it G’ ? • Correct meaning something like “the one we should learn”
Back to the definition • Grammar induction and grammatical inference are about finding a/the grammar from some information about the language. • But once we have done that, what can we say?
What would we like to say? • That the grammar is the smallest, or the best with respect to a score. Combinatorial characterisation • What we really want to say is that, having solved some complex combinatorial question, we have an Occam, compression, MDL or Kolmogorov-like argument proving that what we have found is of interest.
What else might we like to say? • That in the near future, given some string, we can predict if this string belongs to the language or not. • It would be nice to be able to bet £100 on this.
What else would we like to say? • That if the solution we have returned is not good, then that is because the initial data was bad (insufficient, biased). • Idea: blame the data, not the algorithm.
Suppose we cannot say anything of the sort? • Then that means that we may be terribly wrong even in a favourable setting.
Motivating example #4 • Suppose we have an algorithm that ‘learns’ a grammar by applying iteratively the following two operations: • Merge two non-terminals whenever some nice MDL-like rule holds • Add a new non-terminal and rule corresponding to a substring when needed
Two learning operators • Creation of non-terminals and rules • Before: NP → ART ADJ NOUN; NP → ART ADJ ADJ NOUN • After: NP → ART AP1; NP → ART ADJ AP1; AP1 → ADJ NOUN
Merging two non-terminals • Before: NP → ART AP1; NP → ART AP2; AP1 → ADJ NOUN; AP2 → ADJ AP1 • After: NP → ART AP1; AP1 → ADJ NOUN; AP1 → ADJ AP1
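The two operators can be sketched on a toy rule set. This is a minimal illustration, assuming rules are stored as (left-hand side, right-hand side tuple) pairs; the function names `create_nonterminal` and `merge_nonterminals` are hypothetical, and the MDL-like test that decides when to apply each operator is deliberately left out.

```python
def create_nonterminal(rules, substring, new_name):
    """Replace each occurrence of `substring` in a right-hand side by
    `new_name`, then add the rule new_name -> substring."""
    k = len(substring)
    result = set()
    for lhs, rhs in rules:
        out, i = [], 0
        while i < len(rhs):
            if rhs[i:i + k] == substring:
                out.append(new_name)
                i += k
            else:
                out.append(rhs[i])
                i += 1
        result.add((lhs, tuple(out)))
    result.add((new_name, substring))
    return result


def merge_nonterminals(rules, keep, drop):
    """Rename non-terminal `drop` to `keep` in every rule."""
    def rename(symbol):
        return keep if symbol == drop else symbol
    return {(rename(lhs), tuple(rename(s) for s in rhs)) for lhs, rhs in rules}


# Toy data in the spirit of the noun-phrase example (assumed encoding).
rules = {
    ("NP", ("ART", "ADJ", "NOUN")),
    ("NP", ("ART", "ADJ", "ADJ", "NOUN")),
}
step1 = create_nonterminal(rules, ("ADJ", "NOUN"), "AP1")
step2 = create_nonterminal(step1, ("ADJ", "AP1"), "AP2")
step3 = merge_nonterminals(step2, keep="AP1", drop="AP2")
```

After the merge, the rule AP1 → ADJ AP1 is recursive, so the grammar generalises to any number of adjectives, but only through right-linear recursion.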
What is bound to happen? • We will learn a context-free grammar that can only generate a regular language. • Brackets are not found. • This is a hidden bias.
But how do we say that a learning algorithm is good? • By accepting the existence of a target. • The question is that of studying the process of finding this target (or something close to this target). This is an inference process.
If you don’t believe there is a target? • Or that the target belongs to another class • You will have to come up with another bias. For example, believing that simplicity (e.g. MDL) is the correct way to handle the question.
If you are prepared to accept there is a target but… • Either the target is known, and then what is the point of learning? • Or we don’t know it in the practical case (with this data set), and it is of no use…
Careful • Some statements are dangerous • Algorithm A can learn $\{a^n b^n c^n : n \in \mathbb{N}\}$ • Algorithm B can learn this rule with just 2 examples • Looks to me close to wanting a free lunch
A compromise • You only need to believe there is a target while evaluating the algorithm. • Then, in practice, there may not be one!
End of provocative example • If I run my random number generator and get 999999, I can only keep this number if I believe in the generator itself.
Credo (1) • Grammatical inference is about measuring the convergence of a grammar learning algorithm in a typical situation.
Credo (2) • Typical can be: • In the limit: learning is always achieved, one day • Probabilistic • There is a distribution to be used (errors are measurably small) • There is a distribution to be found
Credo (3) • Complexity theory should be used: the total or update runtime, the size of the data needed, the number of mind changes, the number and weight of errors… • …should be measured and limited.
2 Non probabilistic setting • Identification in the limit • Resource bounded identification in the limit • Active learning (query learning)
Identification in the limit • The definitions, presentations • The alternatives • Order free or not • Randomised algorithm
A presentation is a function $f: \mathbb{N} \to X$ where $X$ is any set • yields: Presentations → Languages • If $f(\mathbb{N}) = g(\mathbb{N})$ then yields($f$) = yields($g$)
Learning function • Given a presentation $f$, $f_n$ is the set of the first $n$ elements in $f$. • A learning algorithm $a$ is a function that takes as input a set $f_n = \{f(0), \ldots, f(n-1)\}$ and returns a grammar. • Given a grammar $G$, $L(G)$ is the language generated/recognised/represented by $G$.
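Identification in the limit • [Diagram: a class of languages $\mathcal{L}$, its presentations Pres $\subseteq \mathbb{N} \to X$, and a class of grammars $\mathcal{G}$; yields maps presentations to languages, the learner $a$ maps presentations to grammars, and the naming function $L$ maps grammars to languages] • $f(\mathbb{N}) = g(\mathbb{N}) \Rightarrow$ yields($f$) = yields($g$) • $\exists n \in \mathbb{N} : \forall k > n,\ L(a(f_k)) = $ yields($f$)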
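The convergence condition can be illustrated on a deliberately trivial case. This is a sketch only, assuming the class of finite languages, a learner that returns the set of strings seen so far, and a "grammar" that is simply that set; all names below are hypothetical. Checking a finite prefix can suggest convergence but never prove identification in the limit.

```python
from itertools import cycle, islice


def learner(prefix):
    """Toy learner a: the hypothesis is the finite set of strings seen."""
    return frozenset(prefix)


def language(grammar):
    """Naming function L: here a grammar is its own language."""
    return set(grammar)


def hypotheses(presentation, steps):
    """Return the list [a(f_1), ..., a(f_steps)]."""
    seen, out = [], []
    for x in islice(presentation, steps):
        seen.append(x)
        out.append(learner(seen))
    return out


target = {"a", "ab", "abb"}
f = cycle(["a", "ab", "a", "abb"])  # a presentation whose range is the target
hyps = hypotheses(f, 10)

# First index n such that every observed hypothesis from a(f_n) onwards
# names the target language (observed on this finite prefix only).
converged_at = next(
    n for n in range(1, len(hyps) + 1)
    if all(language(h) == target for h in hyps[n - 1:])
)
```

On this presentation the learner settles once "abb" has been seen, and the hypothesis stays the same from then on.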
What about efficiency? • We can try to bound • global time • update time • errors before converging • mind changes • queries • good examples needed
What should we try to measure? • The size of G ? • The size of L ? • The size of f ? • The size of fn ?
Some candidates for polynomial learning • Total runtime polynomial in ║L║ • Update runtime polynomial in ║L║ • # mind changes polynomial in ║L║ • # implicit prediction errors polynomial in ║L║ • Size of characteristic sample polynomial in ║L║
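[Diagram: the learner $a$ is applied to the successive prefixes $f_1, f_2, \ldots, f_n, \ldots, f_k$ of the presentation $f(0), f(1), \ldots, f(n-1), \ldots, f(k)$ and returns the grammars $G_1, G_2, \ldots, G_n, \ldots, G_n$: after step $n$ the hypothesis no longer changes]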
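One of the quantities listed as a candidate for bounding, the number of mind changes, is easy to measure on a run. A minimal sketch, assuming the learner is any function from a prefix to a comparable hypothesis; `run_learner` and `count_mind_changes` are illustrative names, and `frozenset` stands in for a real learner.

```python
def run_learner(learner, prefix):
    """Apply the learner to f_1, f_2, ..., f_n and collect the hypotheses."""
    return [learner(prefix[:n]) for n in range(1, len(prefix) + 1)]


def count_mind_changes(hypotheses):
    """Count the steps at which the hypothesis differs from the previous one."""
    return sum(1 for prev, cur in zip(hypotheses, hypotheses[1:]) if prev != cur)


prefix = ["a", "ab", "a", "abb", "a", "ab"]
grammars = run_learner(frozenset, prefix)  # frozenset: toy 'set of seen strings' learner
changes = count_mind_changes(grammars)
```

Here the hypothesis changes when "ab" and then "abb" first appear, and is stable afterwards.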
3 Probabilistic setting • Using the distribution to measure error • Identifying the distribution • Approximating the distribution
Probabilistic settings • PAC learning • Identification with probability 1 • PAC learning distributions
Learning a language from sampling • We have a distribution over $\Sigma^*$ • We sample twice: • Once to learn • Once to see how well we have learned • The PAC setting: probably approximately correct
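The "sample twice" protocol can be sketched as follows. Everything here is an assumption made for illustration: the toy distribution over strings, the target language (even number of `a`), and the memorising learner are hypothetical stand-ins, not a method from the talk.

```python
import random


def sample_string(rng, alphabet="ab", stop=0.3):
    """Draw from a toy distribution over strings: stop with probability
    `stop` at each step, otherwise append a uniformly chosen symbol."""
    symbols = []
    while rng.random() >= stop:
        symbols.append(rng.choice(alphabet))
    return "".join(symbols)


def target(s):
    """Hypothetical target language: strings with an even number of 'a'."""
    return s.count("a") % 2 == 0


def learn(labelled):
    """Toy learner: memorise the positive examples, reject everything else."""
    positives = {s for s, label in labelled if label}
    return lambda s: s in positives


rng = random.Random(0)

# First sample: used to learn.
train = [sample_string(rng) for _ in range(500)]
hypothesis = learn([(s, target(s)) for s in train])

# Second sample, from the same distribution: used to estimate the error.
held_out = [sample_string(rng) for _ in range(500)]
error = sum(hypothesis(s) != target(s) for s in held_out) / len(held_out)
```

The estimated error only means something because both samples come from the same distribution; it says nothing about strings that distribution almost never produces.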
PAC learning (Valiant 84, Pitt 89) • L a set of languages • G a set of grammars • $\varepsilon > 0$ and $\delta > 0$ • $m$ a maximal length over the strings • $n$ a maximal size of grammars
Polynomially PAC learnable • There is an algorithm that samples reasonably (a polynomial number of examples) and returns, with probability at least $1-\delta$, a grammar whose error is at most $\varepsilon$.
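One common way to write this condition is given below, as a sketch. The notation is assumed rather than taken from the slides: $D$ is the sampling distribution, $L_T$ the target language, $S$ a sample of $s$ strings drawn from $D$, $G_S$ the grammar returned on $S$, and $\triangle$ the symmetric difference.

```latex
\Pr_{S \sim D^{s}}\Big[\, \Pr_{x \sim D}\big[\, x \in L(G_S) \,\triangle\, L_T \,\big] \le \varepsilon \Big] \;\ge\; 1 - \delta,
\qquad
s \le \mathrm{poly}\!\left(\tfrac{1}{\varepsilon},\ \tfrac{1}{\delta},\ m,\ n\right)
```

The inner probability is the error of the returned grammar; the outer one accounts for the chance of drawing an unrepresentative sample.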
Results • Using cryptographic assumptions, we cannot PAC learn DFA. • Cannot PAC learn NFA, CFGs with membership queries either.
Learning distributions • No error • Small error
No error • This calls for identification in the limit with probability 1. • Means that the probability of not converging is 0.
Results • If probabilities are computable, we can learn with probability 1 finite state automata. • But not with bounded (polynomial) resources.
With error • PAC definition • But error should be measured by a distance between the target distribution and the hypothesis • $L_1$, $L_2$, $L_\infty$?
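The candidate distances can be compared on a small case. A minimal sketch, assuming the candidates are the usual L1, L2 and L∞ norms of the difference, and that both distributions have finite support and are given as dictionaries from strings to probabilities; the two distributions below are made up for illustration.

```python
import math


def distances(p, q):
    """Return the (L1, L2, L-infinity) distances between two distributions
    given as dicts mapping strings to probabilities."""
    support = set(p) | set(q)
    diffs = [abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support]
    l1 = sum(diffs)
    l2 = math.sqrt(sum(d * d for d in diffs))
    linf = max(diffs)
    return l1, l2, linf


target_dist = {"": 0.5, "a": 0.25, "aa": 0.25}
hypothesis_dist = {"": 0.5, "a": 0.5}
l1, l2, linf = distances(target_dist, hypothesis_dist)
```

The three values differ on the same pair of distributions, which is why the choice of distance has to be part of the definition of "small error".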