Grammatical inference: techniques and algorithms Colin de la Higuera
Acknowledgements • Laurent Miclet, Tim Oates, Jose Oncina, Rafael Carrasco, Paco Casacuberta, Pedro Cruz, Rémi Eyraud, Philippe Ezequel, Henning Fernau, Jean-Christophe Janodet, Thierry Murgue, Frédéric Tantini, Franck Thollard, Enrique Vidal,... • … and a lot of other people to whom I am grateful
Outline • 1 An introductory example • 2 About grammatical inference • 3 Some specificities of the task • 4 Some techniques and algorithms • 5 Open issues and questions
1 How do we learn languages? A very simple example
The problem: • You are in an unknown city and have to eat. • You therefore visit a number of selected restaurants. • Your goal is to build a model of the city (a map).
The data • Up Down Right Left Left → Restaurant • Down Down Right → Not a restaurant • Left Down → Restaurant
Hopefully something like this: [automaton diagram: states labelled R (restaurant) or N (not a restaurant), transitions on u, d, l, r]
[A larger automaton of the same kind, with R and N states connected by u, d, l, r transitions]
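A natural first step, shown as a minimal sketch below (an illustration, not part of the original tutorial), is to build a prefix-tree acceptor (PTA) from the labelled walks; state-merging algorithms such as RPNI then generalise it into automata like the ones above. Encoding the four directions as u, d, l, r is an assumption.

```python
# Minimal sketch (illustrative): build a prefix-tree acceptor (PTA)
# from walks labelled "restaurant" (True) or "not a restaurant" (False).

def build_pta(samples):
    """samples: iterable of (walk, is_restaurant) pairs over {u, d, l, r}."""
    trans = {}             # (state, symbol) -> state
    label = {0: None}      # state -> True / False / None (not yet known)
    fresh = 1              # next unused state number
    for walk, positive in samples:
        q = 0
        for sym in walk:
            if (q, sym) not in trans:
                trans[(q, sym)] = fresh
                label[fresh] = None
                fresh += 1
            q = trans[(q, sym)]
        label[q] = positive
    return trans, label

# The three walks from the data slide:
samples = [("udrll", True),   # Up Down Right Left Left -> restaurant
           ("ddr",   False),  # Down Down Right -> not a restaurant
           ("ld",    True)]   # Left Down -> restaurant
trans, label = build_pta(samples)
print(len(label), "states")   # 11 states: the root plus one per symbol read
```

The PTA accepts exactly the positive walks seen so far; learning then consists in merging its states to obtain a small automaton consistent with the data.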
Further arguments (1) • How did we get hold of the data? • Random walks • Following someone: • Someone knowledgeable • Someone trying to lose us • Someone on a diet • Exploring
Further arguments (2) • Could we not have better information (for example the names of the restaurants)? • But then we might only have information about the routes to restaurants (not to the “non-restaurants”)…
Further arguments (3) What if instead of getting the information “Elimo” or “restaurant”, I get the information “good meal” or “7/10”? Reinforcement learning: POMDP
Further arguments (4) • Where is my algorithm to learn these things? • Should I perhaps consider several algorithms for the different types of data?
Further arguments (5) • What can I say about the result? • What can I say about the algorithm?
Further arguments (6) • What if I want something richer than an automaton? • A context-free grammar • A transducer • A tree automaton…
Further arguments (7) • Why do I want something as rich as an automaton? • What about • A simple pattern? • Some SVM obtained from features over the strings? • A neural network that would tell me, with high probability, whether some path leads to a restaurant?
Our goal/idea • The ancient Greeks: the whole is more than the sum of its parts • Gestalt theory: the whole is different from the sum of its parts
Better said • There are cases where the data cannot be analyzed piece by piece • There are cases where the intelligibility of the pattern is important
What do people know about formal language theory? Nothing … Lots
A small reminder on formal language theory • The Chomsky hierarchy • Pros and cons of grammars
A crash course in Formal language theory • Symbols • Strings • Languages • Chomsky hierarchy • Stochastic languages
Symbols are taken from some alphabet Σ. Strings are sequences of symbols from Σ.
Languages are sets of strings over Σ: languages are subsets of Σ*.
Special languages • Are recognised by finite state automata • Are generated by grammars
[DFA diagram over {a, b}] DFA: Deterministic Finite State Automaton
[The same DFA, showing acceptance of the string abab: abab ∈ L]
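As a minimal illustration of how a DFA decides membership in one left-to-right pass, here is a sketch. The transition table is an assumption (the original diagram is not recoverable); this particular DFA accepts the strings over {a, b} that end in b, so abab is accepted.

```python
# Minimal sketch (illustrative): DFA membership is decided in a single
# pass, i.e. in time linear in the length of the string.

def dfa_accepts(s, delta, start, finals):
    q = start
    for c in s:
        q = delta[(q, c)]   # deterministic: exactly one move per symbol
    return q in finals

# Illustrative two-state DFA over {a, b}: accepts strings ending in b.
delta = {(0, 'a'): 0, (0, 'b'): 1,
         (1, 'a'): 0, (1, 'b'): 1}
print(dfa_accepts("abab", delta, start=0, finals={1}))  # True
print(dfa_accepts("aba",  delta, start=0, finals={1}))  # False
```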
What is a context-free grammar? A 4-tuple (Σ, S, V, P) such that: • Σ is the alphabet; • V is a finite set of non-terminals; • S ∈ V is the start symbol; • P ⊆ V × (V ∪ Σ)* is a finite set of rules.
Example of a grammar The Dyck1 grammar • (Σ, S, V, P) • Σ = {a, b} • V = {S} • P = {S → aSbS, S → ε}
Derivations and derivation trees S ⇒ aSbS ⇒ aaSbSbS ⇒ aabSbS ⇒ aabbS ⇒ aabb [derivation tree of aabb shown alongside]
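Reading a as an opening bracket and b as a closing one, Dyck1 is the language of well-balanced strings, so membership can be tested with a simple counter. The sketch below is an illustration of this particular language, not of a general CFG parser:

```python
# Minimal sketch (illustrative): membership in the Dyck1 language of
# S -> aSbS | epsilon. A string belongs iff the count of pending a's
# never goes negative and ends at zero.

def in_dyck1(s):
    depth = 0
    for c in s:                      # alphabet assumed to be {a, b}
        depth += 1 if c == 'a' else -1
        if depth < 0:                # a 'b' with no matching 'a'
            return False
    return depth == 0

print(in_dyck1("aabb"))  # True, matching the derivation above
print(in_dyck1("abba"))  # False
```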
Chomsky Hierarchy • Level 0: no restriction • Level 1: context-sensitive • Level 2: context-free • Level 3: regular
Chomsky Hierarchy • Level 0: whatever Turing machines can do • Level 1: • {aⁿbⁿcⁿ : n ∈ ℕ} • {aⁿbᵐcⁿdᵐ : n, m ∈ ℕ} • {uu : u ∈ Σ*} • Level 2: context-free • {aⁿbⁿ : n ∈ ℕ} • brackets • Level 3: regular • Regular expressions (GREP)
The membership problem • Level 0: undecidable • Level 1: decidable • Level 2: polynomial • Level 3: linear
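The polynomial bound for level 2 is achieved, for instance, by the CYK algorithm, which decides membership in O(n³) once the grammar is in Chomsky normal form. A minimal sketch, with an illustrative CNF grammar for {aⁿbⁿ : n ≥ 1}:

```python
# Minimal CYK sketch (illustrative): membership for a CFG in Chomsky
# normal form in O(n^3). CNF grammar for {a^n b^n : n >= 1}:
#   S -> A B | A T,   T -> S B,   A -> a,   B -> b
unary  = {'a': {'A'}, 'b': {'B'}}
binary = {('A', 'B'): {'S'}, ('A', 'T'): {'S'}, ('S', 'B'): {'T'}}

def cyk(w):
    n = len(w)
    if n == 0:
        return False
    # table[i][j] = nonterminals deriving the substring of length j+1 at i
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, c in enumerate(w):
        table[i][0] = set(unary.get(c, set()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for k in range(1, span):             # length of the left part
                for x in table[i][k - 1]:
                    for y in table[i + k][span - k - 1]:
                        table[i][span - 1] |= binary.get((x, y), set())
    return 'S' in table[0][n - 1]

print(cyk("aabb"), cyk("aab"))  # True False
```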
The equivalence problem • Level 0: undecidable • Level 1: undecidable • Level 2: undecidable • Level 3: polynomial, but only when the representation is a DFA.
[PFA diagram over {a, b}, with probabilistic transitions] PFA: Probabilistic Finite (state) Automaton
[DPFA diagram over {a, b}, with transition probabilities and stopping probabilities] DPFA: Deterministic Probabilistic Finite (state) Automaton
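In a DPFA every state has a stopping probability, the probabilities leaving each state sum to one, and each string follows a unique path; its probability is the product of the transition probabilities along that path times the stopping probability of the last state. A minimal sketch with illustrative numbers (the values in the original diagram are not fully recoverable):

```python
# Minimal sketch (illustrative): probability of a string under a DPFA.
# trans maps (state, symbol) to (next state, probability); stop gives the
# stopping probability of each state. Outgoing mass per state sums to 1.

trans = {(0, 'a'): (1, 0.7),  (0, 'b'): (0, 0.2),
         (1, 'a'): (1, 0.35), (1, 'b'): (0, 0.35)}
stop = {0: 0.1, 1: 0.3}       # 0.7+0.2+0.1 = 1 and 0.35+0.35+0.3 = 1

def dpfa_prob(s, start=0):
    q, p = start, 1.0
    for c in s:
        q, pc = trans[(q, c)]
        p *= pc
    return p * stop[q]

print(dpfa_prob("ab"))   # 0.7 * 0.35 * 0.1 = 0.0245
```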
What is nice with grammars? • Compact representation • Recursivity • Says how a string belongs, not just whether it belongs • Graphical representations (automata, parse trees)
What is not so nice with grammars? • Even the easiest class (level 3) contains SAT, Boolean functions, parity functions… • Noise is very harmful: • Think of adding edit noise to the language {w : |w|a ≡ 0 (mod 2) and |w|b ≡ 0 (mod 2)}, i.e. strings with an even number of a's and an even number of b's
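That parity language is regular: a four-state DFA tracks the parity of a's and of b's. It also shows why edit noise is so harmful, since a single inserted, deleted, or substituted symbol flips a parity and changes membership. A minimal sketch:

```python
# Minimal sketch (illustrative): the language {w : |w|_a and |w|_b both
# even} over {a, b}. One bit per parity = a four-state DFA.

def in_parity_language(w):
    pa = pb = 0
    for c in w:
        if c == 'a':
            pa ^= 1
        else:            # alphabet assumed to be {a, b}
            pb ^= 1
    return pa == 0 and pb == 0

print(in_parity_language("abab"))  # True: two a's and two b's
print(in_parity_language("aab"))   # False: one b
```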
2 Specificities of grammatical inference Grammatical inference consists (roughly) in finding the (or a) grammar or automaton that has produced a given set of strings (sequences, trees, terms, graphs).
The field [diagram: Grammatical Inference at the crossroads of Inductive Inference, Pattern Recognition, and Machine Learning, with application areas: computational linguistics, computational biology, Web technologies]
The data • Strings, trees, terms, graphs • Structural objects • Basically the same information gap as in programming between flat tables/arrays and structured data types
Alternatives to grammatical inference • 2 steps: • Extract features from the strings • Use a very good method over ℝⁿ.
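A minimal sketch of this two-step alternative (the choice of bigram counts as features is an assumption for illustration): map each string to a vector in ℝⁿ, then hand the vectors to any good vector-space learner (SVM, neural network, …).

```python
# Minimal sketch (illustrative): step 1 of the alternative pipeline,
# turning strings into bigram-count vectors in R^n; step 2 would feed
# these vectors to any off-the-shelf learner.

from itertools import product

ALPHABET = "ab"
BIGRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]

def features(s):
    """Count each bigram's occurrences in s."""
    return [sum(1 for i in range(len(s) - 1) if s[i:i + 2] == bg)
            for bg in BIGRAMS]

print(BIGRAMS)            # ['aa', 'ab', 'ba', 'bb']
print(features("aabab"))  # [1, 2, 1, 0]
```

The price of this route is the point made earlier: the learned model over ℝⁿ loses the intelligibility of a grammar or automaton.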
Examples of strings A string in Gaelic and its translation to English: • Tha thu cho duaichnidh ri èarr àirde de a’ coisich deas damh • You are as ugly as the north end of a southward traveling ox
>A BAC=41M14 LIBRARY=CITB_978_SKB AAGCTTATTCAATAGTTTATTAAACAGCTTCTTAAATAGGATATAAGGCAGTGCCATGTA GTGGATAAAAGTAATAATCATTATAATATTAAGAACTAATACATACTGAACACTTTCAAT GGCACTTTACATGCACGGTCCCTTTAATCCTGAAAAAATGCTATTGCCATCTTTATTTCA GAGACCAGGGTGCTAAGGCTTGAGAGTGAAGCCACTTTCCCCAAGCTCACACAGCAAAGA CACGGGGACACCAGGACTCCATCTACTGCAGGTTGTCTGACTGGGAACCCCCATGCACCT GGCAGGTGACAGAAATAGGAGGCATGTGCTGGGTTTGGAAGAGACACCTGGTGGGAGAGG GCCCTGTGGAGCCAGATGGGGCTGAAAACAAATGTTGAATGCAAGAAAAGTCGAGTTCCA GGGGCATTACATGCAGCAGGATATGCTTTTTAGAAAAAGTCCAAAAACACTAAACTTCAA CAATATGTTCTTTTGGCTTGCATTTGTGTATAACCGTAATTAAAAAGCAAGGGGACAACA CACAGTAGATTCAGGATAGGGGTCCCCTCTAGAAAGAAGGAGAAGGGGCAGGAGACAGGA TGGGGAGGAGCACATAAGTAGATGTAAATTGCTGCTAATTTTTCTAGTCCTTGGTTTGAA TGATAGGTTCATCAAGGGTCCATTACAAAAACATGTGTTAAGTTTTTTAAAAATATAATA AAGGAGCCAGGTGTAGTTTGTCTTGAACCACAGTTATGAAAAAAATTCCAACTTTGTGCA TCCAAGGACCAGATTTTTTTTAAAATAAAGGATAAAAGGAATAAGAAATGAACAGCCAAG TATTCACTATCAAATTTGAGGAATAATAGCCTGGCCAACATGGTGAAACTCCATCTCTAC TAAAAATACAAAAATTAGCCAGGTGTGGTGGCTCATGCCTGTAGTCCCAGCTACTTGCGA GGCTGAGGCAGGCTGAGAATCTCTTGAACCCAGGAAGTAGAGGTTGCAGTAGGCCAAGAT GGCGCCACTGCACTCCAGCCTGGGTGACAGAGCAAGACCCTATGTCCAAAAAAAAAAAAA AAAAAAAGGAAAAGAAAAAGAAAGAAAACAGTGTATATATAGTATATAGCTGAAGCTCCC TGTGTACCCATCCCCAATTCCATTTCCCTTTTTTGTCCCAGAGAACACCCCATTCCTGAC TAGTGTTTTATGTTCCTTTGCTTCTCTTTTTAAAAACTTCAATGCACACATATGCATCCA TGAACAACAGATAGTGGTTTTTGCATGACCTGAAACATTAATGAAATTGTATGATTCTAT
<book> <part> <chapter> <sect1/> <sect1> <orderedlist numeration="arabic"> <listitem/> <f:fragbody/> </orderedlist> </sect1> </chapter> </part> </book>
<?xml version="1.0"?><?xml-stylesheet href="carmen.xsl" type="text/xsl"?><?cocoon-process type="xslt"?> <!DOCTYPE pagina [<!ELEMENT pagina (titulus?, poema)><!ELEMENT titulus (#PCDATA)><!ELEMENT auctor (praenomen, cognomen, nomen)><!ELEMENT praenomen (#PCDATA)><!ELEMENT nomen (#PCDATA)><!ELEMENT cognomen (#PCDATA)><!ELEMENT poema (versus+)><!ELEMENT versus (#PCDATA)>]> <pagina><titulus>Catullus II</titulus><auctor><praenomen>Gaius</praenomen><nomen>Valerius</nomen><cognomen>Catullus</cognomen></auctor>