Human Computer Studies 2004: a good year for Computational Grammar Induction January 2005 Pieter Adriaans Universiteit van Amsterdam pietera@science.uva.nl http://turing.wins.uva.nl/~pietera/ALS/
GI Research Questions
• Research Question: What is the complexity of human language?
• Research Question: Can we make a formal model of the language development of young children that allows us to understand:
  • Why is the process efficient?
  • Why is the process discontinuous?
• Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
• Research Question: Semantic learning: e.g. can we construct ontologies for specific domains from (scientific) text?
Chomsky Hierarchy and the complexity of Human Language
[Figure: the nested Chomsky hierarchy (Finite ⊂ Regular ⊂ Context-free ⊂ Context-sensitive ⊂ Type-0) with human languages marked within it.]
Complexity of Natural Language: Zipf distribution
[Figure: Zipf word-frequency curve with a structured high-frequency core and a heavy low-frequency tail.]
Observations
• Word frequencies in human utterances are dominated by power laws:
  • a high-frequency core
  • a heavy low-frequency tail
  • open versus closed word classes (function words)
• Natural language is open and grammar is elastic: the occurrence of new words is a natural phenomenon, so syntactic/semantic bootstrapping must play an important role in language learning.
• Bootstrapping will be important for ontology learning as well as for child language acquisition.
• A better understanding of NL distributions is necessary (see the sketch below).
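As a concrete illustration of the power-law point above (not part of the original talk), the sketch below estimates the Zipf exponent of word frequencies in a toy corpus via a log-log least-squares fit; the function name and the toy token list are my own.

```python
# A minimal sketch (illustration only): checking whether word frequencies
# follow a Zipf-like power law, frequency ~ rank^(-s).
from collections import Counter
import math

def zipf_exponent(tokens):
    """Estimate the Zipf exponent s by least squares on log(rank) vs log(freq)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope   # s close to 1 is the classic Zipf regime

tokens = "the cat saw the dog and the dog saw the cat".split()
print(zipf_exponent(tokens))
```

On a real corpus one would of course fit only the high-frequency core; the heavy tail is exactly where the estimate becomes unstable.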
Learning NL from text: probabilistic versus recursion-theoretic approaches
• 1967, Gold: any super-finite class of languages (which includes the regular languages and everything above them in the Chomsky hierarchy) cannot be identified in the limit from positive data.
• 1969, Horning: probabilistic context-free grammars can be learned from positive data. Given a text T and two grammars G1 and G2, we are able to approximate max({P(G1|T), P(G2|T)}).
• ICGI, from 1990 on: the empirical approach. Just build algorithms and try them. Approximate NL from below: Finite → Regular → Context-free → Context-sensitive.
Situation < 2004
• GI seems to be hard: no identification in the limit.
• Ill-understood power laws dominate (word) frequencies in human communication.
• Machine learning algorithms have difficulties in these domains; PAC learning does not converge on them.
• Nowhere near learning natural languages.
• We were running out of ideas.
Situation < 2004: Learning Regular Languages
• Reasonable success in learning regular languages of moderate complexity (Evidence-Driven State Merging, Blue-Fringe).
• Transparent representation: Deterministic Finite Automata (DFA); see the sketch below.
• DEMO
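For readers unfamiliar with the representation, here is a minimal sketch of the DFA data structure that state-merging learners such as EDSM/Blue-Fringe operate on. It shows only the representation and the acceptance test, not the merging algorithm; the class name and the example language are mine.

```python
# Minimal sketch of a DFA: the "transparent representation" mentioned above.
class DFA:
    def __init__(self, start, accepting, transitions):
        self.start = start                  # start state
        self.accepting = set(accepting)     # accepting states
        self.transitions = transitions      # dict: (state, symbol) -> state

    def accepts(self, word):
        state = self.start
        for symbol in word:
            key = (state, symbol)
            if key not in self.transitions:
                return False                # missing transition: reject
            state = self.transitions[key]
        return state in self.accepting

# (ab)* as an example target language
dfa = DFA(start=0, accepting={0},
          transitions={(0, 'a'): 1, (1, 'b'): 0})
print(dfa.accepts("abab"), dfa.accepts("aba"))   # True False
```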
Situation < 2004: Learning Context-free Languages
• A number of approaches: learning probabilistic CFGs, the Inside-Outside algorithm, EMILE, ABL.
• No transparent representation: Pushdown Automata (PDA) are not really helpful for modelling the learning process.
• No adequate convergence on interesting real-life corpora.
• The problem of sparse data sets.
• Complexity issues are ill-understood.
EMILE: natural language allows bootstrapping
• Lewis Carroll's famous poem 'Jabberwocky' starts with:
  'Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe;
  All mimsy were the borogoves,
  And the mome raths outgrabe.
EMILE: Characteristic Expressions and Contexts
• An expression of a type T is characteristic for T if it only appears with contexts of type T.
• Similarly, a context of a type T is characteristic for T if it only appears with expressions of type T.
• Let G be a grammar (context-free or otherwise) of a language L. G has context separability if each type of G has a characteristic context, and expression separability if each type of G has a characteristic expression.
• Natural languages seem to be context- and expression-separable.
• This is nothing but stating that languages can define their own concepts internally ("... is a noun", "... is a verb").
EMILE: Natural languages are shallow
• A class of languages C is shallow if for each language L it is possible to find a context- and expression-separable grammar G, and a set of sentences S inducing characteristic contexts and expressions for all the types of G, such that the size of S and the length of the sentences of S are logarithmic in the descriptive length of L (relative to C).
• This seems to hold for natural languages: large dictionaries, low thickness.
The EMILE learning algorithm
• One can prove that, using clustering techniques, shallow CFGs can be learned efficiently from positive examples drawn under m.
• General idea: every sentence is split into an expression and a surrounding context (sketched below).
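A hedged sketch of the "sentence = expression + context" decomposition: every cut of a sentence into a left part, a candidate expression, and a right part yields one (context, expression) observation for the clustering step. The function name and the "_" placeholder notation are my own; the EMILE implementation differs in detail.

```python
# Enumerate all (context, expression) splits of a sentence, the raw
# observations that EMILE-style clustering works on.
def context_expression_pairs(sentence):
    words = sentence.split()
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            expression = " ".join(words[i:j])
            context = (" ".join(words[:i]) + " _ " + " ".join(words[j:])).strip()
            pairs.append((context, expression))
    return pairs

for ctx, expr in context_expression_pairs("John makes coffee"):
    print(f"{ctx!r:25} {expr!r}")
```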
EMILE 4.1 (2000): Vervoort
• Unsupervised
• Two-dimensional clustering: random search for maximal blocks in the matrix
• Incremental: thresholds for the filling degree of blocks
• Simple (but sloppy) rule induction using characteristic expressions
Clustering (2-dimensional)
• John makes coffee
• John likes coffee
• John is eating
• John makes tea
• John likes tea
• John likes eating
EMILE 4.1: Clustering sparse matrices of contexts and expressions
[Figure: sparse boolean matrix with contexts as rows and expressions as columns; filled blocks mark characteristic contexts and characteristic expressions.]
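To make the clustering idea concrete, the sketch below builds a context-by-expression table from the six sentences on the previous slide and prints which expressions each context accepts; a block of contexts sharing a set of expressions is a candidate type. This illustrates the idea only and is not the EMILE 4.1 code.

```python
# Two-dimensional clustering idea: contexts x expressions as a boolean matrix.
from collections import defaultdict

sentences = [
    "John makes coffee", "John likes coffee", "John is eating",
    "John makes tea", "John likes tea", "John likes eating",
]

matrix = defaultdict(set)   # context -> set of expressions seen in it
for s in sentences:
    words = s.split()
    context, expression = " ".join(words[:-1]) + " _", words[-1]
    matrix[context].add(expression)

for context, expressions in matrix.items():
    print(f"{context:15} {sorted(expressions)}")
# 'John makes _' and 'John likes _' share {'coffee', 'tea'}:
# a filled block, hence one candidate type for those expressions.
```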
Original grammar
• S → NP V_i ADV | NP_a VP_a | NP_a V_s that S
• NP → NP_a | NP_p
• VP_a → V_t NP | V_t NP P NP_p
• NP_a → John | Mary | the man | the child
• NP_p → the car | the city | the house | the shop
• P → with | near | in | from
• V_i → appears | is | seems | looks
• V_s → thinks | hopes | tells | says
• V_t → knows | likes | misses | sees
• ADV → large | small | ugly | beautiful
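For reference, training examples like the ones used in the next slide could be produced by sampling from the grammar above. The sketch assumes uniform random choice among the alternatives of each rule, which may differ from the actual experimental setup.

```python
# Sketch: generate random sentences from the toy grammar above.
import random

GRAMMAR = {
    "S":    [["NP", "V_i", "ADV"], ["NP_a", "VP_a"], ["NP_a", "V_s", "that", "S"]],
    "NP":   [["NP_a"], ["NP_p"]],
    "VP_a": [["V_t", "NP"], ["V_t", "NP", "P", "NP_p"]],
    "NP_a": [["John"], ["Mary"], ["the man"], ["the child"]],
    "NP_p": [["the car"], ["the city"], ["the house"], ["the shop"]],
    "P":    [["with"], ["near"], ["in"], ["from"]],
    "V_i":  [["appears"], ["is"], ["seems"], ["looks"]],
    "V_s":  [["thinks"], ["hopes"], ["tells"], ["says"]],
    "V_t":  [["knows"], ["likes"], ["misses"], ["sees"]],
    "ADV":  [["large"], ["small"], ["ugly"], ["beautiful"]],
}

def generate(symbol="S"):
    if symbol not in GRAMMAR:                  # terminal
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    return [w for part in production for w in generate(part)]

for _ in range(3):
    print(" ".join(generate()))
```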
Learned grammar after 100,000 examples
• [0] → [17] [6]
• [0] → [17] [22] [17] [6]
• [0] → [17] [22] [17] [22] [17] [22] [17] [6]
• [6] → misses [17] | likes [17] | knows [17] | sees [17]
• [6] → [22] [17] [6]
• [6] → appears [34] | looks [34] | is [34] | seems [34]
• [6] → [6] near [17] | [6] from [17] | [6] in [17] | [6] [6] with [17]
• [17] → the child | Mary | the city | the man | John | the car | the house | the shop
• [22] → tells that | thinks that | hopes that | says that
• [34] → small | beautiful | large | ugly
Bible books
• King James version
• 31,102 verses in 82,935 lines
• 4.8 MB of English text
• 001:001 In the beginning God created the heaven and the earth.
• 66 experiments with increasing sample size
• Initially the Book of Genesis, the Book of Exodus, …
• Full run: 40 minutes, 500 MB on an Ultra-2 SPARC
GI on the Bible
• [0] → Thou shalt not [582]
• [0] → Neither shalt thou [582]
• [582] → eat it
• [582] → kill .
• [582] → commit adultery .
• [582] → steal .
• [582] → bear false witness against thy neighbour .
• [582] → abhor an Edomite
Knowledge base in the Bible
• Dictionary Type [76]: Esau, Isaac, Abraham, Rachel, Leah, Levi, Judah, Naphtali, Asher, Benjamin, Eliphaz, Reuel, Anah, Shobal, Ezer, Dishan, Pharez, Manasseh, Gershon, Kohath, Merari, Aaron, Amram, Mushi, Shimei, Mahli, Joel, Shemaiah, Shem, Ham, Salma, Laadan, Zophah, Elpaal, Jehieli
• Dictionary Type [362]: plague, leprosy
• Dictionary Type [414]: Simeon, Judah, Dan, Naphtali, Gad, Asher, Issachar, Zebulun, Benjamin, Gershom
• Dictionary Type [812]: two, three, four
• Dictionary Type [1056]: priests, Levites, porters, singers, Nethinims
• Dictionary Type [978]: afraid, glad, smitten, subdued
• Dictionary Type [2465]: holy, rich, weak, prudent
• Dictionary Type [3086]: Egypt, Moab, Dumah, Tyre, Damascus
• Dictionary Type [4082]: heaven, Jerusalem
Evaluation
• + Works efficiently on large corpora
• + Learns (partial) grammars
• + Unsupervised
• - EMILE 4.1 needs a lot of input
• - Convergence to meaningful syntactic types is rarely observed
• - Types seem to be semantic rather than syntactic
• Why? Hypothesis: the distribution in real-life text is semantic, not syntactic. But, most of all: sparse data!!!
2004 Omphalos Competition: Starkie & van Zaanen
• Unsupervised learning of context-free grammars
• Deliberately constructed to be beyond the current state of the art
• A theoretical brute-force learner constructs all possible CFGs consistent with a given set of positive examples O.
• A complexity measure for CFGs
• There are only 2^(i(2|Oj|−2)+1) + (i(2|Oj|−2)+1) · T(O) of these grammars, where T(O) is the number of terminals!!
2004 Omphalos Competition: Starkie & van Zaanen
• Let Σ be an alphabet and Σ* the set of all strings over Σ.
• L(G) = S is the language generated by a grammar G.
• C_G ⊆ S is a characteristic sample for G (S is infinite).
• O is the finite Omphalos sample; it is far smaller than the characteristic sample C_G.
Bad news for distributional analysis (EMILE, ABL, Inside-Outside)
• w ∈ {a^n b^n}: sample strings aeb, aaebb, aaaebbb, …
  PDA transitions: [a,S] push X; [a,X] push X; [b,X] pop X; [e,X] no-op; [ε,X] pop
• w ∈ {{a,c}^n {b,d}^n}: sample strings aeb, ceb, aed, ced, aaebb, aaebd, caebb, caebd, acebb, …, aaaebbb, …
  PDA transitions: additionally [c,S] push X; [c,X] push X; [d,X] pop X
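The sample strings above carry a middle marker e, so the first language effectively reads as {a^n e b^n}. The sketch below simulates the listed transitions as a simple pushdown recognizer; the extra flag that remembers whether e has been read, and the end-of-input acceptance (pop the initial S, accept on empty stack), are my reading of the diagram, not something stated on the slide.

```python
# Sketch of a pushdown recognizer for { a^n e b^n : n >= 1 },
# following the transitions listed above.
def accepts(word):
    stack = ["S"]
    seen_e = False
    for symbol in word:
        top = stack[-1] if stack else None
        if symbol == "a" and not seen_e and top in ("S", "X"):
            stack.append("X")            # [a,S] / [a,X]: push X
        elif symbol == "e" and not seen_e and top == "X":
            seen_e = True                # [e,X]: no-op (remember the marker)
        elif symbol == "b" and seen_e and top == "X":
            stack.pop()                  # [b,X]: pop X
        else:
            return False
    if seen_e and stack == ["S"]:
        stack.pop()                      # end of input: pop S, accept on empty stack
    return not stack

for w in ["aeb", "aaebb", "aaaebbb", "aebb", "ab"]:
    print(w, accepts(w))                 # True True True False False
```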
Bad news for distributional analysis (EMILE, ABL, Inside-Outside)
• Context/expression splits: a aeb b, a aeb d, c aeb b, c aeb d, a ceb b, a ceb d, c ceb b, c ceb d, …, a aeb b, a aaebb b
• We need large corpora to make distributional analysis work. The Omphalos samples are way too small!!
Omphalos won by Alexander Clark: some good ideas!
Approach:
• Exploit useful properties that randomly generated grammars are likely to have.
• Identifying constituents: measure the local mutual information between the symbol before and the symbol after a candidate [Clark, 2001]. More reliable than other information-theoretic constituent-boundary tests [Lamb, 1961] [Brill et al., 1990].
• Under benign distributions, substrings that cross constituent boundaries will have zero mutual information, while structures that do not cross constituent boundaries will have non-zero mutual information (see the sketch below).
• Analysis of cliques of strings that might be constituents (much like the clusters in EMILE).
• Most hard problems in Omphalos are still open!!
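A hedged sketch of the flanking-symbol mutual-information test described above: for a candidate string, collect the symbol immediately before and the symbol immediately after each occurrence and estimate the mutual information between those two positions. Smoothing and the significance test used in Clark's actual system are omitted; the function and variable names are mine.

```python
# Estimate MI between the left and right flanking symbols of a candidate.
# High MI suggests a constituent; MI near zero suggests the candidate
# crosses a constituent boundary.
from collections import Counter
from math import log2

def flanking_mi(sentences, candidate):
    cand = candidate.split()
    pairs = []
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for i in range(1, len(words) - len(cand)):
            if words[i:i + len(cand)] == cand:
                pairs.append((words[i - 1], words[i + len(cand)]))
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    return sum(c / n * log2((c / n) / ((left[l] / n) * (right[r] / n)))
               for (l, r), c in joint.items())

sentences = ["the cat sat on the mat", "the dog sat on the rug",
             "a cat slept on the mat"]
print(flanking_mi(sentences, "on the"))   # small positive MI on this tiny sample
```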
But is Omphalos the right challenge? What about NL?
[Figure: languages plotted by the complexity of the grammar (|P|/|N|) against the log of the number of terminals |Σ|; the plot contrasts natural languages (shallow languages, need for larger samples) with the Omphalos problems (harder to learn).]
ADIOS (Automatic DIstillation Of Structure) Solan et al. 2004 • Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words) • Motif Extraction (MEX) procedure for establishing new vertices thus progressively redefining the graph in an unsupervised fashion • Recursive Generalization • Zach Solan, David Horn, Eytan Ruppin (Tel Aviv University) & Shimon Edelman (Cornell) • http://www.tau.ac.il/~zsolan
The Model (Solan et al. 2004)
• Graph representation with words as vertices and sentences as paths (a small sketch follows).
• Example sentences: "And is that a horse?" "Is that a dog?" "Where is the dog?" "Is that a cat?"
[Figure: the corresponding graph, with BEGIN and END nodes and numbered edges marking each sentence's path.]
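Here is a small sketch of that graph representation over the four example sentences. The vertex and edge bookkeeping is my own simplification of the Solan et al. data structure.

```python
# ADIOS-style corpus graph: vertices are lexical items (plus BEGIN/END),
# each sentence is stored as a path over those vertices.
from collections import defaultdict

sentences = [
    "and is that a horse ?", "is that a dog ?",
    "where is the dog ?", "is that a cat ?",
]

vertices = set()
edges = defaultdict(list)     # (u, v) -> list of (path_id, position)
paths = []
for pid, s in enumerate(sentences):
    path = ["BEGIN"] + s.split() + ["END"]
    paths.append(path)
    vertices.update(path)
    for pos, (u, v) in enumerate(zip(path, path[1:])):
        edges[(u, v)].append((pid, pos))

# Edges shared by many paths (e.g. 'is' -> 'that' -> 'a') are the raw
# material MEX inspects when searching for significant motifs.
print(len(vertices), "vertices")
print(sorted(edges[("is", "that")]))
```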
From MEX to ADIOS (Solan et al. 2004)
• Apply MEX to a search-path consisting of a given data-path.
• On the same search-path, within a given window size, allow for the occurrence of an equivalence class, i.e. define a generalized search-path of the type e1 → e2 → … → {E} → … → ek. Apply MEX to this window (see the sketch after this list).
• Choose patterns P, including equivalence classes E, according to the MEX ranking. Add nodes.
• Repeat the above for all search-paths.
• Repeat the procedure to obtain higher-level generalizations.
• Express the structures as syntactic trees.
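The window-based generalization step can be illustrated as follows. This is a simplified stand-in, not the Solan et al. code: it only looks for windows that agree everywhere except in one fixed slot and collects the differing words into a candidate equivalence class {E}, without the MEX significance ranking; all names and parameters are mine.

```python
# Slide fixed-size windows over all paths; windows whose contexts agree
# everywhere except in one slot donate the differing words to a candidate
# equivalence class {E}.
from collections import defaultdict

def candidate_classes(paths, window=3, slot=1):
    classes = defaultdict(set)
    for path in paths:
        for i in range(len(path) - window + 1):
            w = path[i:i + window]
            key = tuple(w[:slot] + w[slot + 1:])   # context around the open slot
            classes[key].add(w[slot])
    return {k: v for k, v in classes.items() if len(v) > 1}

paths = [s.split() for s in
         ["is that a horse ?", "is that a dog ?", "is that a cat ?"]]
print(candidate_classes(paths))
# -> {('a', '?'): {'horse', 'dog', 'cat'}}  (set order may vary)
```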
[Figures: first pattern formation; higher hierarchies, with patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T); trees are read from top to bottom and from left to right; the final stage is the root pattern. CFG = context-free grammar.]
Solan et al. 2004
• The ADIOS algorithm has been evaluated on artificial grammars containing thousands of rules, on natural languages as diverse as English and Chinese, on regulatory and coding regions in DNA sequences, and on functionally relevant structures in protein data.
• The complexity of ADIOS on large NL corpora seems to be linear in the size of the corpus.
• It allows mildly context-sensitive learning.
• This is the first time an unsupervised algorithm has been shown capable of learning complex syntax and of scoring well in standard language proficiency tests!! (Training set: 300,000 sentences from CHILDES; ADIOS scored at intermediate level (58%) on the Göteborg/ESL test.)
ADIOS learning from ATIS-CFG (4592 rules), using different numbers of learners and different window lengths L
Where does ADIOS fit in?
[Figure: the same plot of grammar complexity (|P|/|N|) against the number of terminals |Σ|, now with ADIOS added alongside natural languages (shallow, need for larger samples) and the Omphalos problems (harder to learn).]
GI Research Questions
• Research Question: What is the complexity of human language?
• Research Question: Can we make a formal model of the language development of young children that allows us to understand:
  • Why is the process efficient?
  • Why is the process discontinuous?
• Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
• Research Question: Semantic learning: e.g. can we construct ontologies for specific domains from (scientific) text?
Conclusions & Further work
• We are starting to crack the code of unsupervised learning of human languages.
• ADIOS is the first algorithm capable of learning complex syntax and of scoring well in standard language proficiency tests.
• We have better statistical techniques to separate constituents from non-constituents.
• Good ideas: pseudograph representation, MEX, sliding windows.
To be done:
• Can MEX help us in DFA induction?
• Better understanding of the complexity issues. When does MEX collapse?
• Better understanding of semantic learning
• Incremental learning with background knowledge
• Use GI to learn ontologies