Language Learning Week 12 Pieter Adriaans: pietera@science.uva.nl Sophia Katrenko: katrenko@science.uva.nl
Contents Week 12
• Semantic Learning
• The Omphalos competition
• ADIOS
Problems
• Results at first sight disappointing.
• Conversion to meaningful syntactic type rarely observed.
• Types seem to be semantic rather than syntactic.
• Why?
• Hypothesis: distribution in real-life text is semantic, not syntactic.
• Semantic grammar is an intermediate compression level between term algebra and syntactic algebra.
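Characteristic sample: Semantic Learning
Let $\Sigma$ be an alphabet and $\Sigma^*$ the set of all strings over $\Sigma$.
$L(G) = S \subseteq \Sigma^*$ is the language generated by a grammar $G$.
$C_G \subseteq S$ is a characteristic sample for $G$.
[Figure: the sets $S$ (True) and $C_G$]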
Syntactic Learning: Substitution salva beneformatione
Tweety is_a bird, Tweety is_a dog, Tweety is_a horse, Tweety is_a mammal
Fido is_a bird, Fido is_a dog, Fido is_a horse, Fido is_a mammal
Ed is_a bird, Ed is_a dog, Ed is_a horse, Ed is_a mammal
[Figure: types Sentence, Name (Ed, Fido, Tweety) and Noun (bird, dog, horse, mammal)]
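Substitution salva beneformatione can be sketched as a grouping of words by the contexts in which they keep a sentence well-formed. The snippet below is a minimal illustration, not the learner used in the course; the function name `syntactic_types` and the "exactly the same context set" criterion are assumptions made for the example.

```python
from collections import defaultdict

NAMES = ["Tweety", "Fido", "Ed"]
NOUNS = ["bird", "dog", "horse", "mammal"]
# All twelve well-formed sentences listed on the slide.
SENTENCES = [f"{name} is_a {noun}" for name in NAMES for noun in NOUNS]


def syntactic_types(sentences):
    """Group words that occur in exactly the same set of contexts."""
    contexts = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            # A context is the sentence with the word replaced by a hole.
            context = tuple(words[:i]) + ("_",) + tuple(words[i + 1:])
            contexts[word].add(context)
    groups = defaultdict(list)
    for word, ctxs in contexts.items():
        groups[frozenset(ctxs)].append(word)
    return sorted(sorted(group) for group in groups.values())


types = syntactic_types(SENTENCES)
print(types)
```

Because every name combines with every noun, the three names fall into one type and the four nouns into another, which corresponds to the Name and Noun types in the figure.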
Semantic Learning: Substitution salva veritate
Tweety is_a bird, Tweety is_a dog, Tweety is_a horse, Tweety is_a mammal
Fido is_a bird, Fido is_a dog, Fido is_a horse, Fido is_a mammal
Ed is_a bird, Ed is_a dog, Ed is_a horse, Ed is_a mammal
[Figure: types True, Sentence, Name (Ed, Fido, Tweety) and Noun (bird, dog, horse, mammal)]
Semantic Learning: Substitution salva veritate
Tweety is_a bird, Tweety is_a dog, Tweety is_a horse, Tweety is_a mammal
Fido is_a bird, Fido is_a dog, Fido is_a horse, Fido is_a mammal
Ed is_a bird, Ed is_a dog, Ed is_a horse, Ed is_a mammal
Compositionality: Semantics = Intermediate Compression level
[Figure: types True and False; Ed, Fido, Tweety; Mammal, Horse, Dog, Bird; Sentence, Noun, Name]
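Substitution salva veritate restricts the same grouping to sentences that are true. The sketch below assumes a truth assignment that the slides do not spell out (Tweety is a bird, Fido a dog, Ed a horse, and dogs and horses are mammals); `TRUE_SENTENCES` and `semantic_types` are hypothetical names for the illustration.

```python
from collections import defaultdict

# ASSUMPTION: this truth assignment is not given in the slides.
TRUE_SENTENCES = {
    ("Tweety", "bird"),
    ("Fido", "dog"), ("Fido", "mammal"),
    ("Ed", "horse"), ("Ed", "mammal"),
}


def group_by_partners(index):
    """Put words with identical partner sets into one type."""
    groups = defaultdict(list)
    for word, partners in index.items():
        groups[frozenset(partners)].append(word)
    return sorted(sorted(group) for group in groups.values())


def semantic_types(true_pairs):
    """Group names and nouns by substitutability that preserves truth."""
    nouns_of = defaultdict(set)   # name -> nouns it is truly said to be
    names_of = defaultdict(set)   # noun -> names it is truly said of
    for name, noun in true_pairs:
        nouns_of[name].add(noun)
        names_of[noun].add(name)
    return group_by_partners(nouns_of), group_by_partners(names_of)


name_types, noun_types = semantic_types(TRUE_SENTENCES)
print(name_types)
print(noun_types)
```

Under this assumed truth assignment no two names and no two nouns are interchangeable, so the learned types are finer than the syntactic Name and Noun types.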
Not a bug, but a feature: semantic learning
Dictionary Type [362]: plague, leprosy
Dictionary Type [1056]: priests, Levites, porters, singers, Nethinims
Dictionary Type [978]: afraid, glad, smitten, subdued
Dictionary Type [2465]: holy, rich, weak, prudent
2004 Omphalos Competition: Starkie & van Zaanen
• Unsupervised learning of context-free grammars
• Deliberately constructed to be beyond current state of the art
• A theoretical brute-force learner that constructs all possible CFGs consistent with a certain set of positive examples $O$.
• Complexity measure for CFGs.
• There are only $2^{\left(\sum_j (2|O_j| - 2) + 1\right) + \left(\sum_j (2|O_j| - 2) + 1\right) \cdot T(O)}$ of these grammars, where $T(O)$ is the number of terminals!
2004 Omphalos Competition: Starkie & van Zaanen
Let $\Sigma$ be an alphabet and $\Sigma^*$ the set of all strings over $\Sigma$.
$L(G) = S \subseteq \Sigma^*$ is the language generated by a grammar $G$.
$C_G \subseteq S$ is a characteristic sample for $G$.
[Figure: $\Sigma^*$ (infinite), $S$ (infinite), $O$ (finite Omphalos sample), $C_G$]
$|O| < 20\,|C_G|$
Bad news for distributional analysis (Emile, ABL, Inside out)

$w \in \{a^n b^n\}$
Transitions: [a,S] push X; [a,X] push X; [b,X] pop X; [e,X] no-op; [ε,X] pop
Samples: aeb, aaebb, aaaebbb, …

$w \in \{\{a,c\}^n \{b,d\}^n\}$
Transitions: [a,S] push X; [a,X] push X; [b,X] pop X; [e,X] no-op; [ε,X] pop; [c,S] push X; [c,X] push X; [d,X] pop X
Samples: aeb, ceb, aed, ced, aaebb, aaebd, caebb, caebd, acebb, …, aaaebbb, …
Bad news for distributional analysis (Emile, ABL, Inside out)
a aeb b, a aeb d, c aeb b, c aeb d
a ceb b, a ceb d, c ceb b, c ceb d
…
a aeb b, a aaebb b
We need large corpora to make distributional analysis work. Omphalos samples are way too small!
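The sample-size problem can be made concrete by enumerating the strings of a given nesting depth and collecting the contexts in which a substring occurs. This is only a sketch: the helper names are hypothetical, and the middle marker `e` is taken from the sample strings on the slides.

```python
from itertools import product


def nested_strings(openers, closers, depth, middle="e"):
    """All strings x1..xn + middle + y1..yn with xi in openers, yi in closers."""
    return [
        "".join(left) + middle + "".join(right)
        for left in product(openers, repeat=depth)
        for right in product(closers, repeat=depth)
    ]


def contexts_of(substring, strings):
    """All (left, right) pairs such that left + substring + right is in strings."""
    found = set()
    for s in strings:
        start = s.find(substring)
        while start != -1:
            found.add((s[:start], s[start + len(substring):]))
            start = s.find(substring, start + 1)
    return found


simple = nested_strings("a", "b", 2)   # one string at depth 2
rich = nested_strings("ac", "bd", 2)   # 2**2 * 2**2 strings at depth 2

print(len(simple), len(rich))
print(sorted(contexts_of("aeb", rich)))
print(sorted(contexts_of("ceb", rich)))
```

With two opening and two closing symbols the number of strings at depth n grows as 4^n, so a learner has to see many more sentences before it observes that `aeb` and `ceb` share all their contexts.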
Omphalos won by Alexander Clark: some good ideas!
Approach:
• Exploit useful properties that randomly generated grammars are likely to have.
• Identifying constituents: measure local mutual information between the symbol before and the symbol after [Clark, 2001]. More reliable than other information-theoretic constituent boundary tests [Lamb, 1961] [Brill et al., 1990].
• Under benign distributions non-constituents will have zero mutual information crossing constituent boundaries. Structures that do not cross constituent boundaries will have non-zero mutual information.
• Analysis of cliques of strings that might be constituents (much like clusters in EMILE).
• Most hard problems in Omphalos are still open!
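The mutual information statistic mentioned above can be sketched as follows. This is an illustrative estimate from raw counts, not Clark's implementation; the function name, the boundary markers `<s>` and `</s>`, and the absence of any smoothing or significance threshold are assumptions.

```python
import math
from collections import Counter


def boundary_mutual_information(corpus, candidate):
    """MI in bits between the symbol before and the symbol after `candidate`."""
    target = list(candidate)
    k = len(target)
    pairs = Counter()
    for sentence in corpus:
        # ASSUMPTION: pad with boundary markers so edge occurrences count too.
        padded = ["<s>"] + list(sentence) + ["</s>"]
        for i in range(1, len(padded) - k):
            if padded[i:i + k] == target:
                pairs[(padded[i - 1], padded[i + k])] += 1
    total = sum(pairs.values())
    if total == 0:
        return 0.0
    before, after = Counter(), Counter()
    for (b, a), n in pairs.items():
        before[b] += n
        after[a] += n
    mi = 0.0
    for (b, a), n in pairs.items():
        # p(b, a) * log2( p(b, a) / (p(b) * p(a)) )
        mi += (n / total) * math.log2(n * total / (before[b] * after[a]))
    return mi


dependent = ["aaebb", "caebd"]
independent = ["aaebb", "aaebd", "caebb", "caebd"]
print(boundary_mutual_information(dependent, "aeb"))
print(boundary_mutual_information(independent, "aeb"))
```

In the first toy corpus the symbol before `aeb` fully determines the symbol after it, which gives one bit of mutual information; in the second the two sides vary independently and the estimate is zero.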
But, is Omphalos the right challenge? What about NL?
[Figure: log number of terminals $|\Sigma|$ against complexity of the grammar $|P|/|N|$; regions labelled Natural Languages, Shallow languages and Omphalos; directions labelled "Need for larger samples" and "Harder to learn"]
ADIOS (Automatic DIstillation Of Structure), Solan et al. 2004
• Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words)
• Motif Extraction (MEX) procedure for establishing new vertices, thus progressively redefining the graph in an unsupervised fashion
• Recursive Generalization
• Zach Solan, David Horn, Eytan Ruppin (Tel Aviv University) & Shimon Edelman (Cornell)
• http://www.tau.ac.il/~zsolan
The Model (Solan et al. 2004)
• Graph representation with words as vertices and sentences as paths.
Example sentences: And is that a horse? Is that a dog? Where is the dog? Is that a cat?
[Figure: graph with BEGIN and END nodes and word vertices (where, is, and, that, a, the, horse, dog, cat, ?); each edge is numbered along the sentence paths 101 to 104]
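The graph representation can be sketched with a plain dictionary that records, for every edge, which sentence paths traverse it. This is a minimal illustration and not the ADIOS data structure; the tokenisation, the order of the sentences and the assignment of path ids 101 to 104 are assumptions.

```python
from collections import defaultdict


def build_graph(sentences, first_path_id=101):
    """Return (vertices, edges); edges maps (u, v) to the path ids using it."""
    vertices = {"BEGIN", "END"}
    edges = defaultdict(list)
    for path_id, sentence in enumerate(sentences, start=first_path_id):
        # ASSUMPTION: lower-case tokens, with "?" split off as its own vertex.
        tokens = sentence.lower().replace("?", " ?").split()
        path = ["BEGIN"] + tokens + ["END"]
        vertices.update(tokens)
        for u, v in zip(path, path[1:]):
            edges[(u, v)].append(path_id)
    return vertices, dict(edges)


SENTENCES = [
    "Is that a cat?",
    "Is that a dog?",
    "And is that a horse?",
    "Where is the dog?",
]
vertices, edges = build_graph(SENTENCES)
print(sorted(vertices))
print(edges[("is", "that")])
```

Edges shared by several paths, such as `is -> that`, are the places where a motif extraction step could look for recurring patterns.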
From MEX to ADIOS (Solan et al. 2004)
• Apply MEX to a search-path consisting of a given data-path.
• On the same search-path, within a given window size, allow for the occurrence of an equivalence class, i.e. define a generalized search-path of the type e1 -> e2 -> … -> {E} -> … -> ek.
• Apply MEX to this window.
• Choose patterns P, including equivalence classes E, according to MEX ranking.
• Add nodes.
• Repeat the above for all search-paths.
• Repeat the procedure to obtain higher-level generalizations.
• Express structures in syntactic trees.
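The generalized search-path step can be illustrated by collecting, for every window with one open slot, the words that fill that slot somewhere in the corpus. This is a strongly simplified sketch: it does not implement the MEX ranking, and the function name, the single interior slot and the `min_size` threshold are assumptions.

```python
from collections import defaultdict


def candidate_equivalence_classes(paths, window, min_size=2):
    """Map each generalized window (one slot marked '{E}') to its slot fillers."""
    fillers = defaultdict(set)
    for path in paths:
        for start in range(len(path) - window + 1):
            segment = tuple(path[start:start + window])
            # Only interior slots, so the class is bounded by shared context.
            for slot in range(1, window - 1):
                key = segment[:slot] + ("{E}",) + segment[slot + 1:]
                fillers[key].add(segment[slot])
    return {key: words for key, words in fillers.items() if len(words) >= min_size}


PATHS = [
    ["is", "that", "a", "cat", "?"],
    ["is", "that", "a", "dog", "?"],
    ["and", "is", "that", "a", "horse", "?"],
    ["where", "is", "the", "dog", "?"],
]
classes = candidate_equivalence_classes(PATHS, window=3)
print(classes)
```

On the four example sentences the only slot with more than one filler is the one between `a` and `?`, which yields the candidate class {cat, dog, horse}.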
[Figure: First pattern formation]
[Figure: Higher hierarchies: patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T)]
Trees are to be read from top to bottom and from left to right.
[Figure: Final stage: root pattern]
CFG: context-free grammar
Solan et al. 2004
• The ADIOS algorithm has been evaluated using artificial grammars containing thousands of rules, natural languages as diverse as English and Chinese, regulatory and coding regions in DNA sequences, and functionally relevant structures in protein data.
• Complexity of ADIOS on large NL corpora seems to be linear in the size of the corpus.
• Allows mildly context-sensitive learning.
• This is the first time an unsupervised algorithm is shown capable of learning complex syntax and of scoring well in standard language proficiency tests! (Training set: 300,000 sentences from CHILDES; ADIOS scores at intermediate level (58%) in the Göteborg/ESL test.)
[Figure: ADIOS learning from ATIS-CFG (4592 rules) using different numbers of learners and different window lengths L]
Where does ADIOS fit in?
[Figure: number of terminals $|\Sigma|$ against complexity of the grammar $|P|/|N|$; regions labelled Natural Languages, ADIOS, Shallow languages and Omphalos; directions labelled "Need for larger samples" and "Harder to learn"]
GI Research Questions
• Research Question: What is the complexity of human language?
• Research Question: Can we make a formal model of language development of young children that allows us to understand:
  • Why the process is efficient?
  • Why the process is discontinuous?
• Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
• Research Question: Semantic learning: e.g. can we construct ontologies for specific domains from (scientific) text?
Conclusions & Further work
• We are starting to crack the code of unsupervised learning of human languages.
• ADIOS is the first algorithm capable of learning complex syntax and scoring well in standard language proficiency tests.
• We have better statistical techniques to separate constituents from non-constituents.
• Good ideas: pseudo graph representation, MEX, sliding windows.
To be done:
• Can MEX help us in DFA induction?
• Better understanding of the complexity issues. When does MEX collapse?
• Better understanding of Semantic Learning
• Incremental Learning with background knowledge
• Use GI to learn ontologies