Human Computer Studies 2004: a good year for Computational Grammar Induction January 2005 Pieter Adriaans Universiteit van Amsterdam pietera@science.uva.nl http://turing.wins.uva.nl/~pietera/ALS/
GI Research Questions
• Research Question: What is the complexity of human language?
• Research Question: Can we make a formal model of the language development of young children that allows us to understand:
  • Why is the process efficient?
  • Why is the process discontinuous?
• Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
• Research Question: Semantic learning: e.g. can we construct ontologies for specific domains from (scientific) text?
Chomsky Hierarchy and the complexity of Human Language
[Figure: the nested Chomsky hierarchy (Finite ⊂ Regular ⊂ Context-free ⊂ Context-sensitive ⊂ Type-0) with human languages marked within it.]
Complexity of Natural Language: Zipf distribution
[Figure: Zipf word-frequency curve with a structured high-frequency core and a heavy low-frequency tail.]
Observations
• Word frequencies in human utterances are dominated by power laws:
  • a high-frequency core
  • a heavy low-frequency tail
  • open versus closed word classes (function words)
• Natural language is open and grammar is elastic: the occurrence of new words is a natural phenomenon, so syntactic/semantic bootstrapping must play an important role in language learning.
• Bootstrapping will be important for ontology learning as well as for child language acquisition.
• A better understanding of NL distributions is necessary (see the sketch below).
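As a concrete illustration of the power-law point above (not part of the original talk), the sketch below estimates the Zipf exponent of word frequencies in a toy corpus via a log-log least-squares fit; the function name and the toy token list are my own.

```python
# A minimal sketch (illustration only): checking whether word frequencies
# follow a Zipf-like power law, frequency ~ rank^(-s).
from collections import Counter
import math

def zipf_exponent(tokens):
    """Estimate the Zipf exponent s by least squares on log(rank) vs log(freq)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return -slope   # s close to 1 is the classic Zipf regime

tokens = "the cat saw the dog and the dog saw the cat".split()
print(zipf_exponent(tokens))
```

On a real corpus one would of course fit only the high-frequency core; the heavy tail is exactly where the estimate becomes unstable.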
Learning NL from text: probabilistic versus recursion-theoretic approaches
• 1967, Gold: any super-finite class of languages (which includes the regular languages and everything above them in the Chomsky hierarchy) cannot be identified in the limit from positive data.
• 1969, Horning: probabilistic context-free grammars can be learned from positive data. Given a text T and two grammars G1 and G2, we are able to approximate max({P(G1|T), P(G2|T)}).
• ICGI, from 1990 on: the empirical approach. Just build algorithms and try them. Approximate NL from below: Finite → Regular → Context-free → Context-sensitive.
Situation < 2004
• GI seems to be hard: no identification in the limit.
• Ill-understood power laws dominate (word) frequencies in human communication.
• Machine learning algorithms have difficulties in these domains; PAC learning does not converge on them.
• Nowhere near learning natural languages.
• We were running out of ideas.
Situation < 2004: Learning Regular Languages
• Reasonable success in learning regular languages of moderate complexity (Evidence-Driven State Merging, Blue-Fringe).
• Transparent representation: Deterministic Finite Automata (DFA); see the sketch below.
• DEMO
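For readers unfamiliar with the representation, here is a minimal sketch of the DFA data structure that state-merging learners such as EDSM/Blue-Fringe operate on. It shows only the representation and the acceptance test, not the merging algorithm; the class name and the example language are mine.

```python
# Minimal sketch of a DFA: the "transparent representation" mentioned above.
class DFA:
    def __init__(self, start, accepting, transitions):
        self.start = start                  # start state
        self.accepting = set(accepting)     # accepting states
        self.transitions = transitions      # dict: (state, symbol) -> state

    def accepts(self, word):
        state = self.start
        for symbol in word:
            key = (state, symbol)
            if key not in self.transitions:
                return False                # missing transition: reject
            state = self.transitions[key]
        return state in self.accepting

# (ab)* as an example target language
dfa = DFA(start=0, accepting={0},
          transitions={(0, 'a'): 1, (1, 'b'): 0})
print(dfa.accepts("abab"), dfa.accepts("aba"))   # True False
```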
Situation < 2004: Learning Context-free Languages
• A number of approaches: learning probabilistic CFGs, the Inside-Outside algorithm, EMILE, ABL.
• No transparent representation: Pushdown Automata (PDA) are not really helpful for modelling the learning process.
• No adequate convergence on interesting real-life corpora.
• The problem of sparse data sets.
• Complexity issues are ill-understood.
EMILE: natural language allows bootstrapping
• Lewis Carroll's famous poem 'Jabberwocky' starts with:
  'Twas brillig, and the slithy toves
  Did gyre and gimble in the wabe;
  All mimsy were the borogoves,
  And the mome raths outgrabe.
EMILE: Characteristic Expressions and Contexts
• An expression of a type T is characteristic for T if it only appears with contexts of type T.
• Similarly, a context of a type T is characteristic for T if it only appears with expressions of type T.
• Let G be a grammar (context-free or otherwise) of a language L. G has context separability if each type of G has a characteristic context, and expression separability if each type of G has a characteristic expression.
• Natural languages seem to be context- and expression-separable.
• This is nothing but stating that languages can define their own concepts internally ("... is a noun", "... is a verb").
EMILE: Natural languages are shallow
• A class of languages C is shallow if for each language L it is possible to find a context- and expression-separable grammar G, and a set of sentences S inducing characteristic contexts and expressions for all the types of G, such that the size of S and the length of the sentences of S are logarithmic in the descriptive length of L (relative to C).
• This seems to hold for natural languages: large dictionaries, low thickness.
The EMILE learning algorithm
• One can prove that, using clustering techniques, shallow CFGs can be learned efficiently from positive examples drawn under m.
• General idea: every sentence is split into an expression and a surrounding context (sketched below).
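A hedged sketch of the "sentence = expression + context" decomposition: every cut of a sentence into a left part, a candidate expression, and a right part yields one (context, expression) observation for the clustering step. The function name and the "_" placeholder notation are my own; the EMILE implementation differs in detail.

```python
# Enumerate all (context, expression) splits of a sentence, the raw
# observations that EMILE-style clustering works on.
def context_expression_pairs(sentence):
    words = sentence.split()
    pairs = []
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            expression = " ".join(words[i:j])
            context = (" ".join(words[:i]) + " _ " + " ".join(words[j:])).strip()
            pairs.append((context, expression))
    return pairs

for ctx, expr in context_expression_pairs("John makes coffee"):
    print(f"{ctx!r:25} {expr!r}")
```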
EMILE 4.1 (2000): Vervoort
• Unsupervised
• Two-dimensional clustering: random search for maximal blocks in the matrix
• Incremental: thresholds for the filling degree of blocks
• Simple (but sloppy) rule induction using characteristic expressions
Clustering (2-dimensional)
• John makes coffee
• John likes coffee
• John is eating
• John makes tea
• John likes tea
• John likes eating
EMILE 4.1: Clustering sparse matrices of contexts and expressions
[Figure: sparse boolean matrix with contexts as rows and expressions as columns; filled blocks mark characteristic contexts and characteristic expressions.]
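To make the clustering idea concrete, the sketch below builds a context-by-expression table from the six sentences on the previous slide and prints which expressions each context accepts; a block of contexts sharing a set of expressions is a candidate type. This illustrates the idea only and is not the EMILE 4.1 code.

```python
# Two-dimensional clustering idea: contexts x expressions as a boolean matrix.
from collections import defaultdict

sentences = [
    "John makes coffee", "John likes coffee", "John is eating",
    "John makes tea", "John likes tea", "John likes eating",
]

matrix = defaultdict(set)   # context -> set of expressions seen in it
for s in sentences:
    words = s.split()
    context, expression = " ".join(words[:-1]) + " _", words[-1]
    matrix[context].add(expression)

for context, expressions in matrix.items():
    print(f"{context:15} {sorted(expressions)}")
# 'John makes _' and 'John likes _' share {'coffee', 'tea'}:
# a filled block, hence one candidate type for those expressions.
```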
Original grammar
• S → NP V_i ADV | NP_a VP_a | NP_a V_s that S
• NP → NP_a | NP_p
• VP_a → V_t NP | V_t NP P NP_p
• NP_a → John | Mary | the man | the child
• NP_p → the car | the city | the house | the shop
• P → with | near | in | from
• V_i → appears | is | seems | looks
• V_s → thinks | hopes | tells | says
• V_t → knows | likes | misses | sees
• ADV → large | small | ugly | beautiful
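For reference, training examples like the ones used in the next slide could be produced by sampling from the grammar above. The sketch assumes uniform random choice among the alternatives of each rule, which may differ from the actual experimental setup.

```python
# Sketch: generate random sentences from the toy grammar above.
import random

GRAMMAR = {
    "S":    [["NP", "V_i", "ADV"], ["NP_a", "VP_a"], ["NP_a", "V_s", "that", "S"]],
    "NP":   [["NP_a"], ["NP_p"]],
    "VP_a": [["V_t", "NP"], ["V_t", "NP", "P", "NP_p"]],
    "NP_a": [["John"], ["Mary"], ["the man"], ["the child"]],
    "NP_p": [["the car"], ["the city"], ["the house"], ["the shop"]],
    "P":    [["with"], ["near"], ["in"], ["from"]],
    "V_i":  [["appears"], ["is"], ["seems"], ["looks"]],
    "V_s":  [["thinks"], ["hopes"], ["tells"], ["says"]],
    "V_t":  [["knows"], ["likes"], ["misses"], ["sees"]],
    "ADV":  [["large"], ["small"], ["ugly"], ["beautiful"]],
}

def generate(symbol="S"):
    if symbol not in GRAMMAR:                  # terminal
        return [symbol]
    production = random.choice(GRAMMAR[symbol])
    return [w for part in production for w in generate(part)]

for _ in range(3):
    print(" ".join(generate()))
```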
Learned grammar after 100,000 examples
• [0] → [17] [6]
• [0] → [17] [22] [17] [6]
• [0] → [17] [22] [17] [22] [17] [22] [17] [6]
• [6] → misses [17] | likes [17] | knows [17] | sees [17]
• [6] → [22] [17] [6]
• [6] → appears [34] | looks [34] | is [34] | seems [34]
• [6] → [6] near [17] | [6] from [17] | [6] in [17] | [6] [6] with [17]
• [17] → the child | Mary | the city | the man | John | the car | the house | the shop
• [22] → tells that | thinks that | hopes that | says that
• [34] → small | beautiful | large | ugly
Bible books
• King James version
• 31,102 verses in 82,935 lines
• 4.8 MB of English text
• 001:001 In the beginning God created the heaven and the earth.
• 66 experiments with increasing sample size
• Initially the Book of Genesis, the Book of Exodus, …
• Full run: 40 minutes, 500 MB on an Ultra-2 SPARC
GI on the Bible
• [0] → Thou shalt not [582]
• [0] → Neither shalt thou [582]
• [582] → eat it
• [582] → kill .
• [582] → commit adultery .
• [582] → steal .
• [582] → bear false witness against thy neighbour .
• [582] → abhor an Edomite
Knowledge base in the Bible
• Dictionary Type [76]: Esau, Isaac, Abraham, Rachel, Leah, Levi, Judah, Naphtali, Asher, Benjamin, Eliphaz, Reuel, Anah, Shobal, Ezer, Dishan, Pharez, Manasseh, Gershon, Kohath, Merari, Aaron, Amram, Mushi, Shimei, Mahli, Joel, Shemaiah, Shem, Ham, Salma, Laadan, Zophah, Elpaal, Jehieli
• Dictionary Type [362]: plague, leprosy
• Dictionary Type [414]: Simeon, Judah, Dan, Naphtali, Gad, Asher, Issachar, Zebulun, Benjamin, Gershom
• Dictionary Type [812]: two, three, four
• Dictionary Type [1056]: priests, Levites, porters, singers, Nethinims
• Dictionary Type [978]: afraid, glad, smitten, subdued
• Dictionary Type [2465]: holy, rich, weak, prudent
• Dictionary Type [3086]: Egypt, Moab, Dumah, Tyre, Damascus
• Dictionary Type [4082]: heaven, Jerusalem
Evaluation
• + Works efficiently on large corpora
• + Learns (partial) grammars
• + Unsupervised
• - EMILE 4.1 needs a lot of input
• - Convergence to meaningful syntactic types is rarely observed
• - Types seem to be semantic rather than syntactic
• Why? Hypothesis: the distribution in real-life text is semantic, not syntactic. But, most of all: sparse data!!!
2004 Omphalos Competition: Starkie & van Zaanen
• Unsupervised learning of context-free grammars
• Deliberately constructed to be beyond the current state of the art
• A theoretical brute-force learner constructs all possible CFGs consistent with a given set of positive examples O.
• A complexity measure for CFGs
• There are only 2^(i(2|Oj|−2)+1) + (i(2|Oj|−2)+1) · T(O) of these grammars, where T(O) is the number of terminals!!
2004 Omphalos Competition: Starkie & van Zaanen
• Let Σ be an alphabet and Σ* the set of all strings over Σ.
• L(G) = S is the language generated by a grammar G.
• C_G ⊆ S is a characteristic sample for G (S is infinite).
• O is the finite Omphalos sample; it is far smaller than the characteristic sample C_G.
Bad news for distributional analysis (EMILE, ABL, Inside-Outside)
• w ∈ {a^n b^n}: sample strings aeb, aaebb, aaaebbb, …
  PDA transitions: [a,S] push X; [a,X] push X; [b,X] pop X; [e,X] no-op; [ε,X] pop
• w ∈ {{a,c}^n {b,d}^n}: sample strings aeb, ceb, aed, ced, aaebb, aaebd, caebb, caebd, acebb, …, aaaebbb, …
  PDA transitions: additionally [c,S] push X; [c,X] push X; [d,X] pop X
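The sample strings above carry a middle marker e, so the first language effectively reads as {a^n e b^n}. The sketch below simulates the listed transitions as a simple pushdown recognizer; the extra flag that remembers whether e has been read, and the end-of-input acceptance (pop the initial S, accept on empty stack), are my reading of the diagram, not something stated on the slide.

```python
# Sketch of a pushdown recognizer for { a^n e b^n : n >= 1 },
# following the transitions listed above.
def accepts(word):
    stack = ["S"]
    seen_e = False
    for symbol in word:
        top = stack[-1] if stack else None
        if symbol == "a" and not seen_e and top in ("S", "X"):
            stack.append("X")            # [a,S] / [a,X]: push X
        elif symbol == "e" and not seen_e and top == "X":
            seen_e = True                # [e,X]: no-op (remember the marker)
        elif symbol == "b" and seen_e and top == "X":
            stack.pop()                  # [b,X]: pop X
        else:
            return False
    if seen_e and stack == ["S"]:
        stack.pop()                      # end of input: pop S, accept on empty stack
    return not stack

for w in ["aeb", "aaebb", "aaaebbb", "aebb", "ab"]:
    print(w, accepts(w))                 # True True True False False
```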
Bad news for distributional analysis (EMILE, ABL, Inside-Outside)
• Context/expression splits: a aeb b, a aeb d, c aeb b, c aeb d, a ceb b, a ceb d, c ceb b, c ceb d, …, a aeb b, a aaebb b
• We need large corpora to make distributional analysis work. The Omphalos samples are way too small!!
Omphalos won by Alexander Clark: some good ideas!
Approach:
• Exploit useful properties that randomly generated grammars are likely to have.
• Identifying constituents: measure the local mutual information between the symbol before and the symbol after a candidate [Clark, 2001]. More reliable than other information-theoretic constituent-boundary tests [Lamb, 1961] [Brill et al., 1990].
• Under benign distributions, substrings that cross constituent boundaries will have zero mutual information, while structures that do not cross constituent boundaries will have non-zero mutual information (see the sketch below).
• Analysis of cliques of strings that might be constituents (much like the clusters in EMILE).
• Most hard problems in Omphalos are still open!!
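A hedged sketch of the flanking-symbol mutual-information test described above: for a candidate string, collect the symbol immediately before and the symbol immediately after each occurrence and estimate the mutual information between those two positions. Smoothing and the significance test used in Clark's actual system are omitted; the function and variable names are mine.

```python
# Estimate MI between the left and right flanking symbols of a candidate.
# High MI suggests a constituent; MI near zero suggests the candidate
# crosses a constituent boundary.
from collections import Counter
from math import log2

def flanking_mi(sentences, candidate):
    cand = candidate.split()
    pairs = []
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        for i in range(1, len(words) - len(cand)):
            if words[i:i + len(cand)] == cand:
                pairs.append((words[i - 1], words[i + len(cand)]))
    if not pairs:
        return 0.0
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(l for l, _ in pairs)
    right = Counter(r for _, r in pairs)
    return sum(c / n * log2((c / n) / ((left[l] / n) * (right[r] / n)))
               for (l, r), c in joint.items())

sentences = ["the cat sat on the mat", "the dog sat on the rug",
             "a cat slept on the mat"]
print(flanking_mi(sentences, "on the"))   # small positive MI on this tiny sample
```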
But is Omphalos the right challenge? What about NL?
[Figure: languages plotted by the complexity of the grammar (|P|/|N|) against the log of the number of terminals |Σ|; the plot contrasts natural languages (shallow languages, need for larger samples) with the Omphalos problems (harder to learn).]
ADIOS (Automatic DIstillation Of Structure) Solan et al. 2004 • Representation of a corpus (of sentences) as paths over a graph whose vertices are lexical elements (words) • Motif Extraction (MEX) procedure for establishing new vertices thus progressively redefining the graph in an unsupervised fashion • Recursive Generalization • Zach Solan, David Horn, Eytan Ruppin (Tel Aviv University) & Shimon Edelman (Cornell) • http://www.tau.ac.il/~zsolan
The Model (Solan et al. 2004)
• Graph representation with words as vertices and sentences as paths (a small sketch follows).
• Example sentences: "And is that a horse?" "Is that a dog?" "Where is the dog?" "Is that a cat?"
[Figure: the corresponding graph, with BEGIN and END nodes and numbered edges marking each sentence's path.]
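Here is a small sketch of that graph representation over the four example sentences. The vertex and edge bookkeeping is my own simplification of the Solan et al. data structure.

```python
# ADIOS-style corpus graph: vertices are lexical items (plus BEGIN/END),
# each sentence is stored as a path over those vertices.
from collections import defaultdict

sentences = [
    "and is that a horse ?", "is that a dog ?",
    "where is the dog ?", "is that a cat ?",
]

vertices = set()
edges = defaultdict(list)     # (u, v) -> list of (path_id, position)
paths = []
for pid, s in enumerate(sentences):
    path = ["BEGIN"] + s.split() + ["END"]
    paths.append(path)
    vertices.update(path)
    for pos, (u, v) in enumerate(zip(path, path[1:])):
        edges[(u, v)].append((pid, pos))

# Edges shared by many paths (e.g. 'is' -> 'that' -> 'a') are the raw
# material MEX inspects when searching for significant motifs.
print(len(vertices), "vertices")
print(sorted(edges[("is", "that")]))
```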
From MEX to ADIOS (Solan et al. 2004)
• Apply MEX to a search-path consisting of a given data-path.
• On the same search-path, within a given window size, allow for the occurrence of an equivalence class, i.e. define a generalized search-path of the type e1 → e2 → … → {E} → … → ek. Apply MEX to this window (see the sketch after this list).
• Choose patterns P, including equivalence classes E, according to the MEX ranking. Add nodes.
• Repeat the above for all search-paths.
• Repeat the procedure to obtain higher-level generalizations.
• Express the structures as syntactic trees.
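The window-based generalization step can be illustrated as follows. This is a simplified stand-in, not the Solan et al. code: it only looks for windows that agree everywhere except in one fixed slot and collects the differing words into a candidate equivalence class {E}, without the MEX significance ranking; all names and parameters are mine.

```python
# Slide fixed-size windows over all paths; windows whose contexts agree
# everywhere except in one slot donate the differing words to a candidate
# equivalence class {E}.
from collections import defaultdict

def candidate_classes(paths, window=3, slot=1):
    classes = defaultdict(set)
    for path in paths:
        for i in range(len(path) - window + 1):
            w = path[i:i + window]
            key = tuple(w[:slot] + w[slot + 1:])   # context around the open slot
            classes[key].add(w[slot])
    return {k: v for k, v in classes.items() if len(v) > 1}

paths = [s.split() for s in
         ["is that a horse ?", "is that a dog ?", "is that a cat ?"]]
print(candidate_classes(paths))
# -> {('a', '?'): {'horse', 'dog', 'cat'}}  (set order may vary)
```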
[Figures: first pattern formation; higher hierarchies, with patterns (P) constructed of other Ps, equivalence classes (E) and terminals (T); trees are read from top to bottom and from left to right; the final stage is the root pattern. CFG = context-free grammar.]
Solan et al. 2004
• The ADIOS algorithm has been evaluated on artificial grammars containing thousands of rules, on natural languages as diverse as English and Chinese, on regulatory and coding regions in DNA sequences, and on functionally relevant structures in protein data.
• The complexity of ADIOS on large NL corpora seems to be linear in the size of the corpus.
• It allows mildly context-sensitive learning.
• This is the first time an unsupervised algorithm has been shown capable of learning complex syntax and of scoring well in standard language proficiency tests!! (Training set: 300,000 sentences from CHILDES; ADIOS scored at intermediate level (58%) on the Göteborg/ESL test.)
ADIOS learning from ATIS-CFG (4592 rules), using different numbers of learners and different window lengths L
Where does ADIOS fit in?
[Figure: the same plot of grammar complexity (|P|/|N|) against the number of terminals |Σ|, now with ADIOS added alongside natural languages (shallow, need for larger samples) and the Omphalos problems (harder to learn).]
GI Research Questions
• Research Question: What is the complexity of human language?
• Research Question: Can we make a formal model of the language development of young children that allows us to understand:
  • Why is the process efficient?
  • Why is the process discontinuous?
• Underlying Research Question: Can we learn natural language efficiently from text? How much text is needed? How much processing is needed?
• Research Question: Semantic learning: e.g. can we construct ontologies for specific domains from (scientific) text?
Conclusions & Further work
• We are starting to crack the code of unsupervised learning of human languages.
• ADIOS is the first algorithm capable of learning complex syntax and of scoring well in standard language proficiency tests.
• We have better statistical techniques to separate constituents from non-constituents.
• Good ideas: pseudograph representation, MEX, sliding windows.
To be done:
• Can MEX help us in DFA induction?
• Better understanding of the complexity issues. When does MEX collapse?
• Better understanding of semantic learning
• Incremental learning with background knowledge
• Use GI to learn ontologies