Grammar Induction So what did we have?
[Figure: the word graph – vertices are words (BEGIN, is, that, a, the, and, where, dog, cat, horse, END) and each sentence is a path through them.]
The Model: Graph representation with words as vertices and sentences as paths.
And is that a horse? Is that a dog? Where is the dog? Is that a cat?
Detecting significant patterns • Identifying patterns becomes easier on a graph • Sub-paths are automatically aligned
Pattern significance • Say we found a potential pattern-edge from nodes 1 to n. Define • m – the number of paths from 1 to n • r – the number of paths from 1 to n+1 • Because it’s a pattern edge, we know that r/m < η, where η is the predetermined decrease-ratio threshold • Let’s suppose that the true probability of n+1 given 1 through n is p • r/m is our best estimate of p, but just an estimate • What are the odds of getting r out of m but still having p ≥ η?
Pattern significance • Assume the worst case, p = η • The odds of getting a result of r out of m or better are then given by the binomial tail P = Σ_{j=0}^{r} C(m, j) · η^j · (1 − η)^(m−j) • If this is smaller than a predetermined α, we say the pattern-edge candidate is significant
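A minimal sketch of this binomial significance test, assuming the standard tail sum Σ_{j=0}^{r} C(m, j) η^j (1 − η)^(m−j) with decrease-ratio threshold η; class and method names here are our own, not from the ADIOS code:

```java
// Sketch of the pattern-edge significance test (our names, not ADIOS').
public class PatternSignificance {

    // Probability of seeing r or fewer continuations out of m paths,
    // if the true continuation probability were eta.
    static double binomialTail(int r, int m, double eta) {
        double p = 0.0;
        for (int j = 0; j <= r; j++) {
            p += binomialCoefficient(m, j) * Math.pow(eta, j) * Math.pow(1 - eta, m - j);
        }
        return p;
    }

    // C(n, k) computed as a running double product to avoid overflow.
    static double binomialCoefficient(int n, int k) {
        double c = 1.0;
        for (int i = 0; i < k; i++) c = c * (n - i) / (i + 1);
        return c;
    }

    // Candidate is significant if the tail probability falls below alpha.
    static boolean isSignificant(int r, int m, double eta, double alpha) {
        return binomialTail(r, m, eta) < alpha;
    }

    public static void main(String[] args) {
        // e.g. only 2 of 40 paths continue past the candidate boundary, eta = 0.65
        System.out.println(isSignificant(2, 40, 0.65, 0.01));
    }
}
```

With r much smaller than η·m the tail probability collapses quickly, which is why even modest path counts suffice to declare a boundary significant.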
Rewiring the graph Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.
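The merge step can be sketched on a single path as follows – a simplification in the spirit of the implementation's Squeeze, with paths reduced to label sequences and all names our own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: every occurrence of the subsumed sub-path is collapsed into
// a single new pattern vertex. Names are ours, not from the ADIOS code.
public class Rewire {

    static List<String> squeeze(List<String> path, List<String> sub, String patternName) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < path.size()) {
            if (i + sub.size() <= path.size()
                    && path.subList(i, i + sub.size()).equals(sub)) {
                out.add(patternName);   // collapse the matched sub-path into one vertex
                i += sub.size();
            } else {
                out.add(path.get(i));
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> path = Arrays.asList("is", "that", "a", "dog", "?");
        List<String> sub  = Arrays.asList("that", "a", "dog");
        System.out.println(squeeze(path, sub, "P1"));  // [is, P1, ?]
    }
}
```

After every path is squeezed this way, the new vertex P1 itself becomes a node that later, larger patterns can subsume – which is how the hierarchy forms.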
Evaluating performance • Define • Recall – the probability of ADIOS recognizing an unseen grammatical sentence • Precision – the proportion of grammatical ADIOS productions • Recall can be assessed by leaving out some of the training corpus • Precision is trickier • Unless we’re learning a known CFG
Determining L (the context-window width) • Involves a tradeoff • A larger L will demand more context sensitivity in the inference • Which will hamper generalization • A smaller L will detect more patterns • But many might be spurious
An ADIOS drawback • ADIOS is inherently a heuristic and greedy algorithm • Once a pattern is created it remains forever – errors compound • Sentence ordering affects the outcome • Running ADIOS with different orderings gives patterns that ‘cover’ different parts of the grammar
An ad-hoc solution • Train multiple learners on the corpus • Each on a different sentence ordering • Create a ‘forest’ of learners • To create a new sentence • Pick one learner at random • Use it to produce the sentence • To check the grammaticality of a given sentence • If any learner accepts the sentence, declare it grammatical
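The forest scheme can be sketched as follows; a trained learner is abstracted here to a Predicate&lt;String&gt; (accept/reject), which is our simplification, not the actual ADIOS interface:

```java
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

// Sketch of the 'forest' of learners: each was trained on a different
// sentence ordering; a sentence counts as grammatical if ANY learner accepts.
public class LearnerForest {

    private final List<Predicate<String>> learners;
    private final Random rng = new Random();

    LearnerForest(List<Predicate<String>> learners) {
        this.learners = learners;
    }

    // Grammaticality check: accept if at least one learner accepts.
    boolean isGrammatical(String sentence) {
        return learners.stream().anyMatch(l -> l.test(sentence));
    }

    // Generation: pick one learner at random and let it produce the sentence.
    Predicate<String> pickLearnerForGeneration() {
        return learners.get(rng.nextInt(learners.size()));
    }

    public static void main(String[] args) {
        LearnerForest forest = new LearnerForest(java.util.Arrays.asList(
                s -> s.contains("dog"), s -> s.contains("cat")));  // toy stand-in learners
        System.out.println(forest.isGrammatical("is that a dog ?"));  // true
    }
}
```

Note the asymmetry this design creates: union-of-learners boosts recall (more sentences accepted) but can only lower precision, since any single learner's overgeneralizations survive.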
The ATIS experiments • ATIS-NL is a 13,043 sentence corpus of natural language • Transcribed phone calls to an airline reservation service • ADIOS was trained on 12,700 sentences of ATIS-NL • The remaining 343 sentences were used to assess recall • Precision was determined with the help of 8 graduate students from Cornell University
The ATIS experiments • ADIOS’ performance scores (40 learners) – • Recall – 40% • Precision – 70% • For comparison, ATIS-CFG reached – • Recall – 45% • Precision - <1%(!)
Meta-analysis of ADIOS results • Define a pattern spectrum as the histogram of pattern types for an individual learner • A pattern type is determined by its contents • E.g. TT, TET, EE, PE… • A single ADIOS learner was trained on each of 6 translations of the Bible
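Counting such type strings (T = terminal word, E = equivalence class, P = nested pattern) is a plain histogram; a minimal sketch with our own names, not the ADIOS code:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: a pattern's type is the string of its constituents' kinds
// (T = terminal, E = equivalence class, P = nested pattern); the spectrum
// is the histogram of these type strings over one learner's patterns.
public class PatternSpectrum {

    static Map<String, Integer> spectrum(List<String> patternTypes) {
        Map<String, Integer> hist = new LinkedHashMap<>();  // keeps first-seen order
        for (String t : patternTypes) hist.merge(t, 1, Integer::sum);
        return hist;
    }

    public static void main(String[] args) {
        List<String> types = Arrays.asList("TT", "TET", "TT", "EE", "PE", "TET");
        System.out.println(spectrum(types));  // {TT=2, TET=2, EE=1, PE=1}
    }
}
```

Comparing such histograms across learners (or across the six Bible translations) is what makes the spectra useful as a compact fingerprint of what a learner extracted.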
Our experience • ADIOS does nicely on • ATIS-N • CHILDES • Artificial CFGs • Fails miserably on almost anything else • The Wall Street Journal • Children’s literature • The Bible
Results • CHILDES • Very high recall + precision • The ESL test • ATIS-N • Up to 70% recall (with 700 learners) • Superior language model • Children’s lit • Very few patterns are detected
Some example sentences • CHILDES • baby go ing to go up the ladder ? • the dog won 't sit in the chaise lounge . • take the lady for a ride • ATIS-N • i would like one coach reservation for may ninth from pittsburgh to atlanta leaving pittsburgh before ten o'clock in the morning • where is the stopover of american airlines flight five four five nine • what are the flights from boston to washington on october fifteenth nineteen ninety one
Some example sentences • Children’s lit • The Tin Woodman and the Scarecrow didn ' t mind the dark at all , but Woot the Wanderer felt worried to be left in this strange place in this strange manner , without being able to see any danger that might threaten . • I know that some of you have been waiting for this story of the Tin Woodman , because many of my correspondents have asked me , time and again what ever became of the " pretty Munchkin girl " whom Nick Chopper was engaged to marry before the Wicked Witch enchanted his axe and he traded his flesh for tin .
Possible causes for failure I • Sentence complexity and structural diversity • CHILDES and ATIS-N have very few sentence ‘types’ • Most of which are simple, single-clause sentences • Children’s lit has many complex sentences with multiple clauses
Types of complex sentences • Complementary clauses • Peter promised that he would come • Sue wants Peter to leave • Relative clauses • Sally bought the bike that was on sale • Is that the driver causing the accidents? • Adverbial clauses • He arrived when Mary was just about to leave • She left the door open to hear the baby • Coordinate clauses • He tried hard, but he failed
That example again I know that some of you have been waiting for this story of the Tin Woodman , because many of my correspondents have asked me , time and again what ever became of the " pretty Munchkin girl " whom Nick Chopper was engaged to marry before the Wicked Witch enchanted his axe and he traded his flesh for tin .
Possible causes for failure • Sentence complexity and structural diversity • CHILDES and ATIS-N have very few sentence ‘types’ • Most of which are simple, single-clause sentences • Children’s lit has many complex sentences with multiple clauses • The music lesson
Possible remedies • How do children do it? • Incremental learning • On the importance of starting small • How might we mimic that? • Sorting sentences according to complexity • Starting out with a simpler corpus • The problem of the growing lexicon
Generalizing patterns • New sentence: I like the cow • P1: I like the _E1 • Before: _E1 = {dog, cat, horse} • After: _E1 = {dog, cat, horse, cow}
May cause overgeneralization • New sentence: I like the finer things in life • P1: I like the _E1 • Before: _E1 = {dog, cat, horse} • After: _E1 = {dog, cat, horse, finer}
Allowing gaps • New sentence: I like the red dog • P1: I like the _E1, _E1 = {dog, cat, horse} • New pattern P2: I like the red _E1, _E1 = {dog, cat, horse}
Another approach • Two-phase learning • Split complex sentences into simple clauses • Learn simple clauses • Combine results back to complex sentences and resume learning • Sidesteps the problem of the growing lexicon • Introduces the problem of identifying clause boundaries
That example again I know that some of you have been waiting for this story of the Tin Woodman , because many of my correspondents have asked me , time and again what ever became of the " pretty Munchkin girl " whom Nick Chopper was engaged to marry before the Wicked Witch enchanted his axe and he traded his flesh for tin .
Possible causes for failure II • Sentence complexity and structural diversity • Lexicon size vs. #sentences • Large lexicon might curtail alignments necessary for generalization
Possible remedies • How do children do it? • Have access to semantic information • Which may be used for alignment • How can we mimic it? • Introducing pre-existing ECs • WordNet • Distributional Clustering • Semantic tagging?
An aside - bootstrapping • Used for very small corpora • Iteratively do – • Train a set of learners on the current corpus • Generate sentences • Replace corpus with generated sentences • Problematic for large corpora • Must be performed by transforming the existing sentences
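The bootstrapping loop can be sketched as follows, with a learner abstracted to a function from a corpus to a sentence generator; all names and signatures here are our own:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.function.Supplier;

// Sketch of the bootstrapping loop for very small corpora: train a set of
// learners, generate sentences, replace the corpus, repeat.
public class Bootstrap {

    static List<String> bootstrap(List<String> corpus,
                                  Function<List<String>, Supplier<String>> train,
                                  int learners, int rounds) {
        for (int round = 0; round < rounds; round++) {
            List<String> generated = new ArrayList<>();
            for (int i = 0; i < learners; i++) {
                Supplier<String> gen = train.apply(corpus);      // train on current corpus
                for (int s = 0; s < corpus.size() / learners + 1; s++)
                    generated.add(gen.get());                    // generate new sentences
            }
            corpus = generated;                                  // replace the corpus
        }
        return corpus;
    }

    public static void main(String[] args) {
        List<String> out = bootstrap(Arrays.asList("w", "x", "y", "z"),
                corpus -> () -> "a generated sentence", 2, 1);
        System.out.println(out.size()); // 6
    }
}
```

The scaling problem the slide mentions is visible here: each round regenerates a corpus-sized set of sentences, which is why large corpora instead require transforming the existing sentences in place.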
A little on Java classes • Similar to struct in C • Also allow the definition of class-specific functions • Data members may be • Private – only accessible to class functions • Public – accessible to everyone • Protected – like private, for most of our purposes
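A minimal illustration of these access levels (the Account class is our own example, not part of the ADIOS code):

```java
// Illustration of the three access levels mentioned above.
public class Account {

    private double balance;     // private: only Account's own methods can touch this
    public String owner;        // public: accessible to everyone
    protected String branch;    // protected: like private for most purposes,
                                // but also visible to subclasses (and same package)

    public Account(String owner, double opening) {
        this.owner = owner;
        this.balance = opening;
    }

    // A class-specific function exposing the private data member read-only.
    public double getBalance() {
        return balance;
    }

    public static void main(String[] args) {
        Account acct = new Account("Ada", 10.0);
        System.out.println(acct.owner + ": " + acct.getBalance());  // Ada: 10.0
    }
}
```

So unlike a C struct, where every field is open, the class decides per member who may read or write it.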
The code • Consists of three packages • Com.ADIOS.Model – contains classes defining the graph (graph.java, node.java, edge.java, etc.) • Com.ADIOS.Algorithm – the ‘brains’ of the implementation (most importantly contains MarkovMatrix.java and Trainer.java) • Com.ADIOS.Helpers – various helper classes
The model • Node • EquivalenceClass • Pattern • Edge • Path • Graph
The algorithm • Trainer • MarkovMatrix • also finds new equivalence classes • Generator • calculates recall and generates new sentences
The main package • Main • Processes command line arguments (context window width, corpus file name, etc.) • Finals • A repository of constants used throughout the code
The Model – Node.java • Data members • label, inEdges, outEdges • Nontrivial functions • getOutEdges(Vector inEdges) • Returns the edges going out of this node that come from inEdges • getInEdges(Vector outEdges) • Same, only in the other direction
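A sketch of how getOutEdges(Vector inEdges) might behave: an out-edge is kept only if the path it lies on entered this node through one of the given in-edges. Edge here is a stripped-down stand-in for the real Edge.java, recording only a path id:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of Node's edge bookkeeping; Edge is simplified to a path id.
public class Node {

    static class Edge {
        final int pathId;                 // which sentence-path this edge lies on
        Edge(int pathId) { this.pathId = pathId; }
    }

    final String label;
    final List<Edge> inEdges = new ArrayList<>();
    final List<Edge> outEdges = new ArrayList<>();

    Node(String label) { this.label = label; }

    // Out-edges of this node restricted to paths that arrive via 'in'.
    List<Edge> getOutEdges(List<Edge> in) {
        List<Edge> result = new ArrayList<>();
        for (Edge out : outEdges)
            for (Edge e : in)
                if (e.pathId == out.pathId) { result.add(out); break; }
        return result;
    }

    public static void main(String[] args) {
        Node dog = new Node("dog");
        dog.outEdges.add(new Edge(2));
        dog.outEdges.add(new Edge(3));
        System.out.println(
            dog.getOutEdges(java.util.Arrays.asList(new Edge(1), new Edge(2))).size()); // 1
    }
}
```

This path-restricted lookup is what lets the algorithm count, for a candidate pattern, how many sentence-paths continue through the next node versus how many diverge.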
The model – EquivalenceClass.java • Inherits from Node • Additional data members – • Nodes • Nontrivial functions – • getOutEdges(), getOutEdges(Vector inEdges) • Same as in Node, only sums for all constituent nodes
The model – Pattern.java • Inherits from Node • Additional data members • Id, path (the pattern specification)
The model – Path.java • Data members – • Id, nodes • Nontrivial functions – • Init(StringTokenizer st) – inits the path according to a line of text • Squeeze(Pattern p, int, int) – finds the instances of p in the path and replaces them by the single node p • Does not rewire the graph!
The model – Edge.java • Data members – • fromNode, toNode • prevEdge, nextEdge • path • No nontrivial functions
The model – Graph.java • Main data members – • nodes, edges, paths, equivalenceClasses, patterns • Nontrivial functions – • addPattern(Pattern p) – rewires the graph • Print functions – print various data to files
The algorithm – MarkovMatrix.java • Main data members – • path, matrix, pathsCountMatrix • winSize, winIndex, wildcardIndex • ec • Nontrivial functions – • findWildcardCandidate() – generates the new equivalence class in the wildcard position • initMarkovMatrix() – calculates the matrix