Grammar Induction: An ADIOS Review
ADIOS in outline • Composed of three main elements • A representational data structure • A segmentation criterion (MEX) • A generalization ability • We will consider each of these in turn
The Model: Graph representation with words as vertices and sentences as paths. [Figure: pseudograph with BEGIN and END nodes, one vertex per word, and numbered edges tracing each sentence path.] Example sentences: And is that a horse? Is that a dog? Where is the dog? Is that a cat?
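A minimal sketch of this representation, assuming lowercase whitespace tokenization with "?" treated as its own word; `build_graph` and its return shape are illustrative choices, not part of the ADIOS code.

```python
from collections import defaultdict

BEGIN, END = "BEGIN", "END"

def build_graph(sentences):
    """Return (paths, edges).

    paths: one list of vertices per sentence, wrapped in BEGIN/END.
    edges: maps (u, v) to the indices of the paths that traverse that edge,
           so parallel edges of the pseudograph stay distinguishable.
    """
    paths = []
    edges = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        # Assumption: split off "?" so it becomes a vertex of its own.
        words = sentence.lower().replace("?", " ?").split()
        path = [BEGIN] + words + [END]
        paths.append(path)
        for u, v in zip(path, path[1:]):
            edges[(u, v)].append(idx)
    return paths, dict(edges)

corpus = [
    "And is that a horse?",
    "Is that a dog?",
    "Where is the dog?",
    "Is that a cat?",
]
paths, edges = build_graph(corpus)
```

Sentences that share a sub-path share edges between the same vertices, which is why sub-paths end up aligned without any extra work.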
Detecting significant patterns • Identifying patterns becomes easier on a graph • Sub-paths are automatically aligned
Rewiring the graph Once a pattern is identified as significant, the sub-paths it subsumes are merged into a new vertex and the graph is rewired accordingly. Repeating this process leads to the formation of complex, hierarchically structured patterns.
The Markov Matrix • The top right triangle defines the PL probabilities, bottom left triangle the PR probabilities • Matrix is path-dependent
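A rough sketch of how the right-moving probabilities along one search path could be estimated, assuming each entry is a ratio of sub-path counts (the same r/m estimate used in the pattern significance slide). The function names are hypothetical, and the left-moving case would mirror this by extending the sub-path leftward.

```python
def count_subpath(paths, sub):
    """Count occurrences of the contiguous sequence `sub` over all paths."""
    n = len(sub)
    return sum(
        1
        for path in paths
        for i in range(len(path) - n + 1)
        if path[i:i + n] == sub
    )

def right_probabilities(paths, search_path):
    """For k = 2..len(search_path), estimate the probability of node k
    given nodes 1..k-1 as count(e1..ek) / count(e1..e(k-1))."""
    probs = []
    for k in range(2, len(search_path) + 1):
        m = count_subpath(paths, search_path[:k - 1])
        r = count_subpath(paths, search_path[:k])
        probs.append(r / m if m else 0.0)
    return probs
```

Because the counts are taken relative to the current search path, a different search path gives a different set of values, which is what makes the matrix path-dependent.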
Pattern significance • Say we found a potential pattern-edge from nodes 1 to n. Define • m - the number of paths from 1 to n • r – the number of paths from 1 to n+1 • Because it’s a pattern edge, we know that • Let’s suppose that the true probability for n+1 given 1 through n is • r/m is our best estimate, but just an estimate • What are the odds of getting r and m but still have ?
Pattern significance • Assume • The odds of getting result r and m or better are then given by • If this is smaller than a predetermined α, we say the pattern-edge candidate is significant
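One plausible reading of this test, sketched under stated assumptions: the hypothesized true probability is some value `p0`, and "r and m or better" means observing r or fewer continuations out of m paths, so the quantity of interest is a binomial lower tail. Both `p0` and the choice of tail are assumptions made for illustration, not taken from the slides.

```python
from math import comb

def binomial_lower_tail(r, m, p0):
    """P(X <= r) for X ~ Binomial(m, p0)."""
    return sum(comb(m, k) * p0**k * (1 - p0) ** (m - k) for k in range(r + 1))

def is_significant(r, m, p0, alpha=0.01):
    """True if seeing r or fewer continuations out of m would be unlikely
    (probability below alpha) were the true probability really p0."""
    return binomial_lower_tail(r, m, p0) < alpha
```

With m = 20 paths reaching node n, only r = 1 continuing to n+1, and an assumed p0 of 0.8, the tail probability is tiny and the candidate passes; with r = 16 it does not.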
The algorithm so far • Initialization – load data into pseudograph • Until no more patterns are found do • For each path, detect all sub-paths that satisfy the MEX criterion • Pick the best pattern, add it to the graph and rewire the paths
How to choose patterns • Obviously, the more significant the pattern the better • It turns out that choosing longer patterns first helps when segmenting text • This lowers the probability of accidentally linking words • It also turns out that gradually increasing alpha helps
Syntagmatic and Paradigmatic relations • Words can take part in two kinds of relations with other words: • Syntagmatic relations – indicating the words appear together in some contexts • Paradigmatic relations – indicating the words can replace one another in a given context • Syntagmatic relations are discovered by MEX • Candidates for paradigmatic relations are established during a preprocessing step for each search path
Generalization: defining an equivalence class • Corpus sentences: • show me flights from philadelphia to san francisco on wednesdays • list all flights from boston to san francisco with the maximum number of stops • may i see the flights from denver to san francisco please • show flights from dallas to san francisco • Generalized search path: show me flights from {boston, philadelphia, denver, dallas} to san francisco on wednesdays
Generalization • Generalized search path: show me flights from {boston, philadelphia, denver, dallas} to san francisco on wednesdays • Resulting pattern: P1: from _E1 to, where _E1 = {boston, philadelphia, denver, dallas} • Corpus sentences containing the pattern: • list all flights going from boston to atlanta on wednesday… • i need to fly from boston to baltimore please give me… • which airlines fly from dallas to denver • please give me a flight from philadelphia to atlanta before ten a m in the morning
Context-sensitive generalization • Slide a context window of size L across the current search path • For each 1≤i≤L • Look at all paths that are identical to the search path at positions 1≤k≤L, except at k=i • Define an equivalence class containing the nodes at index i of these paths • Replace the i’th node with the equivalence class • Find significant patterns using the MEX criterion
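The slot-filling step can be sketched as follows, assuming paths are plain lists of words and using a 0-based slot index (the slide counts from 1). `equivalence_class` is a hypothetical helper that covers a single window and a single slot only.

```python
def equivalence_class(paths, window, i):
    """Collect the nodes found at slot i among all sub-paths that match
    `window` at every other position."""
    L = len(window)
    members = set()
    for path in paths:
        for start in range(len(path) - L + 1):
            candidate = path[start:start + L]
            if all(candidate[k] == window[k] for k in range(L) if k != i):
                members.add(candidate[i])
    return members

corpus = [
    "show me flights from philadelphia to san francisco on wednesdays".split(),
    "list all flights from boston to san francisco with the maximum number of stops".split(),
    "may i see the flights from denver to san francisco please".split(),
    "show flights from dallas to san francisco".split(),
]
window = ["flights", "from", "philadelphia", "to", "san"]
cities = equivalence_class(corpus, window, 2)
```

A full pass would repeat this for every window position and every slot, then hand the generalized path to MEX.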
Determining L • Involves a tradeoff • Larger L will demand more context sensitivity in the inference • Will hamper generalization • Smaller L will detect more patterns • But many might be spurious
When it all goes wrong • john believes that to please is easy • john thinks that to please is fun • jack and john believe that to please is hard • Generalized search path: john {believes, thinks, believe} that to please is easy
Bootstrapping • Search path: what are the cheapest flights from denver to boston that stop in atlanta • A pre-existing equivalence class: {boston, philadelphia, denver, dallas} • Generalized search path I: what are the cheapest flights from {boston, philadelphia, denver, dallas} to {boston, philadelphia, denver, dallas} that stop in atlanta
Bootstrapping • Generalized search path I: what are the cheapest flights from {boston, philadelphia, denver, dallas} to {boston, philadelphia, denver, dallas} that stop in atlanta • Corpus sentences matching the generalized path: • what is the cheapest fare from denver to philadelphia and from pittsburgh to atlanta • i would… like the cheapest airfare from boston to denver december twenty sixth • show me the cheapest flight from philadelphia to dallas which arrives…
Bootstrapping • Generalized search path II: what are the cheapest {flight, flights, airfare, fare} from {boston, philadelphia, denver} to {denver, philadelphia, dallas} that stop in atlanta • Resulting pattern: _P2: the cheapest _E2 from _E3 to _E4 • _E2 = {flight, flights, airfare, fare} • _E3 = {boston, philadelphia, denver} • _E4 = {denver, philadelphia, dallas}
Bootstrapping • Slide a context window of length L along the current search path • Consider all sub-paths of length L that begin in a1 and end in aL • These are the candidate paths • For each 1≤i≤L • For each 1≤k≤L, k≠i • Replace node k with the EC that contains node k and maximally overlaps the set of nodes at index k of the candidate paths • Continue as before
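The replacement rule for a single node can be sketched like this, assuming equivalence classes are plain Python sets and that ties are broken by taking the first class found; `best_overlapping_class` is a hypothetical name, and the tie-breaking rule is an assumption.

```python
def best_overlapping_class(node, candidate_nodes, classes):
    """Among the existing classes that contain `node`, return the one
    sharing the most members with `candidate_nodes` (the nodes seen at
    this index in the candidate paths). Return None if no class fits."""
    containing = [ec for ec in classes if node in ec]
    if not containing:
        return None
    return max(containing, key=lambda ec: len(ec & candidate_nodes))

existing = [
    {"boston", "philadelphia", "denver", "dallas"},
    {"denver", "colorado"},
]
seen_at_index = {"boston", "philadelphia", "denver"}
chosen = best_overlapping_class("denver", seen_at_index, existing)
```

Here `denver` belongs to both classes, but the city class overlaps the candidate nodes in three members against one, so it is the one substituted into the search path.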
The ADIOS algorithm • Initialization – load all data into a pseudograph • Until no more patterns are found • For each path P • Create generalized search paths from P • Detect significant patterns using MEX • If found, add best new pattern and equivalence classes and rewire the graph
Alternative rewiring tacks • Single mode • As just described: the best pattern is selected and added to the graph • Multiple mode • All patterns from the current search path are added to the graph in order of significance • Batch mode • The search is conducted over all paths, and the best patterns are added at the end
Evaluating performance • Define • Recall – the probability of ADIOS recognizing an unseen grammatical sentence • Precision – the proportion of grammatical ADIOS productions • Recall can be assessed by leaving out some of the training corpus • Precision is trickier • Unless we’re learning a known CFG
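Both measures reduce to an acceptance rate over a set of sentences, which can be sketched as below. The predicate `accepts` stands in for whatever decides acceptance (the learner for recall, a reference grammar for precision); it is a hypothetical callable, not an ADIOS interface.

```python
def acceptance_rate(accepts, sentences):
    """Fraction of `sentences` for which the predicate `accepts` is true."""
    sentences = list(sentences)
    if not sentences:
        raise ValueError("need at least one sentence")
    return sum(1 for s in sentences if accepts(s)) / len(sentences)

# Recall:    acceptance_rate(learner_accepts, held_out_sentences)
# Precision: acceptance_rate(reference_grammar_accepts, generated_sentences)
```

The asymmetry in the slide shows up in the second call: it needs a reference grammar to judge the generated sentences, which is only available when the target is a known CFG.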
An ADIOS drawback • ADIOS is inherently a heuristic and greedy algorithm • Once a pattern is created it remains forever, so errors compound • Sentence ordering affects the outcome • Running ADIOS with different orderings gives patterns that ‘cover’ different parts of the grammar
An ad-hoc solution • Train multiple learners on the corpus • Each on a different sentence ordering • Create a ‘forest’ of learners • To create a new sentence • Pick one learner at random • Use it to produce the sentence • To check the grammaticality of a given sentence • If any learner accepts the sentence, declare it grammatical
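A small sketch of the forest logic, assuming each learner exposes `accepts` and `generate` methods. `MemorizingLearner` is a toy stand-in invented for the example (it only remembers sentences), not a trained ADIOS learner.

```python
import random

class MemorizingLearner:
    """Hypothetical stand-in for a trained learner: accepts stored sentences only."""

    def __init__(self, sentences):
        self.sentences = list(sentences)

    def accepts(self, sentence):
        return sentence in self.sentences

    def generate(self, rng):
        return rng.choice(self.sentences)

class Forest:
    """Combine learners: generate from one at random, accept if any accepts."""

    def __init__(self, learners, seed=None):
        self.learners = list(learners)
        self.rng = random.Random(seed)

    def generate(self):
        learner = self.rng.choice(self.learners)
        return learner.generate(self.rng)

    def accepts(self, sentence):
        return any(learner.accepts(sentence) for learner in self.learners)

forest = Forest(
    [MemorizingLearner(["a dog barked"]), MemorizingLearner(["the cat slept"])],
    seed=0,
)
```

Accepting on any single vote raises recall, since each learner covers a different part of the grammar, at the cost of also accepting anything that any one learner over-generalizes to.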
The Real Deal http://www.tau.ac.il/~zsolan/adios/algorithm.html
The ADIOS executables • A C++/Linux implementation • There are 4 relevant executables – • adios.exe • The actual implementation of the algorithm • create_graph.exe • Loads a corpus into the ADIOS pseudograph • scrambler.exe • Randomizes the order of sentences in a corpus • convert_grammar.exe • Converts a CFG to an ADIOS representation
Preparing the corpus • Each path should be on a line of its own • Each line starts with a ‘*’ and ends with a ‘#’ • These represent the BEGIN and END nodes, respectively • Words (nodes) are separated by spaces • Example:
* Jim and Cindy have a winning personality #
* Beth won't be released until Friday #
* a horse barked #
* the dog loved a cat #
* the cats are living very far away #
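A short helper for producing this format from plain sentences might look like the sketch below. It assumes the input is already tokenized by whitespace and does no punctuation handling; the function names are illustrative.

```python
def to_adios_line(sentence):
    """Wrap one sentence in the '*' and '#' path markers, with words
    separated by single spaces."""
    return "* " + " ".join(sentence.split()) + " #"

def to_adios_corpus(sentences):
    """Return the corpus text with one path per line, skipping blank input."""
    lines = [to_adios_line(s) for s in sentences if s.strip()]
    return "\n".join(lines) + "\n"

text = to_adios_corpus(["a horse barked", "  the dog   loved a cat ", ""])
```

Writing `text` to a file gives an input suitable for the graph-creation step.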
Creating the graph • Done by create_graph.exe: ./create_graph.exe -f corpus_file -o proj_name • Two files will be created – • proj_name.idx – an index file containing the list of nodes (the lexicon) and a numeric code for each node • proj_name.grp – a text file describing the pseudograph
Running ADIOS • General usage: ./adios.exe [-options] -o proj_name • ADIOS continuously updates and saves the current graph and pattern files – • graph.dat • patterns.dat • sysparams.dat • These files, along with the index file, are important for all other ADIOS operations
Training ADIOS • To train, the following parameters are usually used: ./adios.exe -a train -i proj.idx -g proj.grp -E 0.8 -S 0.01 -o proj • Some parameters – • -a – the action to perform (train / test / generate / print) • -i – the index file name • -g – the graph file name • -E – eta (the threshold used by MEX) – default 0.8 • -S – alpha (the significance level required by MEX) – default 0.01 • -o – the project name, which will be used for output and log files
Some additional parameters • -W – the context window width – default 5 (use 1000 for no ECs) • -r – rewiring mode • 0 – no rewiring • 1 – single (the most commonly used) • 2 – multiple • 3 – batch (used for text segmentation) • -A – largest pattern size; all patterns above this size will be treated as equal in the rewiring process (default 1)
Result files • proj.trace.log – a summary of the algorithm’s run • Includes several statistics throughout the processing of the corpus • proj.results.txt – the set of patterns the algorithm has detected, along with a ‘pattern spectra’ analysis • Best viewed with Excel
Resuming training • If ADIOS stalls for some reason, or if you want to continue a run with different parameters (e.g. when incrementing alpha), use: ./adios.exe -a train -i proj_name.idx • ADIOS will use the existing graph.dat, patterns.dat and sysparams.dat files to resume its operation
Testing ADIOS • ./adios.exe -a test -i idx_file -I test_file -R 10 -o proj • -I – the file containing the test sentences • -R – determines the maximum depth of the parse trees. Paths that require deeper parse trees will not be accepted. Default value – 10. • Assumes graph.dat and patterns.dat are in the same directory
Testing ADIOS • Output files – • proj.test.results.txt • a detailed text file listing the partial parses of each test path • proj.test.summary.txt • a summary file, listing for each test path the patterns accepted on it and whether it’s accepted as a whole • proj.test.classify.txt • a text file with a 0/1 result for each test path (number of accepted sentences = number of lines with a ‘1’ in this file)
Testing multiple learners • Running adios.exe on a second learner will not overwrite proj.classify.txt • Instead, each line will contain the number of learners that accepted the corresponding sentence
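Counting accepted sentences from such a file can be sketched as follows, assuming every non-blank line holds a single bare integer (0/1 for one learner, a vote count for several). The exact file layout is an assumption; only the per-line count is described above.

```python
def count_accepted(lines, min_learners=1):
    """Count the lines whose integer value is at least `min_learners`."""
    count = 0
    for line in lines:
        text = line.strip()
        if text and int(text) >= min_learners:
            count += 1
    return count
```

With the default `min_learners=1` this matches the "any learner accepts" rule; a higher threshold gives a stricter vote. The function takes any iterable of lines, so an open file object can be passed directly.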
Generating new sentences • ./adios.exe -a generate -i proj.idx -n 100 -R 10 -o proj_name • -i – the index file • -n – number of sentences to generate • -R – maximum parse depth • -o – project name
The generator’s output • The output file is proj.generate.txt • Will contain the new sentences in the ADIOS format • Some sentences may be ‘incomplete’ because of the -R option • In these, a ~ symbol will appear • Before using the generated sentences, these should be removed • Use the ‘sed’ command as explained on the webpage
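As an alternative to sed, the filtering can be sketched in Python. This assumes that "removed" means dropping every sentence that contains a ~ marker, rather than deleting the marker alone; that reading is an assumption.

```python
def drop_incomplete(lines, marker="~"):
    """Keep only the generated sentences that do not contain `marker`."""
    return [line for line in lines if marker not in line]

generated = [
    "* the dog loved a cat #",
    "* the ~ barked #",
    "* a horse barked #",
]
complete = drop_incomplete(generated)
```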
Scrambling sentences • Before creating the graph, the sentences in the input corpus can be scrambled using scrambler.exe. • Usage: ./scrambler.exe -f input_file -o output_file
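The same effect can be sketched in Python for cases where a reproducible ordering per learner is wanted. This illustrates the idea only and makes no claim about how scrambler.exe works internally; the `seed` parameter is an addition for repeatability.

```python
import random

def scramble(lines, seed=None):
    """Return a shuffled copy of `lines`, leaving the input untouched."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

Using a different seed for each learner gives each one its own sentence ordering while keeping every run repeatable.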
Using an artificial CFG • An artificial CFG in a proper format can be converted to an ADIOS representation • For testing precision/recall • Using convert_grammar.exe • The CFG should be stored in two files • CFG_lex.txt – a lexicon file • E.g. TA1_lex.txt • CFG_grammar.txt – the rewrite rules • E.g. TA1_grammar.txt
Convert grammar • Usage: ./convert_grammar.exe -l lex_file -g grammar_file -o proj_name • Output files – • proj_name.idx – index file • graph.dat – the graph • patterns.dat – the patterns
Displaying patterns • First print the ADIOS learner’s results • ./adios.exe -a print -i proj.idx • Open Matlab and set its workspace to the ADIOS directory • Use the pattern.m script • pattern(123, 'proj_name') will graphically display the pattern/EC whose ID is 123 from the project named proj_name