An Exploratory Study of Generalization in Syntax Using MDL. Mike Dowman, 2007-06-07
What Should Syntactic Theory Explain?
• Which sentences are grammatical and which are not, or
• How to transform observed sentences into a grammar
Children transform observed sentences (E) into psychological knowledge of language (I).
[Diagram: E → learning → I]
Poverty of the Stimulus
• Evidence available to children is utterances produced by other speakers
• No direct cues to sentence structure
• Or to word categories
So children need prior knowledge of possible structures: Universal Grammar (UG).
How should we study syntax?
Linguists’ Approach:
• Choose some sentences
• Decide on the grammaticality of each one
• Make a grammar that accounts for which of these sentences are grammatical and which are not
[Diagram labels: informant, sentences, linguist, grammar]
Computational Linguists’ Approach (Unsupervised Learning)
• Take a corpus
• Extract as much information from the corpus as accurately as possible, or
• Learn a grammar that describes the corpus as accurately as possible
[Diagram labels: corpus; grammar, lexical items, language model, etc.]
Which approach gives more insight into language?
Linguists tend to aim for high precision
• But only produce very limited and arbitrary coverage
Computational linguists tend to obtain much better coverage
• But don’t account for any body of data completely correctly
• And tend to learn only simpler kinds of structure
The approaches seem to be largely complementary.
Which approach gives more insight into the human mind?
• The huge size and complexity of languages is one of their key distinctive properties
• The linguists’ approach doesn’t account for this
• So should we apply our algorithms to large corpora of naturally occurring data?
• This won’t directly address the kind of issue that linguists focus on
Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
Implicit Negative Evidence
If constructions don’t appear, can we just assume they’re not grammatical?
No: we only see a tiny proportion of possible, grammatical sentences
• People generalize from examples they have seen to form new utterances
‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Minimum Description Length (MDL)
MDL may be able to solve the poverty of the stimulus problem.
It prefers the grammar that results in the simplest overall description of the data
• So it prefers simple grammars
• And grammars that result in simple descriptions of the data
Simplest means specifiable using the least amount of information.
[Figure: the observed sentences lie within the space of possible sentences. Three grammars are compared: a simple but non-constraining grammar, a complex but constraining grammar, and a grammar that is a good fit to the data.]
MDL and Bayes’ Rule
• h is a hypothesis (= grammar)
• d is some data (= sentences)
• The probability of a grammar given some data is proportional to its a priori probability times how likely the observed sentences would be if that grammar were correct
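The relationship stated in the last bullet is Bayes' rule. Written out, with the usual MDL reading obtained by taking negative logarithms (base 2 is an assumption here, chosen so that lengths come out in bits):

```latex
P(h \mid d) = \frac{P(h)\,P(d \mid h)}{P(d)}
\qquad\Longrightarrow\qquad
\hat{h} = \arg\max_{h} P(h)\,P(d \mid h)
        = \arg\min_{h} \bigl[ -\log_2 P(h) - \log_2 P(d \mid h) \bigr]
```

The first term can be read as the coding length of the grammar and the second as the coding length of the data given that grammar; P(d) does not depend on h, so it drops out of the comparison.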
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of deciding the prior probability of each grammar
• But MDL is still subjective: the prior bias is just hidden in the formalism chosen to represent grammars, and in the encoding scheme
Why it has to be MDL
• Many machine learning techniques have been applied in computational linguistics
• MDL is very rarely used
• Only modest success at learning grammatical structure from corpora
So why MDL?
Maximum Likelihood
• Maximum likelihood can be seen as a special case of MDL in which the a priori probability P(h) of all hypotheses is equal
• But the hypothesis that only the observed sentences are grammatical will result in the maximum likelihood
• So ML can only be applied if there are restrictions on how well the estimated parameters can fit the data
• The degree of generality of the grammars is set externally, not determined by the Maximum Likelihood principle
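The special-case claim can be checked in one line. Assuming every hypothesis has the same prior, P(h) = c (the constant c is notation introduced here, not from the slides), the prior no longer affects the choice of hypothesis:

```latex
\hat{h}_{\mathrm{MAP}} = \arg\max_{h} P(h)\,P(d \mid h)
                       = \arg\max_{h} c\,P(d \mid h)
                       = \arg\max_{h} P(d \mid h)
                       = \hat{h}_{\mathrm{ML}}
```

With no prior cost on the grammar, nothing penalizes a hypothesis that simply lists the observed sentences, which is why the degree of generality has to be fixed from outside.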
Maximum Entropy
• Make the grammar as unrestrictive as possible
• But constraints must be used to prevent a grammar from allowing just any combination of words to be a grammatical sentence
• Again the degree of generality of grammars is determined externally
Neither Maximum Likelihood nor Maximum Entropy provides a principle that can decide when to make generalizations.
Learning Phrase Structure Grammars
• Binary or non-branching rules: S → B C; B → E; C → tomato
• All derivations start from the special symbol S
Encoding Grammars
Grammars can be coded as lists of three symbols
• The first symbol is the rule’s left-hand side; the second and third are its right-hand side
• A null symbol in third position indicates a non-branching rule
S, B, C, B, E, null, C, tomato, null
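As a minimal sketch of this flattening, the snippet below turns a rule list into the symbol triples shown on the slide. The function name `flatten_rules` and the tuple representation of rules are illustrative assumptions, not part of the original model.

```python
NULL = "null"  # placeholder symbol marking a non-branching rule

def flatten_rules(rules):
    """Flatten (lhs, rhs) rules into a list of symbol triples.

    Each rule contributes exactly three symbols: its left-hand side,
    then its right-hand side padded with NULL if it has one symbol.
    """
    symbols = []
    for lhs, rhs in rules:
        if not 1 <= len(rhs) <= 2:
            raise ValueError("only binary or non-branching rules are allowed")
        second = rhs[1] if len(rhs) == 2 else NULL
        symbols.extend([lhs, rhs[0], second])
    return symbols

# The example grammar from the slide: S -> B C, B -> E, C -> tomato
rules = [("S", ("B", "C")), ("B", ("E",)), ("C", ("tomato",))]
encoded = flatten_rules(rules)
print(encoded)
```

Because every rule occupies exactly three positions, a decoder can recover the rule boundaries without any separators.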
Statistical Encoding of Grammars
• First we encode the frequency of each symbol
• Then encode each symbol using the frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log 1/9
I(null) = -log 2/9
Uncommon symbols have a higher coding length than common ones.
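The two coding lengths above can be reproduced with a short sketch. It assumes base-2 logarithms (so lengths are in bits) and counts only the second step, the cost of the symbols themselves; the cost of transmitting the frequency table is not specified on the slide and is left out. The function names are illustrative.

```python
import math
from collections import Counter

def symbol_costs(symbols):
    """Coding length in bits of each distinct symbol: -log2(count / total)."""
    counts = Counter(symbols)
    total = len(symbols)
    return {s: -math.log2(c / total) for s, c in counts.items()}

def symbols_coding_length(symbols):
    """Total bits needed to encode the symbol sequence (frequency table excluded)."""
    costs = symbol_costs(symbols)
    return sum(costs[s] for s in symbols)

symbols = ["S", "B", "C", "B", "E", "null", "C", "tomato", "null"]
costs = symbol_costs(symbols)
total_bits = symbols_coding_length(symbols)

print(f"I(S)    = {costs['S']:.2f} bits")     # -log2(1/9), about 3.17
print(f"I(null) = {costs['null']:.2f} bits")  # -log2(2/9), about 2.17
print(f"total   = {total_bits:.2f} bits")     # about 22.53
```

S occurs once in nine symbols and null twice, so S costs exactly one bit more than null, which illustrates the point that uncommon symbols are more expensive.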
Encoding Data
Grammar: 1 S → NP VP; 2 NP → john; 3 NP → mary; 4 VP → screamed; 5 VP → died
Data: John screamed; John died; Mary screamed
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
• There is a restricted range of choices at each stage of the derivation
• Fewer choices = higher probability
Statistical Encoding of Data
If we record the frequency of each rule, this information can help us make a more efficient encoding
• 1 S → NP VP (3)
• 2 NP → john (2)
• 3 NP → mary (1)
• 4 VP → screamed (2)
• 5 VP → died (1)
Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: rule 1: 3/3, rule 2: 2/3, rule 4: 2/3, rule 1: 3/3, rule 2: 2/3, …
Total frequency for S = 3; total frequency for NP = 3; total frequency for VP = 3
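A minimal sketch of this calculation follows. Each rule's probability is its frequency divided by the total frequency of rules sharing its left-hand side, and the data cost is the sum of the negative log probabilities. Base-2 logarithms and the dictionary layout are assumptions made for illustration.

```python
import math
from collections import defaultdict

# rule number -> (left-hand side, frequency), as listed on the slide
rules = {1: ("S", 3), 2: ("NP", 2), 3: ("NP", 1), 4: ("VP", 2), 5: ("VP", 1)}
data = [1, 2, 4, 1, 2, 5, 1, 3, 4]

def lhs_totals(rules):
    """Total frequency of all rules expanding each left-hand-side symbol."""
    totals = defaultdict(int)
    for lhs, freq in rules.values():
        totals[lhs] += freq
    return dict(totals)

def data_coding_length(rules, data):
    """Bits needed to encode the rule sequence given the rule frequencies."""
    totals = lhs_totals(rules)
    bits = 0.0
    for rule_id in data:
        lhs, freq = rules[rule_id]
        bits += -math.log2(freq / totals[lhs])
    return bits

totals = lhs_totals(rules)
bits = data_coding_length(rules, data)
print(totals)
print(f"{bits:.2f} bits")  # about 5.51
```

Rule 1 is the only way to expand S, so it has probability 3/3 and costs nothing to encode; all of the cost comes from the choices among the NP and VP rules.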
Encoding in My Model
[Diagram: a bit string (1010100111010100101101010001100111100011010110) is passed to a decoder, which recovers the grammar, the symbol frequencies, the rule frequencies, and the data. Number of bits decoded = evaluation.]
Grammar: 1 S → NP VP; 2 NP → john; 3 NP → mary; 4 VP → screamed; 5 VP → died
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Rule frequencies: Rule 1: 3; Rule 2: 2; Rule 3: 1; Rule 4: 2; Rule 5: 1
Data: John screamed; John died; Mary screamed
Creating Candidate Grammars
• Start with a simple grammar that allows all sentences
• Make a simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at the data coding length
• Second stage: look at the overall evaluation
Example: English
Learned Grammar:
S → NP VP
VP → ran; VP → screamed; VP → Vt NP; VP → Vs S
Vt → hit; Vt → kicked
Vs → thinks; Vs → hopes
NP → John; NP → Ethel; NP → Mary; NP → Noam
Data: John hit Mary; Mary hit Ethel; Ethel ran; John ran; Mary ran; Ethel hit John; Noam hit John; Ethel screamed; Mary kicked Ethel; John hopes Ethel thinks Mary hit Ethel; Ethel thinks John ran; John thinks Ethel ran; Mary ran; Ethel hit Mary; Mary thinks John hit Ethel; John screamed; Noam hopes John screamed; Mary hopes Ethel hit John; Noam kicked Mary
Real Language Data
Can the MDL metric also learn grammars from corpora of unrestricted natural language?
If it could, we’d largely have finished syntax.
But the search space is way too big
• Need to simplify the task in some way
• Only learn verb subcategorization classes
Switchboard Corpus
( (S (CC and) (PRN (, ,) (S (NP-SBJ (PRP you)) (VP (VBP know))) (, ,)) (NP-SBJ-1 (PRP she)) (VP (VBD spent) (NP (NP (CD nine) (NNS months)) (PP (IN out) (PP (IN of) (NP (DT the) (NN year))))) (S-ADV (NP-SBJ (-NONE- *-1)) (ADVP (RB just)) (VP (VBG visiting) (NP (PRP$ her) (NNS children))))) (. .) (-DFL- E_S)))
Extracted Information:
Verb: spent
Subcategorization frame: * NP S
Extracted Data
• Only verbs tagged as VBD (past tense) extracted
• Modifiers to basic labels ignored
• 21,759 training instances
• 704 different verbs
• 706 distinct subcategorization frames
• 25 different types of constituent appeared alongside the verbs (e.g. S, SBAR, NP, ADVP)
Verb Class Grammars
S → Class1 Subcat1
S → Class1 Subcat2
S → Class2 Subcat1
Class1 → grew; Class1 → ended; Class2 → do
grew and ended can appear with subcats 1 and 2; do only with subcat 1.
Grouping together verbs with similar subcategorizations should improve the evaluation.
A New Search Mechanism
We need a search mechanism that will only produce candidate grammars of the right form
• Start with all verbs in one class
• Move a randomly chosen verb to a new class (P = 0.5) or to a different existing class (P = 0.5)
• Empty verb classes are deleted
• Redundant rules are removed
A New Search Mechanism (2)
Annealing search:
• After no changes are accepted for 2,000 iterations, switch to the merging phase
• Merge two randomly selected classes
• After no changes are accepted for 2,000 iterations, switch back to the moving phase
• Stop after no changes are accepted for 20,000 iterations
• Multiple runs were conducted and the grammar with the overall lowest evaluation was selected
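The alternation between moving and merging phases can be sketched as below. This is a simplified illustration under stated assumptions, not the author's implementation: it accepts a change only when the evaluation strictly improves (the slides do not give the annealing acceptance rule), it represents a grammar only as a partition of verbs into classes, and `toy_evaluate` with its `FRAMES` table is a made-up stand-in for the real MDL evaluation.

```python
import random

def move_step(classes, rng):
    """Move one random verb to a new class (P=0.5) or a different existing class."""
    new = [set(c) for c in classes]
    src = rng.randrange(len(new))
    verb = rng.choice(sorted(new[src]))
    new[src].remove(verb)
    others = [i for i in range(len(new)) if i != src]
    if rng.random() < 0.5 or not others:
        new.append({verb})
    else:
        new[rng.choice(others)].add(verb)
    return [c for c in new if c]  # empty verb classes are deleted

def merge_step(classes, rng):
    """Merge two randomly selected classes (no-op if there is only one)."""
    if len(classes) < 2:
        return [set(c) for c in classes]
    i, j = rng.sample(range(len(classes)), 2)
    merged = [set(c) for k, c in enumerate(classes) if k not in (i, j)]
    merged.append(set(classes[i]) | set(classes[j]))
    return merged

def search(verbs, evaluate, phase_patience=2000, stop_patience=20000, seed=0):
    """Alternate moving and merging phases until no change is accepted for long enough."""
    rng = random.Random(seed)
    classes = [set(verbs)]              # start with all verbs in one class
    best = evaluate(classes)
    steps = [move_step, merge_step]
    phase = idle_phase = idle_total = 0
    while idle_total < stop_patience:
        candidate = steps[phase](classes, rng)
        score = evaluate(candidate)
        if score < best:                # greedy acceptance (assumption)
            classes, best = candidate, score
            idle_phase = idle_total = 0
        else:
            idle_phase += 1
            idle_total += 1
            if idle_phase >= phase_patience:
                phase, idle_phase = 1 - phase, 0
    return classes, best

# Hypothetical toy data: which subcat frames each verb was observed with.
FRAMES = {"grew": {1, 2}, "ended": {1, 2}, "do": {1}}

def toy_evaluate(classes):
    """Stand-in cost: one unit per class-frame rule, two per unobserved frame allowed."""
    cost = 0
    for cls in classes:
        allowed = set().union(*(FRAMES[v] for v in cls))
        cost += len(allowed)
        cost += sum(2 * (len(allowed) - len(FRAMES[v])) for v in cls)
    return cost

classes, best = search(list(FRAMES), toy_evaluate, phase_patience=50, stop_patience=200, seed=1)
print(sorted(sorted(c) for c in classes), best)
```

Passing the evaluation in as a function keeps the search independent of how grammars are scored, so the same loop could be run several times with different seeds and the lowest-scoring result kept, as the last bullet describes.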
Grammar Evaluations
 | One verb class | Each verb in a separate class | Best learned grammar
Overall Evaluation | 250,435.5 | 298,063.3 | 245,198.0
Grammar | 29,915.1 | 111,036.5 | 37,885.5
Data | 220,520.4 | 187,026.7 | 207,312.4
Did MDL make appropriate generalizations?
• The learned verb classes are clearly linguistically coherent
• But they don’t account for exactly which verbs can appear with which subcats
• Linguists have proposed far more fine-grained classes
• Data available for learning was limited (subcats had no internal structure; Penn Treebank labels may not be sufficient)
• But linguists can’t explain which verbs appear with which subcats either
Is there a correct learning mechanism?
[Diagram: a chain of I-languages and E-languages (I1, E1, I2, E2, I3) linked by "produce" and "learn" steps, alongside a computer model that learns a grammar.]
The learned grammar will only reflect the I-language I3 if the learning mechanism makes the same kind of generalizations from the E-language E2 as people do.
Why do we think MDL is likely to be a good learning mechanism?
• MDL is a general purpose learning principle
• But language is the product of the learning mechanisms of other people
• So is there any reason to suppose that a general purpose learning mechanism would be good at learning language?
Should a general purpose learning principle like MDL be able to learn language?
• When language first arose, did people recruit a pre-existing learning mechanism to the task of learning language?
• If so, the LAD should learn in a fairly generic and sensible way
• But have we evolved mechanisms specific to language learning?
• If a mechanism is specific to language learning, it should work just as well if the generalizations are idiosyncratic as if they are sensible
Conclusions
• MDL (and only MDL) can determine when to make linguistic generalizations and when not to
• Lack of negative evidence is not a particular problem when using MDL
• The same MDL metric can be used both on small sets of example sentences and on unrestricted corpora