430 likes | 534 Views
Minimum Description Length An Adequate Syntactic Theory?. Mike Dowman 3 June 2005. Language Acquisition Device. Individual's Knowledge of Language. Primary Linguistic Data. Linguistic Theory. Chomsky’s Conceptualization of Language Acquisition . Individual's Knowledge of Language.
E N D
Minimum Description LengthAn Adequate Syntactic Theory? Mike Dowman 3 June 2005
Language Acquisition Device Individual's Knowledge of Language Primary Linguistic Data Linguistic Theory Chomsky’s Conceptualization of Language Acquisition
Individual's Knowledge of Language Diachronic Theories Language Acquisition Device Primary Linguistic Data Arena of Language Use Hurford’s Diachronic Spiral
Learnability Poverty of the stimulus • Language is really complex Obscure and abstract rules constrain, wh-movement, pronoun binding, passive formation, etc. Examples of E-language don’t give sufficient information to determine this
WH-movement Whoi do you think Lord Emsworth will invite ti? Whoi do you think that Lord Emsworth will invite ti? Whoi do you think ti will arrive first? * Whoi do you think that ti will arrive first?
Negative Evidence • Some constructions seem impossible to learn without negative evidence John gave a painting to the museum John gave the museum a painting John donated a painting to the museum * John donated the museum a painting
Implicit Negative Evidence If constructions don’t appear can we just assume they’re not grammatical? No – we only see a tiny proportion of possible, grammatical sentences • People generalize from examples they have seen to form new utterances ‘[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?’ (Pinker, 1989)
Learnability Proofs Gold (1967) for languages to be learnable in the limit we must have: • Negative evidence • or a priori restrictions on possible languages But learnable in the limit means being sure that we have determined the correct language
Statistical Learnability Horning (1969) • If grammars are statistical • so utterances are produced with frequencies corresponding to the grammar • Languages are learnable • But we can never be sure when the correct grammar has been found • This just gets more likely as we see more data
Horning’s Proof • Used Bayes’ rule • More complex grammars are less probable a priori P(h) • Statistical grammars can assign probabilities to data P(d | h) • Search through all possible grammars, starting with the simplest
MDL Horning’s evaluation method for grammars can be seen as a form of Minimum Description Length • Simplest is best (Occam’s Razor) • Simplest means specifiable with the least amount of information Information theory (Shannon, 1948) allows us to link probability and information: Amount of information = -log Probability
Encoding Grammars and Data Decoder 1010100111010100101101010001100111100011010110 Grammar Data coded in terms of grammar The comedian died A kangaroo burped The aeroplane laughed Some comedian burped A B C B D E E {kangaroo, aeroplane, comedian} D {the, a, some} C {died, laughed, burped}
Complexity and Probability • More complex grammar • Longer coding length, so lower probability • More restrictive grammar • Less choices for data, so each possibility has a higher probability
Most restrictive grammar just lists all possible utterances • Only the observed data is grammatical, so it has a high probability • A simple grammar could be made that allowed any sentences • Grammar would have a high probability • But data a very low one MDL finds a middle ground between always generalizing and never generalizing
Rampant Synonymy? • Inductive inference (Solomonoff, 1960a) • Kolmogorov complexity (Kolmogorov, 1965) • Minimum Message Length (Wallace and Boulton, 1968) • Algorithmic Information Theory (Chaitin, 1969) • Minimum Description Length (Rissanen, 1978) • Minimum Coding Length (Ellison, 1992) • Bayesian Learning (Stolcke, 1994) • Minimum Representation Length (Brent, 1996)
Evaluation and Search • MDL principle gives us an evaluation criterion for grammars (with respect to corpora) • But it doesn’t solve the problem of how to find the grammars in the first place Search mechanism needed
Two Learnability Problems • How to determine which of two or more grammars is best given some data • How to guide the search for grammars so that we can find the correct one, without considering every logically possible grammar
MDL in Linguistics • Solomonoff (1960b): ‘Mechanization of Linguistic Learning’ • Learning phrase structure grammars for simple ‘toy’ languages: Stolcke (1994), Langley and Stromsten (2000) • Or real corpora: Chen (1995), Grünwald (1994) • Or for language modelling in speech recognition systems: Starkie (2001)
Not Just Syntax! • Phonology: Ellison (1992), Rissanen and Ristad (1994) • Morphology: Brent (1993), Goldsmith (2001) • Segmenting continuous speech: de Marcken (1996), Brent and Cartwright (1997)
MDL and Parameter Setting • Briscoe (1999) and Rissanen and Ristad (1994) used MDL as part of parameter setting learning mechanisms MDL and Iterated Learning • Briscoe (1999) used MDL as part of an expression-induction model • Brighton (2002) investigated effect of bottlenecks on an MDL learner • Roberts et al (2005) modelled lexical exceptions to syntactic rules
An Example: My Model Learns simple phrase structure grammars • Binary or non-branching rules: A B C D E F tomato • All derivations start from special symbol S • null symbol in 3rd position indicates non-branching rule
Encoding Grammars Grammars can be coded as lists of three symbols • First symbol is rules left hand side, second and third its right hand side A, B, C, D, E, null, F, tomato, null • First we have to encode the frequency of each symbol
Encoding Data 1 S NP VP (3) 2 NP john (2) 3 NP mary (1) 4 VP screamed (2) 5 VP died (1) Data: 1, 2, 4, 1, 2, 5, 1, 3, 4 Probabilities: 1 3/3, 2 2/3, 4 2/3, 1 3/3, 2 2/3… We must record the frequency of each rule Total frequency for S = 3 Total frequency for NP = 3 Total frequency for VP = 3
Encoding in My Model Decoder 1010100111010100101101010001100111100011010110 Grammar Symbol Frequencies Rule Frequencies Data Rule 1 3 Rule 2 2 Rule 3 1 Rule 4 2 Rule 5 1 John screamed John died Mary Screamed 1 S NP VP 2 NP john 3 NP mary 4 VP screamed 5 VP died S (1) NP (3) VP (3) john (1) mary (1) screamed (1) died (1) null (4)
Search Strategy • Start with simple grammar that allows all sentences • Make simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.) • Annealing search • First stage: just look at data coding length • Second stage: look at overall evaluation
Example: English Learned Grammar S NP VP VP ran VP screamed VP Vt NP VP Vs S Vt hit Vt kicked Vs thinks Vs hopes NP John NP Ethel NP Mary NP Noam John hit Mary Mary hit Ethel Ethel ran John ran Mary ran Ethel hit John Noam hit John Ethel screamed Mary kicked Ethel John hopes Ethel thinks Mary hit Ethel Ethel thinks John ran John thinks Ethel ran Mary ran Ethel hit Mary Mary thinks John hit Ethel John screamed Noam hopes John screamed Mary hopes Ethel hit John Noam kicked Mary
Dative Alternation • Children learn distinction between alternating and non-alternating verbs • Previously unseen verbs are used productively in both constructions New verbs follow regular pattern • During learning children use non-alternating verbs in both constructions U-shaped learning
Training Data • Three alternating verbs: gave, passed, lent • One non-alternating verb: donated • One verb seen only once: sent The museum lent Sam a painting John gave a painting to Sam Sam donated John to the museum The museum sent a painting to Sam
Grammar Properties • Learned grammar distinguishes alternating and non-alternating verbs • sent appears in alternating class • With less data, only one class of verbs, so donated can appear in both constructions • All sentences generated by the grammar are grammatical • But structures are not right
Learned Structures S X Y Z NP NP VA DET N P NP John gave a painting to Sam
Regular and Irregular Rules • Why does the model place a newly seen verb in the regular class? Y VA NP Y VA Z Y VP Z VA passed VA gave VA lent VP donated VA/ VP sent Regular constructions are preferred because the grammar is coded statistically
Why use Statistical Grammars? Statistics are a valuable source of information • They help to infer when absences are due to chance The learned grammar predicted that sent should appear in the double object construction • but in 150 sentences it was only seen in the prepositional dative construction • With a non statistical grammar we need an explanation as to why this is • A statistical grammar knows that sent is rare, which explains the absence of double object occurrences
Scaling Up: Onnis, Roberts and Chater (2003) Causative alternation: John cut the string * The string cut * John arrived the train The train arrived John bounced the ball The ball bounced
Onnis et al’s Data • Two word classes: N and V • NV and VN only allowable sentences 16 verbs alternate: NV + VN 10 verbs NV only 10 verbs VN only Coding scheme marks non-alternating verbs as exceptional (cost in coding scheme)
Onnis et al’s Results < 16,000 sentences all verbs alternate > 16,000 sentences non alternating verbs classified as exceptional No search mechanism Just looked at evaluations with and without exceptions In expression-induction model quasi-regularities appear as a result of chance omissions
MDL and MML Issues • Numeric parameters - accuracy • Bayes’ optimal classification (not MAP learning) – Monte Carlo methods If we see a sentence, work out the probability of it for each grammar Weighted sum gives probability of sentence • Unseen data – zero probability?
One and Two Part Codes Decoder 1010100111010100101101010001100111100011010110 Grammar Data coded in terms of grammar 1010100111010100101101010001100111100011010110 Decoder Data and grammar combined Data Grammar
Coding English Texts Grammar is a frequency for each letter and for space • Counts start at one • We decode a series of letters – and update the counts for each letter • All letters coded in terms of their probabilities at that point in the decoding • At end we have a decoded text and grammar
One-Part Grammars Grammars can also be coded using one-part codes • Start with no grammar, but have a probability associated with adding a new rule • Each time we decode data we first choose to add a new rule, or use an existing one Examples are Dowman (2000) or Venkataraman (1997)
Conclusions • MDL can solve the poverty of the stimulus problem • But it doesn’t solve the problem of constraining the search for grammars • Coding schemes create learning biases • Statistical grammars and statistical coding of grammars can help learning