
A Bayesian Approach to the Poverty of the Stimulus




Presentation Transcript


  1. A Bayesian Approach to the Poverty of the Stimulus Amy Perfors MIT With Josh Tenenbaum (MIT) and Terry Regier (University of Chicago)

  2. [Framework diagram: Innate vs. Learned]

  3. [Framework diagram: Innate vs. Learned × Explicit Structure vs. No Explicit Structure]

  4. [Framework diagram: Does language have hierarchical phrase structure? No vs. Yes]

  5. Why believe that language has hierarchical phrase structure? • Formal properties plus an information-theoretic, simplicity-based argument (Chomsky, 1956) • The dependency structure of language: a finite-state grammar cannot capture the infinite set of English sentences with nested long-distance dependencies (e.g., if…then, either…or) • If we restrict ourselves to only a finite set of sentences, then in theory a finite-state grammar could account for them, “but this grammar will be so complex as to be of little use or interest.”

  6. Why believe that structure dependence is innate? The Argument from the Poverty of the Stimulus (PoS): • Data: simple declaratives (The girl is happy; They are eating) and simple interrogatives (Is the girl happy? Are they eating?) • Hypotheses: Linear: move the first “is” (auxiliary) in the sentence to the beginning; Hierarchical: move the auxiliary in the main clause to the beginning • Test: a complex declarative such as The girl who is sleeping is happy • Result: children say Is the girl who is sleeping happy?, NOT *Is the girl who sleeping is happy? (Chomsky, 1965, 1980; Crain & Nakayama, 1987)

  7. Why believe it’s not innate? • There are actually enough complex interrogatives in the input (Pullum & Scholz, 2002) • Children’s behavior can be explained via statistical learning from natural language data (Lewis & Elman, 2001; Reali & Christiansen, 2005) • It is not necessary to assume a grammar with explicit structure

  8. [Framework diagram revisited: Innate vs. Learned × Explicit Structure vs. No Explicit Structure]

  9. [Framework diagram revisited: Innate vs. Learned × Explicit Structure vs. No Explicit Structure]

  10. Our argument

  11. Our argument • We suggest that, contra the PoS claim: • It is possible, given the nature of the input and certain domain-general assumptions about the learning mechanism, that an ideal, unbiased learner can realize that language has a hierarchical phrase structure; therefore this knowledge need not be innate • The reason: grammars with hierarchical phrase structure offer an optimal tradeoff between simplicity and fit to natural language data

  12. Plan • Model • Data: corpus of child-directed speech (CHILDES) • Grammars • Linear & hierarchical • Both: Hand-designed & result of local search • Linear: automatic, unsupervised ML • Evaluation • Complexity vs. fit • Results • Implications

  13. The model: Data • Corpus from the CHILDES database (Adam corpus, from Brown) • 55 files, age range 2;3 to 5;2 • Sentences spoken by adults to children • Each word replaced by its syntactic category: det, n, adj, prep, pro, prop, to, part, vi, v, aux, comp, wh, c • Ungrammatical sentences and the most grammatically complex sentence types were removed, keeping 21792 of 25876 utterances • Removed: topicalized sentences (66), sentences with serial verb constructions (459), subordinate phrases (845), sentential complements (1636), conjunctions (634), and ungrammatical sentences (444)
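As an illustration of the preprocessing just described (not the authors' code), here is a minimal Python sketch; the word_to_category lookup and the is_excluded predicate are hypothetical stand-ins for the tagging and hand-filtering applied to the CHILDES transcripts.

```python
# Sketch of the corpus preprocessing on the slide above (illustrative only).
from collections import Counter

CATEGORIES = {"det", "n", "adj", "prep", "pro", "prop", "to",
              "part", "vi", "v", "aux", "comp", "wh", "c"}

def preprocess(utterances, word_to_category, is_excluded):
    """Replace each word with its syntactic category and drop excluded
    utterances (ungrammatical, topicalized, serial-verb, subordinate,
    sentential-complement, and conjunction sentence types)."""
    kept = []
    for utterance in utterances:
        if is_excluded(utterance):
            continue
        kept.append(tuple(word_to_category[w] for w in utterance.split()))
    return kept

def sentence_types(kept_utterances):
    """Collapse sentence tokens into sentence types with token counts."""
    return Counter(kept_utterances)
```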

  14. Data • Final corpus contained 2336 individual sentence types corresponding to 21792 sentence tokens

  15. Data: variation • Amount of evidence available at different points in development

  16. Data: variation • Amount of evidence available at different points in development • Amount comprehended at different points in development

  17. Data: amount available • Rough estimate, split by age:
  • Epoch 0: age 2;3 (1 file), 173 types (7.4%)
  • Epoch 1: 2;3 to 2;8 (11 files), 879 types (38%)
  • Epoch 2: 2;3 to 3;1 (22 files), 1295 types (55%)
  • Epoch 3: 2;3 to 3;5 (33 files), 1735 types (74%)
  • Epoch 4: 2;3 to 4;2 (44 files), 2090 types (89%)
  • Epoch 5: 2;3 to 5;2 (55 files), 2336 types (100%)

  18. Data: amount comprehended • Rough estimate, split by frequency:
  • Level 1: frequency 500+, 8 types (0.3% of types, 28% of tokens)
  • Level 2: frequency 100+, 37 types (1.6%, 55%)
  • Level 3: frequency 50+, 67 types (2.9%, 64%)
  • Level 4: frequency 25+, 115 types (4.9%, 71%)
  • Level 5: frequency 10+, 268 types (12%, 82%)
  • Level 6: frequency 1+ (all), 2336 types (100%, 100%)
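To make the split concrete, here is a small sketch assuming a dict mapping each sentence type to its token count; the age-based epoch split on the previous slide works analogously by accumulating transcript files up to each age cutoff.

```python
# Sketch: cumulative frequency levels (thresholds taken from the slide above).
LEVEL_THRESHOLDS = [500, 100, 50, 25, 10, 1]   # Level 1 .. Level 6

def frequency_levels(type_counts):
    """type_counts: dict mapping each sentence type to its token count."""
    total_tokens = sum(type_counts.values())
    levels = []
    for threshold in LEVEL_THRESHOLDS:
        kept = {t: c for t, c in type_counts.items() if c >= threshold}
        levels.append({
            "threshold": threshold,
            "num_types": len(kept),
            "pct_types": 100.0 * len(kept) / len(type_counts),
            "pct_tokens": 100.0 * sum(kept.values()) / total_tokens,
        })
    return levels
```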

  19. The model • Data • Child-directed speech (CHILDES) • Grammars • Linear & hierarchical • Both: Hand-designed & result of local search • Linear: automatic, unsupervised ML • Evaluation • Complexity vs. fit

  20. Grammar types
  Linear:
  • “Flat” grammar: rules are a list of each sentence
  • Regular grammar: rules of the form NT → t NT and NT → t
  • 1-state grammar: the same rule forms with a single non-terminal, so anything is accepted
  Hierarchical:
  • Context-free grammar: rules whose right-hand sides mix non-terminals and terminals, e.g. NT → NT NT, NT → t NT, NT → NT, NT → t
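For concreteness in the sketches that follow, any of these grammars can be represented as a set of weighted productions. The Python classes below are a hypothetical illustration, not the authors' implementation.

```python
# Sketch: a minimal representation shared by flat, regular, and context-free
# grammars; only the allowed right-hand-side shapes differ between types.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Production:
    lhs: str        # non-terminal being rewritten
    rhs: tuple      # sequence of terminals and/or non-terminals
    prob: float     # production probability (probabilities sum to 1 per lhs)

@dataclass
class Grammar:
    start: str
    productions: list = field(default_factory=list)

    def is_right_linear(self):
        """True for regular-style grammars: every rhs is 't' or 't NT'."""
        nonterminals = {p.lhs for p in self.productions}
        for p in self.productions:
            if len(p.rhs) == 1 and p.rhs[0] not in nonterminals:
                continue                      # NT -> t
            if (len(p.rhs) == 2 and p.rhs[0] not in nonterminals
                    and p.rhs[1] in nonterminals):
                continue                      # NT -> t NT
            return False
        return True
```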

  21. Specific hierarchical grammars: hand-designed
  • CFG-S (standard CFG): designed to be as linguistically plausible as possible; 77 rules, 15 non-terminals
  • CFG-L (larger CFG): derived from CFG-S; contains additional productions corresponding to different expansions of the same non-terminal (puts less probability mass on recursive productions); 133 rules, 15 non-terminals

  22. Specific linear grammars: hand-designed (ordered from exact fit / no compression to poor fit / high compression)
  • FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression)
  • 1-STATE: anything is accepted; 26 rules, 0 non-terminals (poor fit, high compression)

  23. Specific linear grammars: hand-designed
  • FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression)
  • REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
  • 1-STATE: anything is accepted; 26 rules, 0 non-terminals (poor fit, high compression)

  24. Specific linear grammars: hand-designed
  • FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression)
  • REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
  • REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
  • 1-STATE: anything is accepted; 26 rules, 0 non-terminals (poor fit, high compression)

  25. Specific linear grammars: hand-designed
  • FLAT: a list of each sentence; 2336 rules, 0 non-terminals (exact fit, no compression)
  • REG-N: narrowest regular grammar derived from the CFG; 289 rules, 85 non-terminals
  • REG-M: mid-level regular grammar derived from the CFG; 169 rules, 14 non-terminals
  • REG-B: broadest regular grammar derived from the CFG; 117 rules, 10 non-terminals
  • 1-STATE: anything is accepted; 26 rules, 0 non-terminals (poor fit, high compression)

  26. Automated search
  • Local search around the hand-designed grammars
  • Linear: unsupervised, automatic HMM learning (Goldwater & Griffiths, 2007), a Bayesian model for acquisition of a trigram HMM (designed for POS tagging, but given a corpus of syntactic categories it learns a regular grammar)
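Roughly, the local-search step can be pictured as greedy hill-climbing over grammars: propose small edits and keep an edit only if it improves the Bayesian score. The propose_edits and score callables below are hypothetical stand-ins; this is a sketch, not the search procedure of the paper or of Goldwater & Griffiths (2007).

```python
# Sketch: greedy local search around a hand-designed grammar.
def local_search(grammar, data, propose_edits, score, max_steps=1000):
    """propose_edits(grammar): assumed generator of candidate grammars made by
    small edits (e.g. merging two non-terminals, adding/deleting a production).
    score(grammar, data): the log posterior (complexity + fit)."""
    best, best_score = grammar, score(grammar, data)
    for _ in range(max_steps):
        improved = False
        for candidate in propose_edits(best):
            s = score(candidate, data)
            if s > best_score:
                best, best_score, improved = candidate, s, True
                break                 # restart from the improved grammar
        if not improved:
            break                     # no edit helps: a local optimum
    return best, best_score
```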

  27. The model • Data • Child-directed speech (CHILDES) • Grammars • Linear & hierarchical • Hand-designed & result of local search • Linear: automatic, unsupervised ML • Evaluation • Complexity vs. fit

  28. [Model diagram] • T: type of grammar (context-free, regular, flat, or 1-state), with an unbiased (uniform) prior over types • G: specific grammar of that type • D: data

  29. [Model diagram, annotated] • T: type of grammar (context-free, regular, flat, or 1-state) • G: specific grammar • D: data • The prior over grammars measures complexity; the likelihood of the data measures fit
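Concretely, each candidate grammar is scored by an unnormalized log posterior that adds a complexity term (the log prior) and a fit term (the log likelihood). A minimal sketch, assuming the caller supplies those two scorers (illustrative versions appear after slides 31 and 32 below) and an unbiased, uniform prior over the grammar types being compared:

```python
import math

def log_posterior(grammar, data, complexity_score, fit_score, num_types=4):
    """Unnormalized log P(G, T | D) = log P(T) + log P(G | T) + log P(D | G).

    complexity_score(grammar) plays the role of the log prior log P(G | T);
    fit_score(grammar, data) plays the role of the log likelihood log P(D | G);
    P(T) is uniform over the grammar types being compared."""
    return (-math.log(num_types)
            + complexity_score(grammar)     # complexity (prior)
            + fit_score(grammar, data))     # data fit (likelihood)
```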

  30. Tradeoff: complexity vs. fit • Low prior probability = more complex • Low likelihood = poor fit to the data • [Illustration of three grammars on this tradeoff: high fit / low simplicity, low fit / high simplicity, and moderate fit / moderate simplicity]

  31. Measuring complexity: prior • Designing a grammar (God's-eye view) • Grammars with more rules and non-terminals have lower prior probability • [Prior formula shown on slide] Notation: n = number of non-terminals; N_i = number of items in production i; P_k = number of productions of non-terminal k; V = vocabulary size; θ_k = production-probability parameters for non-terminal k
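The prior formula itself was an image on the slide and is not reproduced here. As an illustration only, the sketch below (using the Production/Grammar representation sketched after slide 20) assumes geometric distributions over the number of non-terminals n, the productions P_k per non-terminal, and the items N_i per production, with each item drawn uniformly from the n + V available symbols; the θ_k term is omitted. The paper's exact distributions may differ, but any choice along these lines assigns lower prior probability to grammars with more rules and more non-terminals.

```python
import math

def log_geometric(k, p=0.5):
    """log P(k) under a geometric distribution over k = 1, 2, 3, ..."""
    return (k - 1) * math.log(1 - p) + math.log(p)

def log_prior(grammar, vocab_size):
    """Sketch of a complexity-penalizing log prior over grammars.

    Assumed generative story: draw the number of non-terminals n, then for
    each non-terminal k its number of productions P_k, then for each
    production i its length N_i, then each item uniformly from the n + V
    possible symbols. (The production-probability parameters theta_k would
    contribute a further term, omitted here.)"""
    nonterminals = {p.lhs for p in grammar.productions}
    n = len(nonterminals)
    logp = log_geometric(n)
    for nt in nonterminals:
        prods = [p for p in grammar.productions if p.lhs == nt]
        logp += log_geometric(len(prods))                      # P_k
        for prod in prods:
            logp += log_geometric(len(prod.rhs))               # N_i
            logp -= len(prod.rhs) * math.log(n + vocab_size)   # uniform items
    return logp
```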

  32. Measuring fit: likelihood • The probability of that grammar generating the data • Product of the probability of each parse • Example parse of “pro aux det n”: 0.5 × 0.25 × 1.0 × 0.25 × 0.5 ≈ 0.016
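A quick numeric check of the slide's example, plus a sketch of how per-parse probabilities combine into a corpus log likelihood. The one-parse-per-sentence simplification is mine; with ambiguous sentences one would sum over parses instead.

```python
import math

def log_parse_prob(rule_probs):
    """log probability of one parse = sum of log production probabilities."""
    return sum(math.log(p) for p in rule_probs)

# The slide's example parse of "pro aux det n":
print(math.exp(log_parse_prob([0.5, 0.25, 1.0, 0.25, 0.5])))  # 0.015625, i.e. ~0.016

def log_likelihood(parse_rule_probs, token_counts):
    """Corpus log likelihood: one (best) parse per sentence type, weighted by
    how many tokens of that type appear in the corpus."""
    return sum(count * log_parse_prob(probs)
               for probs, count in zip(parse_rule_probs, token_counts))
```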

  33. Plan • Model • Data: corpus of child-directed speech (CHILDES) • Grammars • Linear & hierarchical • Hand-designed & result of local search • Linear: automated, unsupervised ML • Evaluation • Complexity vs. fit • Results • Implications

  34. Results: data split by frequency levels (estimate of comprehension) Log posterior probability (lower magnitude = better)

  35. Results: data split by age (estimate of availability)

  36. Results: data split by age (estimate of availability) Log posterior probability (lower magnitude = better)

  37. Generalization: How well does each grammar predict sentences it hasn’t seen?

  38. Generalization: How well does each grammar predict sentences it hasn’t seen? (e.g., complex interrogatives)
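One way to operationalize this question: hold out sentence types that never appear in the training data (such as complex interrogatives) and ask how much probability each trained grammar assigns to them. The sketch below is a hypothetical illustration of that evaluation, with sentence_logprob standing in for a parser-based scorer.

```python
def heldout_log_prob(grammar, heldout_types, sentence_logprob):
    """Average log probability assigned to unseen sentence types.

    sentence_logprob(grammar, sentence) is an assumed scorer returning the
    log probability of the sentence under the grammar, or float('-inf') if
    the grammar cannot generate it at all (as a flat grammar cannot for any
    sentence type outside its list)."""
    scores = [sentence_logprob(grammar, s) for s in heldout_types]
    return sum(scores) / len(scores)
```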

  39. Take-home messages • We have shown that, given reasonable domain-general assumptions, an unbiased rational learner could realize from typical child-directed input that language has hierarchical phrase structure • This paradigm is valuable: it makes the assumptions explicit and lets us rigorously evaluate how different representations trade off simplicity against fit to the data • In some ways, “higher-order” knowledge may be easier to learn than specific details (the “blessing of abstraction”)

  40. Implications for innateness? This is an ideal-learner analysis. • Strong(er) assumption: the learner can find the best grammar in the space of possibilities • Weak(er) assumptions: the learner can parse the corpus into syntactic categories; the learner can represent both linear and hierarchical grammars; a particular way of calculating complexity and data fit is assumed • Open question: have we actually found representative grammars?

  41. The End Thanks also to the following for many helpful discussions: Virginia Savova, Jeff Elman, Danny Fox, Adam Albright, Fei Xu, Mark Johnson, Ken Wexler, Ted Gibson, Sharon Goldwater, Michael Frank, Charles Kemp, Vikash Mansinghka, Noah Goodman

  42. [Backup slide: repeat of slide 25, “Specific linear grammars: hand-designed”]

  43. Why these results? • Natural language actually is generated by a grammar that looks more like a CFG • The other grammars overfit and therefore do not capture important language-specific generalizations

  44. Computing the prior… • Context-free grammar (CFG): rules such as NT → NT NT, NT → t NT, NT → NT, NT → t • Regular grammar (REG): rules of the form NT → t NT, NT → t

  45. Likelihood, intuitively • [Illustration: three candidate sources X, Y, Z and a set of data points] • Z: ruled out because it does not explain some of the data points • X and Y both “explain” the data points, but X is the more likely source

  46. Possible empirical tests • Present people with data the model learns FLAT, REG, and CFGs from; see which novel productions they generalize to • Non-linguistic? To small children? • Examples of learning regular grammars in real life: does the model do the same?
