EVALUATING MODELS OF PARAMETER SETTING Janet Dean Fodor Graduate Center, City University of New York
On behalf of CUNY-CoLAG, the CUNY Computational Language Acquisition Group. With support from PSC-CUNY. William G. Sakas, co-director; Carrie Crowther; Lisa Reisig-Ferrazzano; Atsu Inoue; Iglika Stoyneshka-Raleva; Xuan-Nga Kam; Virginia Teller; Yukiko Koizumi; Lidiya Tornyova; Eiji Nishimoto; Erika Troseth; Artur Niyazov; Tanya Viger; Iana Melnikova Pugach; Sam Wagner. www.colag.cs.hunter.cuny.edu
Before we start… Warning: I may skip some slides. But not to hide them from you. Every slide is at our website: www.colag.cs.hunter.cuny.edu
What we have done • A factory for testing models of parameter setting. • UG + 13 parameters → 3,072 languages (simplified but human-like). • Sentences of a target language are the input to a learning model. • Is learning successful? How fast? • Why?
Our Aims • A psycho-computational model of syntactic parameter setting. • Psychologically realistic. • Precisely specified. • Compatible with linguistic theory. • And…it must work!
Parameter setting as the solution (1981) • Avoids problems of rule-learning. • Only 20 (or 200) facts to learn. • Triggering is fast & automatic = no linguistic computation is necessary. • Accurate. • BUT: This has never been modeled.
Parameter setting as the problem (1990s) R. Clark and Gibson & Wexler have shown: • P-setting is not labor-free and not always successful. Because of… The parameter interaction problem. The parametric ambiguity problem. • Sentences do not reveal which parameter values generated them.
This evening… Parameter setting: • How severe are the problems? • Why do they matter? • How to escape them? • Moving forward: from problems to explorations.
Problem 1: Parameter interaction • Even independent parameters interact in derivations (Clark 1988, 1992). • Surface string reflects their combined effects. • So one parameter may have no distinctive, isolatable effect on sentences. = no trigger, no cue (cf. cue-based learner; Lightfoot 1991; Dresher 1999) • Parametric decoding is needed. Must disentangle the interactions, to identify which p-values a sentence requires.
Parametric decoding Decoding is not instantaneous. It is hard work. Because… • To know that a parameter value is necessary, it must be tested in the company of all the other p-values. • So whole grammars must be tested against the sentence. (Grammar-testing ≠ triggering!) • All grammars must be tested, to identify one correct p-value. (exponential!)
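To make the cost concrete, here is a minimal sketch of brute-force parametric decoding, not the CoLAG implementation: it assumes grammars are tuples of binary p-values and relies on a hypothetical parses(grammar, sentence) oracle. It tests every grammar and keeps only the p-values that all licensing grammars share, which is exactly why the naive approach is exponential in the number of parameters.

```python
from itertools import product

def decode(sentence, n_params, parses):
    """Brute-force parametric decoding (illustration only).

    `parses(grammar, sentence)` is a hypothetical oracle saying whether the
    grammar licenses the sentence. Every one of the 2**n_params grammars is
    tested; the p-values shared by all successful grammars are exactly the
    values the sentence unambiguously requires.
    """
    successful = [g for g in product((0, 1), repeat=n_params)
                  if parses(g, sentence)]
    if not successful:
        return None  # the sentence belongs to no language in the domain
    shared = {}
    for i in range(n_params):
        values = {g[i] for g in successful}
        if len(values) == 1:          # every licensing grammar agrees on this value
            shared[i] = values.pop()  # so the sentence sets parameter i
    return shared
```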
Decoding O3 Verb Subj O1[+WH] P Adv. • This sets: no wh-movement, p-stranding, head-initial VP, V to I to C, no affix hopping, C-initial, Subj-initial, no overt topic marking. • Doesn't set: obligatory topic, null subject, null topic.
More decoding Adv[+WH] P NOT Verb S KA. • This sets everything except ±overt topic marking. Verb[+FIN]. • This sets nothing, not even +null subject.
Problem 2: Parametric ambiguity • A sentence may belong to more than one language. • A p-ambiguous sentence doesn't reveal the target p-values (even if decoded). • Learner must guess (= inaccurate) or pass (= slow, and until when?). • How much p-ambiguity is there in natural language? Not quantified; probably vast.
Scale of the problem (exponential) • P-interaction and p-ambiguity are likely to increase with the # of parameters. • How many parameters are there? 20 parameters → 2^20 grammars = over a million 30 parameters → 2^30 grammars = over a billion 40 parameters → 2^40 grammars = over a trillion 100 parameters → 2^100 grammars = ???
Learning models must scale up • Testing all grammars against each input sentence is clearly impossible. • So research has turned to search methods: how to sample and test the huge field of grammars efficiently. Genetic algorithms (e.g., Clark 1992) Hill-climbing algorithms (e.g., Gibson & Wexler's TLA 1994)
Our approach • Retain a central aspect of classic triggering: Input sentences guide the learner toward the p-values they need. • Decode on-line; parsing routines do the work. (They’re innate.) • Parse the input sentence (just as adults do, for comprehension) until it crashes. • Then the parser draws on other p-values, to find one that can patch the parse-tree.
Structural Triggers Learners (CUNY) • STLs find one grammar for each sentence. • More than that would require parallel parsing, beyond human capacity. • But the parser can tell on-line if there is (possibly) more than one candidate. • If so: guess, or pass (wait for unambiguous input). • Considers only real candidate grammars; directed by what the parse-tree needs.
Summary so far… • Structural triggers learners (STLs) retain an important aspect of triggering (p-decoding). • Compatible with current psycholinguistic models of sentence processing. • Hold promise of being efficient. (Home in on target grammar, within human resource limits.) • Now: Do they really work, in a domain with realistic parametric ambiguity?
Evaluating learning models • Do any models work? • Reliably? Fast? Within human resources? • Do decoding models work better than domain-search (grammar-testing) models? • Within decoding models, is guessing better or worse than waiting?
Hope it works! If not… • The challenge: What is UG good for? • All that innate knowledge, only a few facts to learn, but you can’t say how! • Instead, one simple learning procedure:Adjust the weights in a neural network; Record statistics of co-occurrence frequencies. • Nativist theories of human language are vulnerable until some UG-based learner is shown to perform well.
Non-UG-based learning • Christiansen, M.H., Conway, C.M. and Curtin, S. (2000) A connectionist single-mechanism account of rule-like behavior in infancy. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, 83-88. Mahwah, NJ: Lawrence Erlbaum. • Culicover, P.W. and Nowak, A. (2003) Dynamical Grammar. Vol. Two of Foundations of Syntax. Oxford, UK: Oxford University Press. • Lewis, J.D. and Elman, J.L. (2002) Learnability and the statistical structure of language: Poverty of stimulus arguments revisited. In B. Skarabela et al. (eds.) Proceedings of BUCLD 26. Somerville, MA: Cascadilla Press. • Pereira, F. (2000) Formal grammar and information theory: Together again? Philosophical Transactions of the Royal Society, Series A 358, 1239-1253. • Seidenberg, M.S. and MacDonald, M.C. (1999) A probabilistic constraints approach to language acquisition and processing. Cognitive Science 23, 569-588. • Tomasello, M. (2003) Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge, MA: Harvard University Press.
The CUNY simulation project • We program learning algorithms proposed in the literature (12 so far). • Run each one on a large domain of human-like languages. 1,000 trials (1,000 'children') each. • Success rate: % of trials that identify the target. • Speed: average # of input sentences consumed until the learner has identified the target grammar. • Reliability/speed: # of input sentences for 99% of trials (99% of 'children') to attain the target. • Subset Principle violations and one-step local maxima excluded by fiat. (Explained below as necessary.)
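A sketch of how these three measures could be computed, assuming a hypothetical run_trial(model, target) function that simulates one 'child' and returns whether it converged and how many input sentences it consumed (none of these names come from the CoLAG code):

```python
import statistics

def evaluate(model, target, run_trial, n_trials=1000):
    """Compute success rate, average speed, and the 99%-of-children speed (sketch)."""
    results = [run_trial(model, target) for _ in range(n_trials)]   # 1,000 'children'
    consumed_ok = [n for converged, n in results if converged]

    success_rate = len(consumed_ok) / n_trials          # % of trials that identify the target
    avg_speed = statistics.mean(consumed_ok) if consumed_ok else float('inf')

    # Sentences needed for 99% of trials to attain the target; infinite if >1% fail.
    all_counts = sorted(n if converged else float('inf') for converged, n in results)
    speed_99 = all_counts[int(0.99 * n_trials) - 1]
    return success_rate, avg_speed, speed_99
```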
Designing the language domain • Realistically large, to test which models scale up well. • As much like natural languages as possible. • Except, input limited like child-directed speech. • Sentences must have fully specified tree structure (not just word strings), to test models like the STL. • Should reflect theoretically defensible linguistic analyses (though simplified). • Grammar format should allow rapid conversion into the operations of an effective parsing device.
Selection criteria for our domain We have given priority to syntactic phenomena which: • Occur in a high proportion of known natural languages; • Occur often in speech directed to 2-3 year olds; • Pose learning problems of theoretical interest; • Are a focus of linguistic / psycholinguistic research; • Have syntactic analyses that are broadly agreed on.
By these criteria • Questions, imperatives. • Negation, adverbs. • Null subjects, verb movement. • Prep-stranding, affix-hopping (though not widespread!). • Wh-movement, but no scrambling yet.
Not yet included • No LF interface (cf. Villavicencio 2000). • No ellipsis; no discourse contexts to license fragments. • No DP-internal structure, Case, or agreement. • No embedding (only degree-0). • No feature checking as implementation of movement parameters (Chomsky 1995ff.). • No LCA / Antisymmetry (Kayne 1994ff.).
Our 13 parameters (so far) Parameter (default value): • Subject Initial (SI): yes • Object Final (OF): yes • Complementizer Initial (CI): initial • V to I Movement (VtoI): no • I to C Movement (of aux or verb) (ItoC): no • Question Inversion (Qinv = I to C in questions only): no • Affix Hopping (AH): no • Obligatory Topic (vs. optional) (ObT): yes • Topic Marking (TM): no • Wh-Movement obligatory (vs. none) (Wh-M): no • Pied Piping (vs. preposition stranding) (PI): piping • Null Subject (NS): no • Null Topic (NT): no
Parameters are not all independent Constraints on P-value combinations: • If [+ ObT] then [- NS]. (A topic-oriented language does not have null subjects.) • If [- ObT] then [- NT]. (A subject-oriented language does not have null topics.) • If [+ VtoI] then [- AH]. (If verbs raise to I, affix hopping does not occur.) (This is why only 3,072 grammars, not 8,192.)
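A quick way to verify the count, under the simplifying assumption that every parameter is a plain binary choice; the parameter names are taken from the table above, but the code itself is only an illustration:

```python
from itertools import product

# The 13 binary parameters, in the order of the table above.
PARAMS = ["SI", "OF", "CI", "VtoI", "ItoC", "Qinv", "AH",
          "ObT", "TM", "Wh-M", "PI", "NS", "NT"]

def licit(values):
    """The three cross-parameter constraints stated on this slide."""
    v = dict(zip(PARAMS, values))
    if v["ObT"] and v["NS"]:        # if [+ObT] then [-NS]
        return False
    if not v["ObT"] and v["NT"]:    # if [-ObT] then [-NT]
        return False
    if v["VtoI"] and v["AH"]:       # if [+VtoI] then [-AH]
        return False
    return True

grammars = [g for g in product((False, True), repeat=13) if licit(g)]
print(len(grammars))    # 3072 (out of 2**13 = 8192 unconstrained combinations)
```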
Input sentences • Universal lexicon: S, Aux, O1, P, etc. • Input is word strings only, no structure. • Except, the learner knows all word categories and all grammatical roles! • Equivalent to some semantic bootstrapping; no prosodic bootstrapping (yet!)
Learning procedures In all models tested (unless noted), learning is: • Incremental = hypothesize a grammar after each input. No memory for past input. • Error-driven = if Gcurrent can parse the sentence, retain it. • Models differ in what the learner does when Gcurrent fails = grammar change is needed.
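Schematically, the shared loop looks like the sketch below; can_parse and revise are hypothetical stand-ins, and the individual models described next differ only in how revise picks a new grammar:

```python
def learn(g_initial, input_stream, can_parse, revise):
    """Generic incremental, error-driven learning loop (sketch only).

    One sentence at a time, no memory for past input. If the current grammar
    parses the sentence, nothing changes; only a parse failure (an 'error')
    triggers a grammar change, delegated to the model-specific `revise`.
    """
    g_current = g_initial
    for sentence in input_stream:
        if can_parse(g_current, sentence):
            continue                              # retain G-current
        g_current = revise(g_current, sentence)   # model-specific change
    return g_current
```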
The learning models: preview • Learners that decode: STLs. Waiting ('squeaky clean'); Guessing. • Grammar-testing learners: Triggering Learning Algorithm (G&W); Variational Learner (Yang 2000). • …plus benchmarks for comparison: one too powerful, one too weak.
Learners that decode: STLs • Strong STL: Parallel parse input sentence, find all successful grammars. Adopt the p-values they share. (A useful benchmark, not a psychological model.) • Waiting STL: Serial parse. Note any choice-point in the parse. Set no parameters after a choice. (Never guesses. Needs fully unambiguous triggers.) (Fodor 1998a) • Guessing STLs: Serial. At a choice-point, guess. (Can learn from p-ambiguous input.) (Fodor 1998b)
Guessing STLs' guessing principles If there is more than one new p-value that could patch the parse tree… • Any Parse: Pick at random. • Minimal Connections: Pick the p-value that gives the simplest tree. (cf. MA + LC) • Least Null Terminals: Pick the parse with the fewest empty categories. (cf. MCP) • Nearest Grammar: Pick the grammar that differs least from Gcurrent.
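The guessing principles could be rendered roughly as follows; the candidates are assumed to be objects carrying the patched grammar, a tree-node count, and a null-terminal count (hypothetical attributes, for illustration only):

```python
import random

def choose_patch(candidates, strategy, g_current):
    """Select one candidate patch per guessing principle (sketch only)."""
    if strategy == "any_parse":              # pick at random
        return random.choice(candidates)
    if strategy == "minimal_connections":    # simplest tree
        return min(candidates, key=lambda c: c.tree_nodes)
    if strategy == "least_null_terminals":   # fewest empty categories
        return min(candidates, key=lambda c: c.null_terminals)
    if strategy == "nearest_grammar":        # fewest p-values changed from G-current
        return min(candidates,
                   key=lambda c: sum(a != b for a, b in zip(c.grammar, g_current)))
    raise ValueError("unknown strategy: " + strategy)
```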
Grammar-testing: TLA • Error-driven random: Adopt any grammar. (Another baseline; not a psychological model.) • TLA (Gibson & Wexler, 1994): Change any one parameter. Try the new grammar on the sentence. Adopt it if the parse succeeds. Else pass. • Non-greedy TLA (Berwick & Niyogi, 1996): Change any one parameter. Adopt it. (No test of new grammar against the sentence.) • Non-SVC TLA (B&N 96): Try any grammar other than Gcurrent. Adopt it if the parse succeeds.
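For concreteness, one TLA revision step and its Berwick & Niyogi variants might be sketched like this (grammars as tuples of booleans; can_parse is a hypothetical oracle; this is not the published implementation):

```python
import random

def tla_step(g_current, sentence, can_parse, greedy=True, svc=True):
    """One error-driven TLA revision step (sketch).

    svc=True    : Single Value Constraint -- flip exactly one parameter.
    greedy=True : Greediness -- adopt the candidate only if it parses the sentence.
    Setting either flag to False gives the corresponding Berwick & Niyogi variant.
    """
    if svc:
        i = random.randrange(len(g_current))
        candidate = g_current[:i] + (not g_current[i],) + g_current[i + 1:]
    else:
        candidate = tuple(random.random() < 0.5 for _ in g_current)
        if candidate == g_current:       # must differ from G-current
            return g_current
    if not greedy or can_parse(candidate, sentence):
        return candidate                 # adopt the new grammar
    return g_current                     # else pass: keep G-current
```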
Grammar-testing models with memory • Variational Learner (Yang 2000, 2002) has memory for success / failure of p-values. • A p-value is: rewarded if in a grammar that parsed an input; punished if in a grammar that failed. • Reinforcement is approximate, because of interaction. A good p-value in a bad grammar is punished, and vice versa.
With memory: Error-driven VL • Yang’s VL is not error-driven. It chooses p-values with probability proportional to their current success weights. So it occasionally tries out unlikely p-values. • Error-driven VL (Sakas & Nishimoto, 2002) Like Yang’s original, but: First, set each parameter to its currently more successful value. Only if that fails, pick a different grammar as above.
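A sketch of the weight updates, roughly along the lines of a linear reward-penalty scheme; weights[i] is the current probability of choosing the marked value of parameter i, GAMMA is an illustrative learning rate, and can_parse is again a hypothetical oracle:

```python
import random

GAMMA = 0.02   # illustrative learning rate

def vl_step(weights, sentence, can_parse, error_driven=False):
    """One Variational-Learner update (sketch, not Yang's code).

    error_driven=True gives the Sakas & Nishimoto variant: try the currently
    favored grammar first, and change nothing if it parses the sentence.
    """
    if error_driven:
        favored = tuple(w >= 0.5 for w in weights)
        if can_parse(favored, sentence):
            return weights
    # Sample a grammar: each p-value chosen with its current success weight.
    g = tuple(random.random() < w for w in weights)
    success = can_parse(g, sentence)
    new_weights = []
    for w, chose_marked in zip(weights, g):
        p = w if chose_marked else 1 - w             # prob. of the value actually chosen
        if success:
            p = p + GAMMA * (1 - p)                  # reward every p-value in the grammar
        else:
            p = (1 - GAMMA) * p                      # punish every p-value in the grammar
        new_weights.append(p if chose_marked else 1 - p)
    return new_weights
```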
Previous simulation results • TLA is slower than error-driven random on the G&W domain, even when it succeeds (Berwick & Niyogi 1996). • TLA sometimes performs better, e.g., in strongly smooth domains (Sakas 2000, 2003). • TLA fails on 3 of G&W's 8 languages, and on 95.4% of Kohl's 2,304 languages. • There is no default grammar that can avoid TLA learning failures. The best starting grammar succeeds only 43% of the time (Kohl 1999). • Some TLA-unlearnable languages are quite natural, e.g., Swedish-type settings (Kohl 1999). • The Waiting STL is paralyzed by weakly equivalent grammars (Bertolo et al. 1997).
Summary of performance • Not all models scale up well. • 'Squeaky-clean' models (Strong / Waiting STL) fail often. Need unambiguous triggers. • Decoding models which guess are most efficient. • On-line parsing strategies make good learning strategies. (?) • Even with decoding, conservative domain search fails often (Nearest Grammar STL). • Thus: Learning-by-parsing fulfills its promise. Psychologically natural 'triggering' is efficient.
Now that we have a workable model… • Use it to investigate questions of interest: • Are some languages easier than others? • Do default starting p-values help? • Does overt morphological marking facilitate syntax learning? • etc….. • Compare with psycholinguistic data, where possible. This tests the model further, and may offer guidelines for real-life studies.
What makes a language easier? • Language difficulty is not predicted by how many of the target p-settings are defaults. • Probably what matters is parametric ambiguity: overlap with neighboring languages; lack of almost-unambiguous triggers. • Are non-attested languages the difficult ones? (Kohl, 1999: explanatory!)
Sensitivity to input properties • How does the informativeness of the input affect learning rate? • Theoretical interest: To what extent can UG-based p-setting be input-paced? • If an input-pacing profile does not match child learners, that could suggest biological timing (e.g., maturation).
Some input properties • Morphological marking of syntactic features: Case, Agreement, Finiteness. • The target language may not provide them. Or the learner may not know them. • Do they speed up learning? Or just create more work?
Input properties, cont'd For real children, it is likely that: • Semantics / discourse pragmatics signals illocutionary force: [ILLOC DEC], [ILLOC Q] or [ILLOC IMP]. • Semantics and/or syntactic context reveals the SUBCAT (argument structure) of verbs. • Prosody reveals some phrase boundaries (as well as providing illocutionary cues).
Making finiteness audible • [+/-FIN] distinguishes Imperatives from Declaratives. (So does [ILLOC], but it's inaudible.) • Imperatives have a null subject. E.g., Verb O1. • A child who interprets an IMP input as a DEC could mis-set [+NS] for a [-NS] language. • Does learning become faster / more accurate when [+/-FIN] is audible? No. Why not? • Because the Subset Principle requires the learner to parse IMP/DEC-ambiguous sentences as IMP.
Providing semantic info: ILLOC • Suppose real children know whether an input is Imperative, Declarative or Question. • This is relevant to [+ItoC] vs. [+Qinv]. ([+Qinv] = [+ItoC] in questions only) • Does learning become faster / more accurate when [ILLOC] is audible? No. It's slower! • Because it's just one more thing to learn. • Without ILLOC, a learner could get all word strings right, but their ILLOCs and p-values all wrong – and count as successful.
Providing SUBCAT information • Suppose real children can bootstrap verb argument structure from meaning / local context. • This can reveal when an argument is missing. How can O1, O2 or PP be missing? Only by [+NT]. • If [+NT], then also [+ObT] and [-NS] (in our UG). • Does learning become faster / more accurate when learners know SUBCAT? Yes. Why? • The SP doesn't choose between no-topic and null-topic. Other triggers are rare. So triggers for [+NT] are useful.