EVALUATING MODELS OF PARAMETER SETTING Janet Dean Fodor Graduate Center, City University of New York
On behalf of CUNY-CoLAG, the CUNY Computational Language Acquisition Group. With support from PSC-CUNY. William G. Sakas, co-director; Carrie Crowther; Lisa Reisig-Ferrazzano; Atsu Inoue; Iglika Stoyneshka-Raleva; Xuan-Nga Kam; Virginia Teller; Yukiko Koizumi; Lidiya Tornyova; Eiji Nishimoto; Erika Troseth; Artur Niyazov; Tanya Viger; Iana Melnikova Pugach; Sam Wagner. www.colag.cs.hunter.cuny.edu
Before we start… Warning: I may skip some slides. But not to hide them from you. Every slide is at our website: www.colag.cs.hunter.cuny.edu
What we have done • A factory for testing models of parameter setting. • UG + 13 parameters → 3,072 languages (simplified but human-like). • Sentences of a target language are the input to a learning model. • Is learning successful? How fast? • Why?
Our Aims • A psycho-computational model of syntactic parameter setting. • Psychologically realistic. • Precisely specified. • Compatible with linguistic theory. • And…it must work!
Parameter setting as the solution (1981) • Avoids problems of rule-learning. • Only 20 (or 200) facts to learn. • Triggering is fast & automatic = no linguistic computation is necessary. • Accurate. • BUT: This has never been modeled.
Parameter setting as the problem (1990s) R. Clark and Gibson & Wexler have shown: • P-setting is not labor-free and not always successful. Because of… The parameter interaction problem. The parametric ambiguity problem. • Sentences do not reveal which parameter values generated them.
This evening… Parameter setting: • How severe are the problems? • Why do they matter? • How to escape them? • Moving forward: from problems to explorations.
Problem 1: Parameter interaction • Even independent parameters interact in derivations (Clark 1988, 1992). • Surface string reflects their combined effects. • So one parameter may have no distinctive, isolatable effect on sentences. = no trigger, no cue (cf. cue-based learner; Lightfoot 1991; Dresher 1999) • Parametric decoding is needed. Must disentangle the interactions, to identify which p-values a sentence requires.
Parametric decoding Decoding is not instantaneous. It is hard work. Because… • To know that a parameter value is necessary, it must be tested in the company of all the other p-values. • So whole grammars must be tested against the sentence. (Grammar-testing ≠ triggering!) • All grammars must be tested, to identify one correct p-value. (exponential!)
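To make the cost concrete, here is a minimal sketch of brute-force parametric decoding, not the CoLAG implementation: it assumes grammars are tuples of binary p-values and relies on a hypothetical parses(grammar, sentence) oracle. It tests every grammar and keeps only the p-values that all licensing grammars share, which is exactly why the naive approach is exponential in the number of parameters.

```python
from itertools import product

def decode(sentence, n_params, parses):
    """Brute-force parametric decoding (illustration only).

    `parses(grammar, sentence)` is a hypothetical oracle saying whether the
    grammar licenses the sentence. Every one of the 2**n_params grammars is
    tested; the p-values shared by all successful grammars are exactly the
    values the sentence unambiguously requires.
    """
    successful = [g for g in product((0, 1), repeat=n_params)
                  if parses(g, sentence)]
    if not successful:
        return None  # the sentence belongs to no language in the domain
    shared = {}
    for i in range(n_params):
        values = {g[i] for g in successful}
        if len(values) == 1:          # every licensing grammar agrees on this value
            shared[i] = values.pop()  # so the sentence sets parameter i
    return shared
```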
Decoding O3 Verb Subj O1[+WH] P Adv. • This sets: no wh-movement, p-stranding, head-initial VP, V to I to C, no affix hopping, C-initial, Subj-initial, no overt topic marking. • Doesn't set: obligatory topic, null subject, null topic.
More decoding Adv[+WH] P NOT Verb S KA. • This sets everything except ±overt topic marking. Verb[+FIN]. • This sets nothing, not even +null subject.
Problem 2: Parametric ambiguity • A sentence may belong to more than one language. • A p-ambiguous sentence doesn't reveal the target p-values (even if decoded). • Learner must guess (= inaccurate) or pass (= slow, and until when?). • How much p-ambiguity is there in natural language? Not quantified; probably vast.
Scale of the problem (exponential) • P-interaction and p-ambiguity are likely to increase with the # of parameters. • How many parameters are there? 20 parameters → 2^20 grammars = over a million 30 parameters → 2^30 grammars = over a billion 40 parameters → 2^40 grammars = over a trillion 100 parameters → 2^100 grammars = ???
Learning models must scale up • Testing all grammars against each input sentence is clearly impossible. • So research has turned to search methods: how to sample and test the huge field of grammars efficiently. Genetic algorithms (e.g., Clark 1992) Hill-climbing algorithms (e.g., Gibson & Wexler's TLA 1994)
Our approach • Retain a central aspect of classic triggering: Input sentences guide the learner toward the p-values they need. • Decode on-line; parsing routines do the work. (They’re innate.) • Parse the input sentence (just as adults do, for comprehension) until it crashes. • Then the parser draws on other p-values, to find one that can patch the parse-tree.
Structural Triggers Learners (CUNY) • STLs find one grammar for each sentence. • More than that would require parallel parsing, beyond human capacity. • But the parser can tell on-line if there is (possibly) more than one candidate. • If so: guess, or pass (wait for unambiguous input). • Considers only real candidate grammars; directed by what the parse-tree needs.
Summary so far… • Structural triggers learners (STLs) retain an important aspect of triggering (p-decoding). • Compatible with current psycholinguistic models of sentence processing. • Hold promise of being efficient. (Home in on target grammar, within human resource limits.) • Now: Do they really work, in a domain with realistic parametric ambiguity?
Evaluating learning models • Do any models work? • Reliably? Fast? Within human resources? • Do decoding models work better than domain-search (grammar-testing) models? • Within decoding models, is guessing better or worse than waiting?
Hope it works! If not… • The challenge: What is UG good for? • All that innate knowledge, only a few facts to learn, but you can’t say how! • Instead, one simple learning procedure:Adjust the weights in a neural network; Record statistics of co-occurrence frequencies. • Nativist theories of human language are vulnerable until some UG-based learner is shown to perform well.
Non-UG-based learning • Christiansen, M.H., Conway, C.M. and Curtin, S. (2000) A connectionist single-mechanism account of rule-like behavior in infancy. In Proceedings of the 22nd Annual Conference of the Cognitive Science Society, 83-88. Mahwah, NJ: Lawrence Erlbaum. • Culicover, P.W. and Nowak, A. (2003) Dynamical Grammar. Vol. Two of Foundations of Syntax. Oxford, UK: Oxford University Press. • Lewis, J.D. and Elman, J.L. (2002) Learnability and the statistical structure of language: Poverty of stimulus arguments revisited. In B. Skarabela et al. (eds.) Proceedings of BUCLD 26. Somerville, MA: Cascadilla Press. • Pereira, F. (2000) Formal grammar and information theory: Together again? Philosophical Transactions of the Royal Society, Series A 358, 1239-1253. • Seidenberg, M.S. and MacDonald, M.C. (1999) A probabilistic constraints approach to language acquisition and processing. Cognitive Science 23, 569-588. • Tomasello, M. (2003) Constructing a Language: A Usage-Based Theory of Language Acquisition. Cambridge, MA: Harvard University Press.
The CUNY simulation project • We program learning algorithms proposed in the literature (12 so far). • Run each one on a large domain of human-like languages. 1,000 trials (1,000 'children') each. • Success rate: % of trials that identify the target. • Speed: average # of input sentences consumed until the learner has identified the target grammar. • Reliability/speed: # of input sentences for 99% of trials (99% of 'children') to attain the target. • Subset Principle violations and one-step local maxima excluded by fiat. (Explained below as necessary.)
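A sketch of how these three measures could be computed, assuming a hypothetical run_trial(model, target) function that simulates one 'child' and returns whether it converged and how many input sentences it consumed (none of these names come from the CoLAG code):

```python
import statistics

def evaluate(model, target, run_trial, n_trials=1000):
    """Compute success rate, average speed, and the 99%-of-children speed (sketch)."""
    results = [run_trial(model, target) for _ in range(n_trials)]   # 1,000 'children'
    consumed_ok = [n for converged, n in results if converged]

    success_rate = len(consumed_ok) / n_trials          # % of trials that identify the target
    avg_speed = statistics.mean(consumed_ok) if consumed_ok else float('inf')

    # Sentences needed for 99% of trials to attain the target; infinite if >1% fail.
    all_counts = sorted(n if converged else float('inf') for converged, n in results)
    speed_99 = all_counts[int(0.99 * n_trials) - 1]
    return success_rate, avg_speed, speed_99
```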
Designing the language domain • Realistically large, to test which models scale up well. • As much like natural languages as possible. • Except, input limited like child-directed speech. • Sentences must have fully specified tree structure (not just word strings), to test models like the STL. • Should reflect theoretically defensible linguistic analyses (though simplified). • Grammar format should allow rapid conversion into the operations of an effective parsing device.
Selection criteria for our domain We have given priority to syntactic phenomena which: • Occur in a high proportion of known natural languages; • Occur often in speech directed to 2-3 year olds; • Pose learning problems of theoretical interest; • Are a focus of linguistic / psycholinguistic research; • Have syntactic analyses that are broadly agreed on.
By these criteria • Questions, imperatives. • Negation, adverbs. • Null subjects, verb movement. • Prep-stranding, affix-hopping (though not widespread!). • Wh-movement, but no scrambling yet.
Not yet included • No LF interface (cf. Villavicencio 2000). • No ellipsis; no discourse contexts to license fragments. • No DP-internal structure, Case, or agreement. • No embedding (only degree-0). • No feature checking as implementation of movement parameters (Chomsky 1995ff.). • No LCA / Antisymmetry (Kayne 1994ff.).
Our 13 parameters (so far) Parameter (default value): • Subject Initial (SI): yes • Object Final (OF): yes • Complementizer Initial (CI): initial • V to I Movement (VtoI): no • I to C Movement (of aux or verb) (ItoC): no • Question Inversion (Qinv = I to C in questions only): no • Affix Hopping (AH): no • Obligatory Topic (vs. optional) (ObT): yes • Topic Marking (TM): no • Wh-Movement obligatory (vs. none) (Wh-M): no • Pied Piping (vs. preposition stranding) (PI): piping • Null Subject (NS): no • Null Topic (NT): no
Parameters are not all independent Constraints on P-value combinations: • If [+ ObT] then [- NS]. (A topic-oriented language does not have null subjects.) • If [- ObT] then [- NT]. (A subject-oriented language does not have null topics.) • If [+ VtoI] then [- AH]. (If verbs raise to I, affix hopping does not occur.) (This is why only 3,072 grammars, not 8,192.)
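A quick way to verify the count, under the simplifying assumption that every parameter is a plain binary choice; the parameter names are taken from the table above, but the code itself is only an illustration:

```python
from itertools import product

# The 13 binary parameters, in the order of the table above.
PARAMS = ["SI", "OF", "CI", "VtoI", "ItoC", "Qinv", "AH",
          "ObT", "TM", "Wh-M", "PI", "NS", "NT"]

def licit(values):
    """The three cross-parameter constraints stated on this slide."""
    v = dict(zip(PARAMS, values))
    if v["ObT"] and v["NS"]:        # if [+ObT] then [-NS]
        return False
    if not v["ObT"] and v["NT"]:    # if [-ObT] then [-NT]
        return False
    if v["VtoI"] and v["AH"]:       # if [+VtoI] then [-AH]
        return False
    return True

grammars = [g for g in product((False, True), repeat=13) if licit(g)]
print(len(grammars))    # 3072 (out of 2**13 = 8192 unconstrained combinations)
```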
Input sentences • Universal lexicon: S, Aux, O1, P, etc. • Input is word strings only, no structure. • Except, the learner knows all word categories and all grammatical roles! • Equivalent to some semantic bootstrapping; no prosodic bootstrapping (yet!)
Learning procedures In all models tested (unless noted), learning is: • Incremental = hypothesize a grammar after each input. No memory for past input. • Error-driven = if Gcurrent can parse the sentence, retain it. • Models differ in what the learner does when Gcurrent fails = grammar change is needed.
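Schematically, the shared loop looks like the sketch below; can_parse and revise are hypothetical stand-ins, and the individual models described next differ only in how revise picks a new grammar:

```python
def learn(g_initial, input_stream, can_parse, revise):
    """Generic incremental, error-driven learning loop (sketch only).

    One sentence at a time, no memory for past input. If the current grammar
    parses the sentence, nothing changes; only a parse failure (an 'error')
    triggers a grammar change, delegated to the model-specific `revise`.
    """
    g_current = g_initial
    for sentence in input_stream:
        if can_parse(g_current, sentence):
            continue                              # retain G-current
        g_current = revise(g_current, sentence)   # model-specific change
    return g_current
```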
The learning models: preview • Learners that decode: STLs. Waiting ('squeaky clean'); Guessing. • Grammar-testing learners: Triggering Learning Algorithm (G&W); Variational Learner (Yang 2000). • …plus benchmarks for comparison: one too powerful, one too weak.
Learners that decode: STLs • Strong STL: Parallel parse input sentence, find all successful grammars. Adopt the p-values they share. (A useful benchmark, not a psychological model.) • Waiting STL: Serial parse. Note any choice-point in the parse. Set no parameters after a choice. (Never guesses. Needs fully unambiguous triggers.) (Fodor 1998a) • Guessing STLs: Serial. At a choice-point, guess. (Can learn from p-ambiguous input.) (Fodor 1998b)
Guessing STLs' guessing principles If there is more than one new p-value that could patch the parse tree… • Any Parse: Pick at random. • Minimal Connections: Pick the p-value that gives the simplest tree. (cf. MA + LC) • Least Null Terminals: Pick the parse with the fewest empty categories. (cf. MCP) • Nearest Grammar: Pick the grammar that differs least from Gcurrent.
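The guessing principles could be rendered roughly as follows; the candidates are assumed to be objects carrying the patched grammar, a tree-node count, and a null-terminal count (hypothetical attributes, for illustration only):

```python
import random

def choose_patch(candidates, strategy, g_current):
    """Select one candidate patch per guessing principle (sketch only)."""
    if strategy == "any_parse":              # pick at random
        return random.choice(candidates)
    if strategy == "minimal_connections":    # simplest tree
        return min(candidates, key=lambda c: c.tree_nodes)
    if strategy == "least_null_terminals":   # fewest empty categories
        return min(candidates, key=lambda c: c.null_terminals)
    if strategy == "nearest_grammar":        # fewest p-values changed from G-current
        return min(candidates,
                   key=lambda c: sum(a != b for a, b in zip(c.grammar, g_current)))
    raise ValueError("unknown strategy: " + strategy)
```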
Grammar-testing: TLA • Error-driven random: Adopt any grammar. (Another baseline; not a psychological model.) • TLA (Gibson & Wexler, 1994): Change any one parameter. Try the new grammar on the sentence. Adopt it if the parse succeeds. Else pass. • Non-greedy TLA (Berwick & Niyogi, 1996): Change any one parameter. Adopt it. (No test of new grammar against the sentence.) • Non-SVC TLA (B&N 96): Try any grammar other than Gcurrent. Adopt it if the parse succeeds.
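For concreteness, one TLA revision step and its Berwick & Niyogi variants might be sketched like this (grammars as tuples of booleans; can_parse is a hypothetical oracle; this is not the published implementation):

```python
import random

def tla_step(g_current, sentence, can_parse, greedy=True, svc=True):
    """One error-driven TLA revision step (sketch).

    svc=True    : Single Value Constraint -- flip exactly one parameter.
    greedy=True : Greediness -- adopt the candidate only if it parses the sentence.
    Setting either flag to False gives the corresponding Berwick & Niyogi variant.
    """
    if svc:
        i = random.randrange(len(g_current))
        candidate = g_current[:i] + (not g_current[i],) + g_current[i + 1:]
    else:
        candidate = tuple(random.random() < 0.5 for _ in g_current)
        if candidate == g_current:       # must differ from G-current
            return g_current
    if not greedy or can_parse(candidate, sentence):
        return candidate                 # adopt the new grammar
    return g_current                     # else pass: keep G-current
```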
Grammar-testing models with memory • Variational Learner (Yang 2000, 2002) has memory for success / failure of p-values. • A p-value is: rewarded if in a grammar that parsed an input; punished if in a grammar that failed. • Reinforcement is approximate, because of interaction. A good p-value in a bad grammar is punished, and vice versa.
With memory: Error-driven VL • Yang’s VL is not error-driven. It chooses p-values with probability proportional to their current success weights. So it occasionally tries out unlikely p-values. • Error-driven VL (Sakas & Nishimoto, 2002) Like Yang’s original, but: First, set each parameter to its currently more successful value. Only if that fails, pick a different grammar as above.
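A sketch of the weight updates, roughly along the lines of a linear reward-penalty scheme; weights[i] is the current probability of choosing the marked value of parameter i, GAMMA is an illustrative learning rate, and can_parse is again a hypothetical oracle:

```python
import random

GAMMA = 0.02   # illustrative learning rate

def vl_step(weights, sentence, can_parse, error_driven=False):
    """One Variational-Learner update (sketch, not Yang's code).

    error_driven=True gives the Sakas & Nishimoto variant: try the currently
    favored grammar first, and change nothing if it parses the sentence.
    """
    if error_driven:
        favored = tuple(w >= 0.5 for w in weights)
        if can_parse(favored, sentence):
            return weights
    # Sample a grammar: each p-value chosen with its current success weight.
    g = tuple(random.random() < w for w in weights)
    success = can_parse(g, sentence)
    new_weights = []
    for w, chose_marked in zip(weights, g):
        p = w if chose_marked else 1 - w             # prob. of the value actually chosen
        if success:
            p = p + GAMMA * (1 - p)                  # reward every p-value in the grammar
        else:
            p = (1 - GAMMA) * p                      # punish every p-value in the grammar
        new_weights.append(p if chose_marked else 1 - p)
    return new_weights
```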
Previous simulation results • TLA is slower than error-driven random on the G&W domain, even when it succeeds (Berwick & Niyogi 1996). • TLA sometimes performs better, e.g., in strongly smooth domains (Sakas 2000, 2003). • TLA fails on 3 of G&W's 8 languages, and on 95.4% of Kohl's 2,304 languages. • There is no default grammar that can avoid TLA learning failures. The best starting grammar succeeds only 43% of the time (Kohl 1999). • Some TLA-unlearnable languages are quite natural, e.g., Swedish-type settings (Kohl 1999). • The Waiting STL is paralyzed by weakly equivalent grammars (Bertolo et al. 1997).
Summary of performance • Not all models scale up well. • 'Squeaky-clean' models (Strong / Waiting STL) fail often. Need unambiguous triggers. • Decoding models which guess are most efficient. • On-line parsing strategies make good learning strategies. (?) • Even with decoding, conservative domain search fails often (Nearest Grammar STL). • Thus: Learning-by-parsing fulfills its promise. Psychologically natural 'triggering' is efficient.
Now that we have a workable model… • Use it to investigate questions of interest: • Are some languages easier than others? • Do default starting p-values help? • Does overt morphological marking facilitate syntax learning? • etc….. • Compare with psycholinguistic data, where possible. This tests the model further, and may offer guidelines for real-life studies.
What makes a language easier? • Language difficulty is not predicted by how many of the target p-settings are defaults. • Probably what matters is parametric ambiguity: overlap with neighboring languages; lack of almost-unambiguous triggers. • Are non-attested languages the difficult ones? (Kohl, 1999: explanatory!)
Sensitivity to input properties • How does the informativeness of the input affect learning rate? • Theoretical interest: To what extent can UG-based p-setting be input-paced? • If an input-pacing profile does not match child learners, that could suggest biological timing (e.g., maturation).
Some input properties • Morphological marking of syntactic features: Case, Agreement, Finiteness. • The target language may not provide them. Or the learner may not know them. • Do they speed up learning? Or just create more work?
Input properties, cont'd For real children, it is likely that: • Semantics / discourse pragmatics signals illocutionary force: [ILLOC DEC], [ILLOC Q] or [ILLOC IMP]. • Semantics and/or syntactic context reveals the SUBCAT (argument structure) of verbs. • Prosody reveals some phrase boundaries (as well as providing illocutionary cues).
Making finiteness audible • [+/-FIN] distinguishes Imperatives from Declaratives. (So does [ILLOC], but it's inaudible.) • Imperatives have a null subject. E.g., Verb O1. • A child who interprets an IMP input as a DEC could mis-set [+NS] for a [-NS] language. • Does learning become faster / more accurate when [+/-FIN] is audible? No. Why not? • Because the Subset Principle requires the learner to parse IMP/DEC-ambiguous sentences as IMP.
Providing semantic info: ILLOC • Suppose real children know whether an input is Imperative, Declarative or Question. • This is relevant to [+ItoC] vs. [+Qinv]. ([+Qinv] = [+ItoC] in questions only) • Does learning become faster / more accurate when [ILLOC] is audible? No. It's slower! • Because it's just one more thing to learn. • Without ILLOC, a learner could get all word strings right, but their ILLOCs and p-values all wrong – and count as successful.
Providing SUBCAT information • Suppose real children can bootstrap verb argument structure from meaning / local context. • This can reveal when an argument is missing. How can O1, O2 or PP be missing? Only by [+NT]. • If [+NT], then also [+ObT] and [-NS] (in our UG). • Does learning become faster / more accurate when learners know SUBCAT? Yes. Why? • The SP doesn't choose between no-topic and null-topic. Other triggers are rare. So triggers for [+NT] are useful.