Slides from a workshop talk on parameter-setting models of syntax acquisition and the psychological resources they assume. The talk traces the shift from rule-based acquisition to innately guided parameter setting, the problem of recognizing triggers in the input, the limits of switch-flipping as a learning mechanism, and the search for unambiguous triggers that would make grammar acquisition efficient.
‘Ideal’ Language Learning and the Psychological Resource Problem MIT Workshop: Where does syntax come from? Have we all been wrong? October 19, 2007 William Gregory Sakas Janet Dean Fodor City University of New York (CUNY)
CUNY-CoLAG CUNY Computational Language Acquisition Group http://www.colag.cs.hunter.cuny.edu CUNY-CoLAG graduate students: David Brizan, Carrie Crowther, Arthur Hoskey, Xuân-Nga Kam, Iglika Stoyneshka, Lidiya Tornyova
Agenda • What we do • What’s troubling us • An invitation to discussion
CoLAG research • Language domain: Create a large domain of parameterized languages for evaluating learning models. (Fodor & Sakas 04) • Poverty of stimulus: Watch-dog role on ‘richness of stimulus’ claims. (Kam et al. 05, i.p.) • Learnability problems: Solving modeling problems: noise; overgeneration. (Fodor & Sakas 06; Crowther et al. 04) • Testing models: Compare efficiency of 12 parameter-setting models. (Fodor & Sakas 04)
A conceptual history of modeling P-setting • Psychologically feasible learning (Pinker 1979) • Triggering – the ideal (Chomsky 1981) • But too many interactions (Clark 1989) • Domain search by parse-test (Gibson & Wexler 1994) • Better domain search (Yang 2000) • Scanning for I-triggers (Lightfoot 1999) • Parse-test with unambiguous triggers (Fodor 1998) • Back to innate triggers? (Sakas & Fodor, in prep.)
A driving force - limiting resources • All models were driven by an attempt to limit the resources to what can reasonably be attributed to young children. (What is reasonable? Needs sharpening, but not urgently. << Aren’t we gonna ask ‘them’?>>) • Limit the complexity of innate knowledge (of the possible grammars, of subset relations among them, triggers, etc.) • Limit the storage of input sentences, storage of statistics over the input, storage of grammars tested & rejected, etc. • Limit the amount of input needed in order to attain the target grammar. • Limit the amount of processing of each input sentence for the purpose of extracting the information it contains.
Learnability vs. Feasibility • Learnability = study of what can be learned in principle • Feasibility = study of how efficiently learning takes place • Learnability is a non-issue for (finite) parametric domains • We ask the feasibility question: Given reasonable psychocomputational resources, does a learner converge after encountering a reasonable number of input sentences?
Reasonable psychocomputational resources • Full parallel parse, retrieving all structural analyses of an input string. • Multiple parses of the same input string, with different grammars. • Memory for all (most, many) previous input sentences. • On-line computation of language-inclusion relationships (needed to apply the Subset Principle for conservative learning). • A mental list of disconfirmed grammars. • An online-computed probabilistic metric for all grammars. << Reword>> Finite number of parameters still implies exponential search space of grammar hypotheses
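As a rough illustration of the last point, the arithmetic can be sketched in a few lines of Python. The sketch assumes every parameter is binary and that values combine freely; `grammar_count` is a hypothetical helper name, not something from the talk.

```python
def grammar_count(n_params: int) -> int:
    """Upper bound on grammar hypotheses for n binary, freely combining parameters."""
    return 2 ** n_params


# Assumption: binary parameters with no constraints among them.
for n in (3, 13, 30):
    print(n, grammar_count(n))
```

Three parameters give the 8 grammars of a 3-P domain, while 13 give an upper bound of 8,192. A domain with constraints among its parameters can contain fewer legal grammars than this bound.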
Reasonable psychocomputational resources • At most two parses per input sentence (inherited from Gibson & Wexler) • Memory for a small number of ‘informative’ input sentences (e.g., sentences that had caused an hypothesis change in the past, i.e., potential ‘trigger’ sentences) • An online-computed probabilistic metric for each parameter (inherited from Yang)
Triggering as switch-flipping – the ideal • Rule-based acquisition was never feasibly implemented. Data combing, hypothesis formation, little innate guidance. • Shift to parameter theory: languages differ only in lexicon and (including) values of a small number of parameters. • Tight UG guidance: Finite set of candidate grammars. • Input sentences ‘trigger’ p-values. That is: the learning mechanism knows which parameter values to adopt to license each sentence, without linguistic computation. • Memoryless (‘incremental’) learning. Choose next grammar hypothesis on basis of current input sentence only. • Maybe deterministic: Parameters set just once, correctly.
But: interactions and ambiguity (Clark) • What can Pat sing? Why is the object in initial position? • +WH-movement or +Scrambling • Pat expects Sue to win. How is case licensed on “Sue”? • ECM: matrix verb governs lower subject. Or SCM: non-finite Infl assigns case to its subject. • An error here may then cause a further error regarding long-distance anaphora. • Over-optimistic: For each parameter, an unambiguous trigger, innately specified. Realistically, a potential trigger may be masked by other derivational phenomena. • The null subject parameter is not typical!
Recognition problem for triggers • How does the switch-setting mechanism work? Given sentence s, how does LM know which switches to flip? • This problem arises even for unambiguous triggers. • G&W’s 3-P domain had an unambiguous trigger for each P in each grammar. But 5 of 8 languages were not acquirable, because LM couldn’t access the trigger info. • Two alternatives rejected by Gibson & Wexler: (i) Innate trigger description. (Needs global triggers or huge list.) (ii) On-line calculation of effects of Ps. (Excessive computation.) • Instead, the TLA: Trial-and-error. Pick a P-grammar (by criteria). Try parsing the sentence. If it works, adopt the grammar. Wastes input; not ‘automatic’, not error-free, not deterministic. NOT TRIGGERING! (But the parse test is good.)
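The trial-and-error loop described above can be sketched as follows. This is a minimal sketch under stated assumptions, not Gibson & Wexler's implementation: the four-grammar domain `TOY_LANGUAGES`, the sentence strings, and the helper names are all hypothetical, and set membership stands in for the parse test. The one-parameter flip and the rule of adopting the new grammar only if it parses follow the usual description of the TLA.

```python
import random

# Hypothetical toy domain: each grammar is a tuple of binary parameter
# values, mapped to the set of sentence strings it licenses.
TOY_LANGUAGES = {
    (0, 0): {"S V", "S V O"},
    (0, 1): {"S V", "O V S"},
    (1, 0): {"V S", "S V O"},
    (1, 1): {"V S", "O V S"},
}


def parses(grammar, sentence):
    """Stand-in for the parse test: does this grammar license the sentence?"""
    return sentence in TOY_LANGUAGES[grammar]


def tla_step(current, sentence, rng):
    """One update: change nothing unless the current grammar fails."""
    if parses(current, sentence):
        return current
    i = rng.randrange(len(current))  # flip exactly one parameter at random
    candidate = current[:i] + (1 - current[i],) + current[i + 1:]
    # Adopt the candidate only if it parses the sentence that caused the error.
    return candidate if parses(candidate, sentence) else current


def run_tla(target, start, n_inputs, seed=0):
    """Feed random target-language sentences to a memoryless learner."""
    rng = random.Random(seed)
    sentences = sorted(TOY_LANGUAGES[target])
    grammar = start
    for _ in range(n_inputs):
        grammar = tla_step(grammar, rng.choice(sentences), rng)
    return grammar


print(run_tla(target=(1, 1), start=(0, 0), n_inputs=200))
```

In this toy domain every wrong grammar has a one-flip neighbor that parses some target sentence, so the learner eventually reaches the target. Real domains need not be so kind: where no single flip yields a successful parse, the same loop stays stuck.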
Improving the parse-test approach • By design, the TLA was a very impoverished mechanism. • Minimal resources → little info extracted from input → slow. It could only follow its nose through the language domain, led on by an occasional lucky successful parse. • Yang (2000) gave it a memory for the fate of past hypotheses, so that it could accumulate knowledge. • Recording parse-test results grammar by grammar isn’t practical, so record them P by P. A weight represents the success of each value. This is approximate (a good P-value may sit in a bad grammar, and vice versa). • Mechanism: Select a grammar to test next, with probability based on the weights of its P-values. If it parses the sentence, upgrade the weights of its P-values; if it fails, downgrade them.
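The weight-based mechanism can be sketched as below. This is an illustrative sketch, not Yang's code: the toy domain, the helper names, the starting weights of 0.5, and the learning rate are hypothetical. The linear reward-penalty update is likewise an assumption chosen for illustration, since the slide only says that weights are upgraded on success and downgraded on failure.

```python
import random

# Hypothetical toy domain: grammar (tuple of binary P-values) -> licensed sentences.
TOY_LANGUAGES = {
    (0, 0): {"S V", "S V O"},
    (0, 1): {"S V", "O V S"},
    (1, 0): {"V S", "S V O"},
    (1, 1): {"V S", "O V S"},
}


def sample_grammar(weights, rng):
    """Pick each P-value independently; weights[i] is the probability of value 1."""
    return tuple(1 if rng.random() < w else 0 for w in weights)


def update(weights, grammar, success, rate):
    """Reward the tested values after a successful parse, penalize them after a failure."""
    new_weights = []
    for w, value in zip(weights, grammar):
        # Move toward 1 if value 1 succeeded or value 0 failed; otherwise toward 0.
        toward_one = (value == 1) == success
        new_weights.append(w + rate * (1 - w) if toward_one else w * (1 - rate))
    return new_weights


def run_variational(target, n_inputs, rate=0.05, seed=0):
    """Assumed setup: uniform starting weights, random target-language input."""
    rng = random.Random(seed)
    sentences = sorted(TOY_LANGUAGES[target])
    weights = [0.5] * len(target)
    for _ in range(n_inputs):
        sentence = rng.choice(sentences)
        grammar = sample_grammar(weights, rng)
        weights = update(weights, grammar, sentence in TOY_LANGUAGES[grammar], rate)
    return weights


print(run_variational(target=(1, 1), n_inputs=2000))
```

Because every parse test changes every weight, the learner uses failures as well as successes. The weights are only approximate evidence, though: a correct value is penalized whenever it happens to be sampled inside a grammar that fails.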
Pro and con Yang’s ‘variational learner’ • This mechanism makes more use of the input: It gains knowledge from every parse-test, not just from successes. • Also, it avoids the learning failures (local maxima) of TLA, by occasionally sampling unlikely grammars. This breaks it free of a false direction without losing accumulated knowledge. • It does all this while still using the parser to implicitly identify triggers. (No innate specification of triggers is needed.) • But still inefficient trial-&-error. Yang’s simulations and ours agree: an order of magnitude more input consumed than other models. • Still non-deterministic. After sufficient weight of evidence, it may lock in a P-value; but mostly it doesn’t know which value is right. So each parameter may swing back-and-forth repeatedly.
I-triggers and E-triggers • Lightfoot (1999) aimed for deterministic P-setting, by means that G&W had rejected as infeasible. He claimed that simple global triggers can be defined. They are I-language properties. • E.g. Trigger for V2 is [SpecCP Nonsubject ] [C Verb+fin ] • Mechanism: Learner “scans the input” for I-triggers. • It’s not switch-setting, but splendid if it works. However, the input is surface word strings. Would need translation from deep global I-triggers to surface language-specific E-triggers. Linguists do try, but no systematic compilation to date. • Can it be done?? E.g., for every V2 language, regardless of all other properties, what word-string property reveals +V2? • This is what the G&W parse-test did for free. Human parser is designed to fit trees to strings. No need to specify the E-triggers.
A mechanism for detecting I-triggers • Fodor (1998) also recommended I-triggers, & proposed a mechanism for recognizing them in word strings. Don’t look for them. Donate them to the parser when it hangs up for lack of them. • This gets more goodness out of the parse-test. Not just yes/no, but which P-value did the work – adopt that one. • Closer to Chomsky’s original concept of triggering: An input sentence tells the learning mechanism which Ps can license it. (Not: First pick a grammar and then see if it works.) We call this parametric decoding. Next best thing to switches! • Not domain search, but I-to-E trigger conversion by the parser. Uses existing resources. Also, an unambiguous trigger could set a P indelibly, halving the search space for later Ps.
Unambig triggers → deterministic learning? • Mechanism for deterministic learning based on unambiguous triggers only: • If only one parse, adopt the P-values that contributed to it. • If more than one parse, the input is ambiguous, so: (a) Discard it entirely. Problem: this leaves too few unambiguous triggers; confirmed by our simulation data: learning fails often. (b) Adopt just the P-values that are in all of the parses. Problem: Requires a full parallel parse of the sentence, but adult parsing data indicate that parallel parsing is limited at best – insufficient for learners to do full parametric decoding of an input word string. • Over the resource limits again! Non-determinism unavoidable?
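Option (b) above amounts to intersecting the P-values of every analysis of the sentence. The sketch below illustrates only that logic, under loud assumptions: the toy domain and all names are hypothetical, each grammar stands in for one parse, and listing every licensing grammar plays the role of the full parallel parse that the slide says exceeds realistic resources.

```python
# Hypothetical toy domain: grammar (tuple of binary P-values) -> licensed sentences.
# "Adv V" is licensed by (0, 0) and (1, 1), so on its own it settles no parameter.
TOY_LANGUAGES = {
    (0, 0): {"S V", "S V O", "Adv V"},
    (0, 1): {"S V", "O V S"},
    (1, 0): {"V S", "S V O"},
    (1, 1): {"V S", "O V S", "Adv V"},
}


def decode(sentence, languages, fixed):
    """Return the P-values shared by all grammars that license the sentence
    and agree with the values already set in `fixed` (index -> value)."""
    compatible = [
        g for g, lang in languages.items()
        if sentence in lang and all(g[i] == v for i, v in fixed.items())
    ]
    if not compatible:
        return {}
    first = compatible[0]
    return {
        i: first[i]
        for i in range(len(first))
        if all(g[i] == first[i] for g in compatible)
    }


def deterministic_learner(sentences, languages):
    """Set a parameter only when every remaining analysis agrees on its value."""
    fixed = {}
    for sentence in sentences:
        fixed.update(decode(sentence, languages, fixed))
    return fixed


print(deterministic_learner(["Adv V", "V S", "Adv V"], TOY_LANGUAGES))
```

The first "Adv V" sets nothing because its two analyses disagree on both parameters. Once "V S" has fixed the first parameter, the same string becomes an unambiguous trigger for the second.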
Taking stock so far • Parameters may be psychologically real. Triggers for P-values may exist. But an account of how human learners know and use the triggers to set the P’s has been flummoxed by: • how to detect the triggers in the input… • without overstepping reasonable limits on innate knowledge or processing capacity… • or resorting to linguistically-undirected search through the field of all possible grammars. • Attractiveness of Ps for capturing linguistic diversity is enhanced by the promise of P-setting as an efficient learning procedure, to explain the speed & uniformity of human language acquisition. • So psycho-computational research needs to deliver a good implementation of P-setting! We’ll try. But first: another problem.
The Subset Principle wants unambiguity • A final blow! The Subset Principle runs amok in incremental learning if it’s not based on unambiguous triggers (which we’ve just seen aren’t realistic for a parse-test mechanism). • Summary of the SP problem (see more below): • Insufficient negative evidence demands conservative learning. • Adopt the smallest language compatible with the available data. (SP) • Without memory for past inputs, available data = the current sentence. • SP demands an absurdly small language, losing past facts acquired! • E.g. “It’s bedtime”. Give up wh-movt, topicalization, passive, aux-inversion, prep-stranding, etc. • SP says: Retrench on existing P-values when setting a new one.
SP-retrenchment is due to trigger ambiguity • LM cannot trust any parameter values adopted on the basis of input that was (or even might have been) parametrically ambiguous. • The sentences introduced by those p-values may not be in the target language. • So LM cannot hold onto those sentences (those P-values) when setting another parameter later. • Give up all sentences / all marked parameter values that are not entailed by the current input sentence! • Unless – you know that you set the parameter on the basis of a globally unambiguous trigger.
So: Back to unambiguous triggers • We are exploring three solutions to excessive retrenchment: • Add memory for past input. (Non-incremental; Fodor & Sakas 2005) • Add memory for disconfirmed grammars. (Fodor, Sakas & Hoskey 2007) • Don’t retrench on Ps set by unambiguous evidence. (Here, today) • The first two are computational solutions: add resources. The third is where linguistics could contribute. • Even if only a few Ps have unambiguous triggers & the learner knows them, setting them might disambiguate triggers for others. • But the learning system has to know when a trigger is unambig. • Innate unambig I-triggers, translated to E-triggers, could do it. So let’s go back to this idea. Can it work, for at least some Ps? And can one P provide an unambig trigger for the next one?
Unambiguous triggers in CoLAG domain • We ask: For each of the 13 Ps in our domain, does it have at least one unambiguous E-trigger for at least one of its values, in every language with that value? • If only for one value, the other one could be taken as default. So we ask: Is this default linguistically plausible? • For each P-value that has unambiguous E-triggers, what do those triggers have in common? Do they embody a single, global I-trigger? • If so, are the E-triggers transparently related to the I-trigger, so that a learner could recognize it without excessive linguistic computation? • Seeking unambiguous, global, transparent triggers.
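The first question above can be phrased as a search over the domain. The following is a minimal sketch of that search, not the CoLAG group's code: the toy domain and the function names are hypothetical, and a sentence counts as an unambiguous E-trigger for a P-value when every grammar that licenses it has that value.

```python
# Hypothetical toy domain: grammar (tuple of binary P-values) -> licensed sentences.
TOY_LANGUAGES = {
    (0, 0): {"S V", "S V O", "Adv V"},
    (0, 1): {"S V", "O V S"},
    (1, 0): {"V S", "S V O"},
    (1, 1): {"V S", "O V S", "Adv V"},
}


def unambiguous_triggers(languages, param, value):
    """For each grammar with languages-index `param` set to `value`, collect its
    sentences that are licensed only by grammars sharing that value."""
    result = {}
    for grammar, lang in languages.items():
        if grammar[param] != value:
            continue
        result[grammar] = {
            s for s in lang
            if all(g[param] == value for g, other in languages.items() if s in other)
        }
    return result


def has_trigger_everywhere(languages, param, value):
    """True if every language with this P-value contains at least one such trigger."""
    return all(unambiguous_triggers(languages, param, value).values())


print(unambiguous_triggers(TOY_LANGUAGES, 0, 1))
print(has_trigger_everywhere(TOY_LANGUAGES, 0, 1))
```

If the check fails for one value of a parameter but succeeds for the other, the value lacking triggers is the candidate default. Whether the triggers found share a single I-trigger, and whether they are transparent, are linguistic questions that a search like this cannot answer.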
Facts about CoLAG languages • Sentences: S Aux V Adv, Wh-O1 V S PP, O3-wa V P ka • Universal lexicon: S, O1, P, Aux, -wa, etc. (Inherited and expanded from Gibson & Wexler’s TLA simulations.) • Input stream = random sequence of all sentences in the lg. Input is word strings only; tree structure isn’t ‘audible’. • Except: Learner is given lexical categories & grammatical roles! (Some semantic bootstrapping, no prosodic bootstrapping.) • All sentences are degree-0 (no embedded clauses). A language has 545 sentences on average. • The domain contains: 28,924 distinct sentences (as strings); 60,272 distinct syntactic trees; 3,072 distinct sets of P-values (grammars). • Average ambiguity of sentences = 53 languages per sentence.
Parameters in the domain 13 standard Ps, simplified. Necessarily old-fashioned, but illustrate interesting learning problems.
Some lgs lack unambig triggers for a P-value Missing = there are lgs that need the value but w/o an unambiguous trigger for it.
E.g., What are the triggers for headedness? • An easy example: Headedness (in IP, VP, PP). • I-triggers: H-Initial: I before VP, V before complements, P before NP. H-Final: I after VP, V after complements, P after NP. • Many E-triggers: H-Initial: P O3. Non-initial Aux V. Non-initial O1 O2. H-Final: Non-initial O3 P. V NOT Aux. • Linking facts: • Inside VP, only V O1 O2 PP Adv order, or the reverse. • XPs can move (only) leftward into Spec,CP. Head-movement: Only Aux or finite Verb move, only to I or C (leftward or rightward). C-direction may differ from IP, VP etc. • A general strategy for P-setting: Transparent. To set underlying order, do not rely on any movable item in a possible landing site position.
DECIDE WITH JANET WHICH “MORE DIFFICULT” TRIGGERING EXAMPLE TO PUT HERE
Summary Despite Clark’s pessimism & Lightfoot’s over-optimism: Is it conceivable that infants acquire language rapidly because they have innate knowledge of unambiguous P-triggers? • We’ve seen it’s not guaranteed that where there’s info in the input, there’s a trigger. • Also, even if there’s a trigger, it may not be simple & transparent & global enough to be psychologically plausible. • Nevertheless, we found unambiguous global triggers for 9 of 13 Ps. And we’ve done pretty well at rescuing the other 4 – though there’s more work to be done. • We’ve used P-ordering to simplify triggers for setting later Ps, so their triggers don’t need to fold in triggers for more basic Ps. • We’ve found conditional triggers, where setting earlier Ps indelibly creates new unambiguous triggers for later Ps.
Prospects • No general success yet. E.g. only 30% of +V2 languages have an unambig trigger. But maybe because V2 needs breaking into several Ps; also, we haven’t yet fully explored ordering/conditioning for V2. • Additional methods for investigating: • Run new simulations in which Ps with unambiguous triggers are set permanently. Does it speed convergence? Do any errors emerge? • In existing simulation data, look for Ps that got set early, & led to rapid convergence. Can we detect an optimal ordering? • If we can’t find triggers for a P in some context, can we show that just that P is irrelevant? Ideal defense of a trigger model! • If not, then we’re looking for a computational solution (far from original triggering concept), rather than a linguistic solution. Will we find legitimate (ordered) triggers to set all the Ps?
Your advice, please • So - should we keep trying? • How much does it matter whether P-triggering can be shown to work, within the resource limits of normal children? • Would the linguistic value of Ps still stand even if acquisition consisted in massive statistical analysis of the learner’s corpus? • How does MP change things? Generally, the more abstract and explanatory the linguistic analysis, the greater the divergence between E-triggers and I-triggers. Hard on learners. • Does your systematization of Ps change things for acquisition? Can it help us with the E-trigger / I-trigger translation? • An alliance of linguistics and computation may be what it takes.