Modeling the Effect of Cross-Language Ambiguity on Human Syntax Acquisition
Fourth Conference on Computational Natural Language Learning
13 Sept 2000, Lisbon, Portugal
William Gregory Sakas
Why computationally model human language acquisition?
• Pinker, 1979: "...it may be necessary to find out how language learning could work in order for the developmental data to tell us how it does work." [emphasis mine]
Primary point of this talk: It is not enough to build a series of computer simulations of a cognitive model of human language acquisition and claim that it mirrors the process by which a child acquires language. The (perhaps obvious) fact is that learners are acutely sensitive to cross-language ambiguity. Whether or not a learning model is ultimately successful as a cognitive model is an empirical issue; it depends on the 'fit' of the simulations with the facts about the distribution of ambiguity in human languages.
What’s coming:
• 1) Background on some linguistic theories of acquisition
• 2) A case study analysis of one parameter-setting model, the Structural Triggers Learner (STL), chosen for three reasons:
i) the algorithm takes to heart current generative linguistic theory
ii) it is not dependent on a particular grammar formalism
iii) the mathematics of the Markov analysis is straightforward
• 3) Conjectures and a proposed research agenda
Learnability - Under what conditions is learning possible?
Feasibility - Is acquisition possible within a reasonable amount of time and/or with a reasonable amount of work?
Principles and Parameters Framework
All languages share universal principles (UG), e.g. all languages have subjects of some sort.
Languages differ with respect to the settings of a finite number of parameters, e.g. overt subjects are optional in a sentence (yes / no): English - no; Spanish - yes.
Null Subject Parameter (on / off) - optional overt subjects
A three-parameter domain (Gibson and Wexler, 1994)
SV / VS - subject precedes verb / verb precedes subject
VO / OV - verb precedes object / object precedes verb
+V2 / -V2 - the verb or aux must be in the second position in the sentence
'Sentences' are strings of the symbols: S, V, O1, O2, AUX, ADV
e.g. "Mari will feed the bird" = S AUX V O
Two example languages (finite, degree-0)
SV VO -V2 (English-like): S V, S V O, S V O1 O2, S AUX V, S AUX V O, S AUX V O1 O2, ADV S V, ADV S V O, ADV S V O1 O2, ADV S AUX V, ADV S AUX V O, ADV S AUX V O1 O2
SV OV +V2 (German-like): S V, S V O, O V S, S V O2 O1, O1 V S O2, O2 V S O1, S AUX V, S AUX O V, O AUX S V, S AUX O2 O1 V, O1 AUX S O2 V, O2 AUX S O1 V, ADV V S, ADV V S O, ADV V S O2 O1, ADV AUX S V, ADV AUX S O V, ADV AUX S O2 O1 V
Surprisingly, G&W's simple 3-parameter domain presents nontrivial obstacles to several types of learning strategies, but the space is ultimately learnable (G&W 1994; Berwick & Niyogi 1996; Frank and Kapur 1996; Turkel 1996; Bertolo In press).
Big question: How will the learning process scale up in terms of feasibility as the number of parameters increases?
Two problems for most acquisition strategies: 1) ambiguity, 2) size of the domain.
Ambiguity across two example languages (finite, degree-0): the same two languages as above, with a few ambiguous strings indicated. The ambiguous strings are those generated by both grammars, i.e. the strings that appear in both lists (e.g. S V, S V O, S AUX V).
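Since cross-language ambiguity here just means that a string is generated by more than one grammar, the ambiguous strings in this toy domain can be found mechanically. A minimal check (sentence lists copied from the slide above; Python used purely for illustration):

```python
# Cross-language-ambiguous strings are simply those generated by both grammars;
# for the two example languages above, the overlap can be computed directly.
english_like = {"S V", "S V O", "S V O1 O2", "S AUX V", "S AUX V O",
                "S AUX V O1 O2", "ADV S V", "ADV S V O", "ADV S V O1 O2",
                "ADV S AUX V", "ADV S AUX V O", "ADV S AUX V O1 O2"}
german_like = {"S V", "S V O", "O V S", "S V O2 O1", "O1 V S O2", "O2 V S O1",
               "S AUX V", "S AUX O V", "O AUX S V", "S AUX O2 O1 V",
               "O1 AUX S O2 V", "O2 AUX S O1 V", "ADV V S", "ADV V S O",
               "ADV V S O2 O1", "ADV AUX S V", "ADV AUX S O V",
               "ADV AUX S O2 O1 V"}
print(english_like & german_like)   # {'S V', 'S V O', 'S AUX V'} (order may vary)
```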
Ambiguity robs the learner of certainty that a parameter value is correct for the target language. And the search space is exponential:
# parameters = 30, so # grammars = 2^30 = 1,073,741,824
Search heuristics need to be employed.
Creating an input space for a linguistically plausible (large) domain, as simulations would require, is not practical. So, how to answer questions of feasibility as the number of grammars (exponentially) scales up? Answer: introduce some formal notions in order to abstract away from the specific linguistic content of the input data.
A hybrid approach (formal/empirical) • 1) formalize the learning process and input space • 2) use the formalization in a Markov structure to empirically test the learner across a wide range of learning scenarios • The framework gives general data on the expected performance of acquisition algorithms. Can answer the question: • Given learner L, if the input space exhibits characteristics x, y and z, is feasible learning possible?
A case study: The Structural Triggers Learner (Fodor 1998)
Some background assumptions
No negative evidence - The input sample or text is a randomly drawn collection of positive (grammatical) examples of sentences from L(Gtarg).
One hypothesis at a time - The learner evaluates one grammar at a time. The current hypothesis Gcurr denotes the grammar being entertained by the learner at some particular point in time.
Successful acquisition - The learner converges on a grammar Gtarg when Gcurr = Gtarg and Gcurr never changes.
The Parametric Principle (Fodor 1995, 1998; Sakas and Fodor In press): Set individual parameters. Do not evaluate whole grammars.
• Halves the size of the grammar pool with each successful learning event.
• e.g. when 5 of 30 parameters are set, only about 3% of the grammar pool remains (see the quick arithmetic below).
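The quick arithmetic behind the 3% figure (illustrative only):

```python
# With r binary parameters, setting k of them leaves 2**(r - k) of the 2**r
# candidate grammars, i.e. a fraction 2**(-k) of the pool.
r, k = 30, 5
print(f"{2 ** (r - k):,} of {2 ** r:,} grammars remain ({2 ** -k:.1%})")
# 33,554,432 of 1,073,741,824 grammars remain (3.1%)
```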
Problem:
• The Parametric Principle requires certainty.
• But how to know when a sentence may be parametrically ambiguous?
Solution: the Structural Triggers Learner (STL), Fodor (1995, 1998).
For the STL, a parameter value = structural trigger = "treelet", e.g. the V-before-O (VO / OV) parameter: [VP V O] (e.g. English) vs. [VP O V] (e.g. German).
STL Algorithm
• — Receive a sentence.
• — Parse it with the current grammar Gcurr.
• Success: keep Gcurr.
• Failure: parse with Gcurr + all parametric treelets; adopt the treelets that contributed.
So, the STL
• — uses the parser to decode the parametric signatures of sentences
• — can detect parametric ambiguity: don't learn from sentences that contain a choice point (= the waiting-STL variant)
• — and thus can abide by the Parametric Principle
(A toy sketch of the waiting-STL loop follows.)
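A toy sketch of the waiting-STL loop just described. The real learner uses the parser to decode a sentence's parametric signature; here that signature (the parameter values the sentence requires, plus whether decoding it involved a parametric choice point) is supplied directly. The Sentence representation and the names below are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Sentence:
    signature: dict   # parameter -> required value (the treelets the sentence needs)
    ambiguous: bool   # True if decoding involved a parametric choice point

def waiting_stl(text, g_curr=None):
    """text: positive examples from L(Gtarg); g_curr: partial grammar (param -> value).
    Assumes input is drawn from the target, so values adopted from unambiguous
    evidence never conflict with later sentences."""
    g_curr = dict(g_curr or {})
    for s in text:
        needed = {p: v for p, v in s.signature.items() if p not in g_curr}
        if not needed:
            continue           # current grammar already parses it: keep Gcurr
        if s.ambiguous:
            continue           # choice point detected: wait, learn nothing
        g_curr.update(needed)  # adopt the treelets that contributed to the parse
    return g_curr

# Example: the first input is parametrically ambiguous, so only the second teaches.
text = [Sentence({"VO": 1, "V2": 1}, ambiguous=True),
        Sentence({"VO": 1}, ambiguous=False)]
print(waiting_stl(text))       # {'VO': 1}
```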
Computationally modeling STL performance: a state space for the STL performing in a 3-parameter domain, with states 0, 1, 2, 3.
• the nodes represent the current number of parameters that have been set - not grammars
• the arcs represent a possible change in the number of parameters that have been set
Here, each input may express 0, 1 or 2 new parameters.
Transition probabilities for the waiting-STL depend on:
Learner's state
• the number of parameters that have been set (t)
Formalization of the input space
• the number of relevant parameters (r)
• the expression rate (e)
• the ambiguity rate (a)
• the "effective" expression rate (e′)
(A simplified sketch of such a chain follows.)
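A minimal sketch of the kind of Markov computation involved, not the paper's actual transition formulas. The simplifying assumptions are mine: every sentence expresses exactly e of the r relevant parameters (drawn uniformly without replacement), each expressed parameter is ambiguous independently at rate a, and the waiting-STL adopts values only from sentences in which no expressed parameter is ambiguous. The expected number of inputs to convergence then comes from the fundamental matrix of the absorbing chain.

```python
# States t = 0..r count the parameters already set; state r is absorbing.
from math import comb
import numpy as np

def new_param_pmf(i, r, unset, e):
    """P(exactly i of the e expressed parameters are still unset): hypergeometric."""
    if i > unset or i > e or e - i > r - unset:
        return 0.0
    return comb(unset, i) * comb(r - unset, e - i) / comb(r, e)

def expected_inputs(r=20, e=10, a=0.5):
    """Expected sentences for the waiting-STL to set all r parameters (toy model)."""
    usable = (1.0 - a) ** e             # sentence has no ambiguous parameter
    Q = np.zeros((r, r))                # transitions among transient states
    for t in range(r):
        unset = r - t
        Q[t, t] += 1.0 - usable         # ambiguous somewhere: wait, no change
        for i in range(min(e, unset) + 1):
            if t + i < r:               # t + i == r would be absorption
                Q[t, t + i] += usable * new_param_pmf(i, r, unset, e)
    N = np.linalg.inv(np.eye(r) - Q)    # fundamental matrix: expected visits
    return N[0].sum()                   # expected steps starting from state 0

for a in (0.0, 0.25, 0.5, 0.75):
    print(f"a = {a:4.2f}:  {expected_inputs(r=20, e=10, a=a):12.1f} sentences")
```

Under these assumptions the ambiguity rate enters through the factor (1 - a)**e, which is where the exponential sensitivity reported on the next slides comes from.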
Results after Markov analysis:
• seemingly linear in # of parameters
• exponential in % ambiguity
Striking effect of ambiguity (r fixed). [Plot, logarithmic scale: 20 parameters to be set, 10 parameters expressed per input.]
Subtle effect of ambiguity on efficiency wrt r: as ambiguity increases, the cost of the Parametric Principle skyrockets as the domain scales up (r increases). [Plot, linear scale: x axis = # of parameters in the domain; 10 parameters expressed per input; labeled sentence counts range from 28 and 221 up to 9,878 and 9,772,740.]
The effect of ambiguity (interacting with e and r): how / where is the cost incurred? By far the greatest damage inflicted by ambiguity occurs at the very earliest stages of learning: the wait for the first fully unambiguous trigger, plus a small additional wait for sentences that express the last few parameters unambiguously.
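A back-of-the-envelope for that dominant term, again under the per-parameter-ambiguity assumption used in the sketch above (an assumption, not the paper's formula): if a sentence expressing e parameters is fully unambiguous with probability (1 - a)**e, the expected wait for the first such trigger is its reciprocal.

```python
# Expected wait for the first fully unambiguous trigger under the toy assumption.
e, a = 10, 0.8
print(f"{1 / (1 - a) ** e:,.0f}")   # ~9,765,625 sentences: the same order of
                                    # magnitude as the largest value labeled on
                                    # the "subtle effect of ambiguity" plot above
```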
[Plot, logarithmic scale: the logarithm of the expected number of sentences consumed by the waiting-STL in each state (closer and closer to convergence) after learning has started; e = 10, r = 30, and e′ = 0.2 (a′ = 0.8).]
STL — Bad News
Ambiguity is damaging even to a parametrically-principled learner. Abiding by the Parametric Principle does not, in and of itself, guarantee a merely linear increase in the complexity of the learning task as the number of parameters increases.
STL — Good News Part 1
The learning task might be manageable if there are at least some sentences with low expression to get learning off the ground.
[Figure: 'can learn' vs. 'can't learn' regions by parameters expressed per sentence.]
Fodor (1998); Sakas and Fodor (1998)
Add a distribution factor to the transition probabilities: the probability that i parameters are expressed by a sentence, given distribution D on input text I.
Average number of inputs consumed by the waiting-STL when the expression rate is not fixed per sentence: e varies uniformly from 0 to e_max.
Still exponential in % ambiguity, but manageable.
For comparison: e varying uniformly from 0 to 10 requires 430 sentences; e fixed at 5 requires 3,466 sentences.
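A small numerical illustration of why the uniform distribution helps, under the same simplifying per-parameter-ambiguity assumption as the earlier sketch (so these numbers are illustrative only and do not reproduce the 430 vs 3,466 figures above): the low-expression sentences keep the probability of a fully unambiguous input from collapsing.

```python
# Probability that a sentence is fully unambiguous, fixed e vs. e ~ Uniform{0..e_max},
# assuming each expressed parameter is ambiguous independently at rate a.
a, e_max = 0.5, 10

p_fixed = (1 - a) ** 5                                            # e fixed at 5
p_uniform = sum((1 - a) ** e for e in range(e_max + 1)) / (e_max + 1)
#             ^ distribution factor for uniform D: P(e = i) = 1 / (e_max + 1)

print(f"P(fully unambiguous), e fixed at 5:      {p_fixed:.3f}")   # ~0.031
print(f"P(fully unambiguous), e uniform on 0-10: {p_uniform:.3f}") # ~0.182
```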
The effect of ambiguity is still exponential, but not as bad as for fixed e. [Plot, logarithmic scale: r = 20, e uniformly distributed from 0 to 10.]
Effect of high ambiguity rates (varying rate of expression, uniformly distributed; a larger domain than in the previous tables): still exponential in a, but manageable.
STL — Good News Part 2
With a uniformly distributed expression rate, the cost of the Parametric Principle is linear (in r) and doesn't skyrocket. [Plot: linear scale.]
In summary: with a uniformly distributed expression rate, the number of sentences required by the STL falls in a manageable range (though still exponential in % ambiguity), and it increases only linearly as the number of parameters increases (even though the number of grammars increases exponentially).
Conjecture (roughly in the spirit of Schaffer, 1994):
• Algorithms may be extremely efficient in specific domains (a 'sweet spot') but not in others.
• This recommends: we have to know the specific facts about the distribution of ambiguity in natural language.
Research agenda: a three-fold approach to building a cognitive computational model of human language acquisition:
1) formulate a framework to determine what distributions of ambiguity make for feasible learning
2) conduct a psycholinguistic study to determine whether the facts of human (child-directed) language are in line with the conducive distributions
3) conduct a computer simulation to check for performance nuances and potential obstacles (e.g. local maxima based on defaults, or Subset Principle violations)