Modelling Language Evolution, Lecture 2: Learning Syntax. Simon Kirby, University of Edinburgh, Language Evolution & Computation Research Unit
Multi-layer networks • For many modelling problems, multi-layer networks are used • Three layers are common: • Input layer • Hidden layer • Output layer • What do the hidden-node activations correspond to? • Internal representation • For some problems, networks need to compute an “intermediate” representation of the data
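To make the three-layer picture concrete, here is a minimal sketch of a forward pass in numpy (not from the lecture; the layer sizes, random weights, and sigmoid activation are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: 2 input, 2 hidden, 1 output units
rng = np.random.default_rng(0)
W_ih = rng.normal(size=(2, 2))   # input -> hidden weights
b_h = np.zeros(2)                # hidden biases
W_ho = rng.normal(size=(2, 1))   # hidden -> output weights
b_o = np.zeros(1)                # output bias

def forward(x):
    hidden = sigmoid(x @ W_ih + b_h)       # the internal representation
    output = sigmoid(hidden @ W_ho + b_o)
    return hidden, output

hidden, output = forward(np.array([1.0, 0.0]))
```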
XOR network - step 1 • XOR is the same as OR but not AND • Calculate OR • Calculate NOT AND • AND the results • [Figure: an OR unit and a NOT AND unit feeding into an AND unit]
XOR network - step 2 • [Figure: the complete XOR network. HIDDEN 1 (NOT AND) receives weights -5 and -5 from INPUT 1 and INPUT 2, plus 7.5 from the BIAS NODE; HIDDEN 2 (OR) receives weights 10 and 10 from the inputs, plus -7.5 from the bias node; the OUTPUT (AND) unit receives weight 5 from each hidden unit, plus -7.5 from the bias node]
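The weight values in the figure can be checked directly. Here is a sketch using a hard threshold activation (the step function and the pairing of weights with particular hidden units are reconstructions from the slide):

```python
import numpy as np

def step(x):
    return float(x > 0)   # hard threshold activation

def xor_net(x1, x2):
    x = np.array([x1, x2], dtype=float)
    not_and = step(x @ np.array([-5.0, -5.0]) + 7.5)   # HIDDEN 1: NOT AND
    or_unit = step(x @ np.array([10.0, 10.0]) - 7.5)   # HIDDEN 2: OR
    return step(5.0 * not_and + 5.0 * or_unit - 7.5)   # OUTPUT: AND

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_net(a, b))   # prints the XOR truth table
```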
Simple example (Smith 2003) • Smith wanted to model a simple language-using population • Needed a model that learned vocabulary • 3 “meanings”: (1 0 0), (0 1 0), (0 0 1) • 6 possible signals: (0 0 0), (1 0 0), (1 1 0) … • Used networks for reception and production: the reception network maps SIGNAL to MEANING, the production network maps MEANING to SIGNAL; the networks are first trained, then perform • After training, knowledge of language stored in the weights • During reception/production, internal representation is in the activations of the hidden nodes
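A much-simplified sketch of the production direction only (this is not Smith's actual architecture or learning rule): a single-layer sigmoid network trained with the delta rule to map each meaning onto an arbitrary, illustrative signal:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

meanings = np.eye(3)                    # (1 0 0), (0 1 0), (0 0 1)
signals = np.array([[1., 0., 0.],       # an arbitrary illustrative vocabulary
                    [1., 1., 0.],
                    [0., 0., 1.]])

W = np.zeros((3, 3))                    # knowledge of language lives in W
for _ in range(200):                    # delta-rule training
    for m, s in zip(meanings, signals):
        out = sigmoid(m @ W)            # produce a signal for this meaning
        W += 0.5 * np.outer(m, s - out) # nudge weights toward the target

print(sigmoid(meanings @ W).round(2))   # close to the three target signals
```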
Can a network learn syntax? (Elman 1993) • How much knowledge of grammar are we born with? • Important question for the evolution of language: modelling can tell us what we can do without • Can we model the acquisition of syntax using a neural network? • One problem… sentences can be arbitrarily long
Representing time • Imagine we presented words one at a time to a network • Would it matter what order the words were given? • No: each word is a brand new experience • The net has no way of relating each experience to what has gone before • Needs some kind of working memory • Intuitively: each word needs to be presented along with what the network was thinking about when it heard the previous word
The Simple Recurrent Net (SRN) • At each time step, the input is: • a new experience • plus a copy of the hidden unit activations at the last time step • [Figure: Input and Context layers feed into the Hidden layer, which feeds the Output layer; copy-back connections copy the Hidden activations into the Context layer]
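A minimal sketch of a single SRN time step (illustrative sizes, untrained random weights); the essential move is copying the hidden activations back to serve as the next step's context:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid, n_out = 4, 8, 4                 # illustrative sizes
W_ih = rng.normal(0, 0.1, (n_in, n_hid))     # input -> hidden
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))    # context -> hidden
W_ho = rng.normal(0, 0.1, (n_hid, n_out))    # hidden -> output

def srn_step(x, context):
    hidden = np.tanh(x @ W_ih + context @ W_ch)
    output = hidden @ W_ho
    return output, hidden                    # hidden becomes the new context

context = np.zeros(n_hid)                    # empty memory at the start
for word in np.eye(n_in):                    # one word per time step
    output, context = srn_step(word, context)
```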
What inputs and outputs? • How do we force the network to learn syntactic relations? • Can we do it without an external “teacher”? • Answer: the next-word prediction task • Inputs: current word (and context) • Outputs: predicted next word • The error signal is implicit in the data (see the sketch below)
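With a hypothetical toy sentence, the input/target pairs fall straight out of the word sequence, which is why no external teacher is needed:

```python
sentence = ["boys", "who", "chase", "dogs", "see", "girls"]
pairs = list(zip(sentence[:-1], sentence[1:]))
# [('boys', 'who'), ('who', 'chase'), ('chase', 'dogs'), ...]
# Input: each word in turn; target: the word that actually follows it.
```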
Long-distance dependencies and hierarchy • Elman’s question: how much is innate? • Many argue: • Long-distance dependencies and hierarchical embedding are “unlearnable” without an innate language faculty • How well can an SRN learn them? • Examples: • boys who chase dogs see girls • cats chase dogs • dogs see boys who cats who mary feeds chase • mary walks
First experiments • Each word encoded as a single unit “on” in the input.
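A sketch of this localist (“one-hot”) encoding, with an illustrative vocabulary drawn from the example sentences (Elman's actual lexicon was larger):

```python
import numpy as np

vocab = ["boys", "girls", "cats", "dogs", "mary",
         "who", "chase", "see", "feeds", "walks"]  # illustrative vocabulary
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0        # exactly one unit "on" per word
    return v
```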
Initial results • How can we tell if the net has learned syntax? • Check whether it predicts the correct number agreement • Gets some things right, but makes many mistakes: boys who girl chase see dog • Seems not to have learned long-distance dependencies.
Incremental input • Elman tried teaching the network in stages • Five stages: • 10,000 simple sentences (x 5) • 7,500 simple + 2,500 complex (x 5) • 5,000 simple + 5,000 complex (x 5) • 2,500 simple + 7,500 complex (x 5) • 10,000 complex sentences (x 5) • Surprisingly, this training regime led to success! (see the sketch below)
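The regime can be written down as a schedule. In this structural sketch, make_sentences and the inner training step are hypothetical stand-ins for Elman's sentence generator and the SRN's next-word-prediction update:

```python
def make_sentences(n_simple, n_complex):
    # Hypothetical stand-in for Elman's sentence generator
    return ["simple"] * n_simple + ["complex"] * n_complex

stages = [(10000, 0), (7500, 2500), (5000, 5000), (2500, 7500), (0, 10000)]
for n_simple, n_complex in stages:
    corpus = make_sentences(n_simple, n_complex)
    for _ in range(5):              # each stage is presented five times
        for sentence in corpus:
            pass                    # train the SRN on next-word prediction
```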
Is this realistic? • Elman reasons that this is in some ways like children’s behaviour • Children seem to learn to produce simple sentences first • Is this a reasonable suggestion? • Where is the incremental input coming from? • Developmental schedule appears to be a product of changing the input.
Another route to incremental learning • Rather than the experimenter selecting simple, then complex sentences, could the network do the selecting itself? • Children’s data isn’t changing… children are changing • Elman gets the network to change throughout its “life” • What is a reasonable way for the network to change? • One possibility: memory
Reducing the attention span of a network • Destroy memory by setting context nodes to 0.5 • Five stages of learning (with both simple and complex sentences): • Memory blanked every 3-4 words (x 12) • Memory blanked every 4-5 words (x 5) • Memory blanked every 5-6 words (x 5) • Memory blanked every 6-7 words (x 5) • No memory limitations (x 5) • The network learned the task.
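A sketch of the memory-blanking idea for one stage (Elman's window alternates between 3 and 4 words; a fixed window of 3 is used here for simplicity, and the weights are untrained random values):

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, n_hid = 4, 8
W_ih = rng.normal(0, 0.1, (n_in, n_hid))            # input -> hidden
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))           # context -> hidden

words = rng.integers(0, n_in, size=20)              # a toy word stream
window = 3                                          # blank memory every 3 words
context = np.full(n_hid, 0.5)                       # the "blanked" state
for t, w in enumerate(words, start=1):
    x = np.eye(n_in)[w]
    context = np.tanh(x @ W_ih + context @ W_ch)    # normal SRN update
    if t % window == 0:
        context = np.full(n_hid, 0.5)               # destroy working memory
```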
Counter-intuitive conclusion: starting small • A fully-functioning network cannot learn syntax. • A network that is initially limited (but matures) learns well. • This seems a strange result, suggesting that networks aren’t good models of language learning after all • On the other hand… • Children mature during learning • Infancy in humans is prolonged relative to other species • Ultimate language ability seems to be related to how early learning starts • i.e., there is a critical period for language acquisition.
Next lecture • We’ve seen how we can model aspects of language learning in simulations • What about evolution? • [Figure: three interacting timescales: individual learning, cultural evolution, biological evolution]