Artificial Dreams: Lecture Series in Cognitive Science Danial Qaurooni-Fard Cognitive Science Group Amirkabir University of Technology Winter 2009.
Language Learning: Connectionist AI A Survey of Connectionist Models of Language Learning
Connectionism • Approach: Connectionism • Main theme: Neural-inspired networks • Example systems (language learning): • English past tense (Plunkett and Marchman 1991) • NETtalk (Sejnowski and Rosenberg 1987) • SRNs (Elman 1990) • LSA (Landauer, Foltz and Laham 1998) • Aphasia model (Dell 1997)
Language and AI • Language in Microworlds • Language as a Canned Product • Language as a Set of Subtasks
1st: Language in Microworlds • Limiting language to specific domains. • 1967: STUDENT • Solved simple algebraic problems. • “What’s 5 plus 4?” • 1972: SHRDLU • Simulation of a robotic hand that worked with colored geometrical objects. • “Find a block which is taller than the one that you are holding and put it into a box.” • 1997: Jupiter • Weather forecasting system. • “Is Boston cloudy today?”
2nd: Language as a Canned Product • Engage in “natural” conversation with a limited vocabulary that seems unlimited! • 1965: Eliza • Psychiatrist. • 2002: Claire • “Virtual service representative” for a telephone company. • “Let me get someone to help you!” • 1960s-Present: Translation • Inception: 1957 launch of Sputnik. • “The spirit is willing but the flesh is weak.” • 1960s: “The vodka is good but the meat is rotten.” • 2003: “The spirit is ready, but the meat is weak.”
3rd: Language as a Set of Subtasks • Break the problem into a set of subtasks like • speech processing • text reading • grammar acquisition … and the trainee will pick up patterns. • Roughly the tack of connectionists. • Why study language and connectionism? • Connectionism has fared well • But maybe such tasks as language and reasoning cannot be accomplished by associative methods alone • So maybe connectionists are unlikely to match the performance of classical models at explaining these higher-level cognitive abilities.
Connectionism • 1960s onward: Rule-and-symbol AI (e.g., CYC) • 1980s: PDP (Parallel Distributed Processing) • Neural networks: At least distantly inspired by the architecture of the brain. • Abstracted away: • Multiplicity of types of neurons and synapses • Use of temporal properties • Connectivity constraints • Move the vocabularies of the various sciences of the mind closer together.
Connectionism • Text-to-phoneme conversions: • DECtalk vs. NETtalk • Neither “understood” anything! • Connectionist models: Both a boon and a burden: • “Good at Frisbee, bad at logic” • Boon: motor control, face recognition, reading handwritten zip codes! • Burden: sequential reasoning, long-term planning, logic. • Substitutes pattern recognition for classical reasoning. • Humans ARE usually better at Frisbee than at logic.
A Connectionist at Work • Case of past tenses of English verbs. • Regular formation: stem + “ed” • Irregulars: • No change: hit >> hit • Vowel change: ring >> rang • Arbitrary: go >> went • Overregularization in children (“go” + “ed” >> “goed”) • U-shaped learning profile. • Nativists: rules and associative memory. • Language Acquisition Paradox • Universal Grammar. • Connectionists (Plunkett and Marchman 1991): Mimic the U-shaped learning curve.
Plunkett and Marchman (1991) • Standard feed-forward network • Maps a phonological representation of the stem to a phonological representation of the past tense • Initially: 10 regulars & 10 irregulars • Total: 500 stems, 90% regular • Final model successfully learned the 500 verbs in the training set • Architecture: 20 phonological input units → 30 hidden units → 20 phonological output units
What P&M Had To Do • Decide on a manner of breaking the domain into objects, features, etc. – in this case, into verb stems, suffixes, and inflections; • Decide on the encoding and presentation of the above to the network; • Design the architecture – that is, the number of layers, nodes, etc.; • Decide on the activation rule, the output function, the learning regimen, and so on; • Select an appropriate corpus of data – in this case, an effective combination of regular and irregular verbs; • Carefully control the order and frequency of the presentation of the verbs to the network; • Train the network on a sample of five hundred verbs; • Decide on the number of times (“epochs”) a set of input data should be presented to the network.
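To make this pipeline concrete, here is a minimal sketch of a feed-forward network with the same shape as P&M's (20 phonological input units, 30 hidden units, 20 output units), trained by backpropagation. The random "phonological" codes, the verb set, the learning rate, and the epoch count are placeholders, not P&M's actual encodings or training schedule.

```python
import numpy as np

# Sketch of a Plunkett & Marchman-style feed-forward network:
# 20 "phonological" input units -> 30 hidden units -> 20 output units.
# The random stem/past-tense vectors are placeholders for the original
# hand-crafted phonological encoding and controlled regular/irregular schedule.

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 20, 30, 20
n_verbs = 500                       # size of the final training set in P&M

stems = rng.integers(0, 2, size=(n_verbs, n_in)).astype(float)
pasts = rng.integers(0, 2, size=(n_verbs, n_out)).astype(float)

W1 = rng.normal(0, 0.5, size=(n_in, n_hid));  b1 = np.zeros(n_hid)
W2 = rng.normal(0, 0.5, size=(n_hid, n_out)); b2 = np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.1
for epoch in range(200):            # epoch count is arbitrary here
    # Forward pass over the whole verb set.
    h = sigmoid(stems @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Backpropagate the squared error (standard backprop).
    err = y - pasts
    d_out = err * y * (1 - y)
    d_hid = (d_out @ W2.T) * h * (1 - h)

    W2 -= lr * h.T @ d_out / n_verbs;     b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * stems.T @ d_hid / n_verbs; b1 -= lr * d_hid.mean(axis=0)

print("mean squared error:", float((err ** 2).mean()))
```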
Connectionism: Intuitions • Architecture-over-function • Decoupling • Learning • Pre-wiring • Neural Reductionism • Adequacy
Basic Intuitions A 1st Pass on Connectionism
Connectionism: Basic Intuitions • Architecture over function • Mimic human brains • Cognition is architecture-dependent • Architecture is primary and function is secondary. • U-shaped learning profile.
Connectionism: Basic Intuitions • Decoupling: Connected inside, disconnected from outside • Still representational, but of a distributed, implicit kind. • Eventually, certain groups of neurons behave as if they encode certain features. • Emergent Behavior • Constraint-satisfaction rather than goal-achievement • “Soft” constraints can be satisfied due to large degrees of freedom. • Survive “attacks”, “lesions” and “decay”.
Connectionism: Basic Intuitions • Learning • Useful distinctions have to be made. • Inputs are not passive! • The flat surface of a rock provides different “affordances” • “climbability” for a deer • “sittability” for a hiking human • “steppability” for the same human crossing a creek • “the primitive units [categories] are not input, nor are they built in as primitives.”
Learning • Learning is generalization. • “The capability for generalization in human beings crucially involves the ability to reperceive and rearrange things in novel ways.” • Flexibility requires a hierarchy. • Connectionists try to break up the homogeneity, add new layers and specialized modules. • 1990s: Add context layers.
Recurrent Networks A New Approach
Elman’s Recurrent Networks • Most linguistic behavior happens in time. • Classic connectionist models receive input all at once (e.g., Plunkett and Marchman’s past-tense learning model). • Recurrent networks take the internal state and copy it to the input, creating “memory”. • SRN: |input| = |original input| + |hidden layer|
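A minimal sketch of the copy-back mechanism that defines an SRN, assuming sigmoid units and illustrative layer sizes; the weights and the input sequence below are random placeholders rather than Elman's trained model.

```python
import numpy as np

# Elman-style simple recurrent network (SRN): at each time step the previous
# hidden state is copied into "context" units and fed back in alongside the
# current input. Sizes and data are illustrative only.

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 5, 20, 5

W_ih = rng.normal(0, 0.5, size=(n_in, n_hid))    # input   -> hidden
W_ch = rng.normal(0, 0.5, size=(n_hid, n_hid))   # context -> hidden
W_ho = rng.normal(0, 0.5, size=(n_hid, n_out))   # hidden  -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def run_sequence(inputs):
    """Feed a sequence of input vectors through the SRN, one step at a time."""
    context = np.zeros(n_hid)                    # context starts empty
    outputs = []
    for x in inputs:
        hidden = sigmoid(x @ W_ih + context @ W_ch)
        outputs.append(sigmoid(hidden @ W_ho))
        context = hidden.copy()                  # copy-back gives the net its "memory"
    return outputs

# Example: a random 10-step sequence of 5-bit vectors.
seq = rng.integers(0, 2, size=(10, n_in)).astype(float)
preds = run_sequence(seq)
print(len(preds), preds[0].round(2))
```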
Dissecting Language • Segmentation problem: How do children discover the atoms of language? • Words, morphemes, phonemes… • These atoms are NOT given. • Distinctions are often murky and unclear. • Dissecting language is a metalinguistic task.
Elman: Discovering “Word” • “Word”: chunks or clusters of letters. • Network structure: SRN with 5 input, 20 hidden, 5 output and 20 context units. • Input: bit vector of length 5. • Training: 200 sentences were generated and concatenated to form 1,270 words, or 4,963 letters. • “Many years ago a boy and a girl lived together” • Output: bit vector of length 5.
Discovering “Word” • After 10 epochs the network started to make predictions. • More errors at word boundaries than within words. • Can this possibly be mere co-occurrence statistics? • Still not a model of word acquisition. • A cue to the boundaries that define the units which must be learned.
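The boundary effect can be illustrated with a toy sketch: given per-letter prediction errors (the numbers below are made up for illustration, not Elman's measurements), word onsets tend to show up as local error peaks, because the first letter of a new word is hardest to predict.

```python
# Toy illustration of the boundary effect from per-letter prediction errors.

def boundary_candidates(errors, threshold=0.5):
    """Return interior positions where prediction error peaks above a threshold."""
    peaks = []
    for i in range(1, len(errors) - 1):
        if errors[i] > threshold and errors[i] > errors[i - 1] and errors[i] >= errors[i + 1]:
            peaks.append(i)
    return peaks

# Hypothetical error profile for the letters of "manyyearsago" (values invented).
errors = [0.9, 0.3, 0.2, 0.2, 0.8, 0.3, 0.2, 0.3, 0.2, 0.7, 0.3, 0.2]
# Prints [4, 9]: the onsets of "years" and "ago" (position 0, the onset of
# "many", falls outside the interior scan).
print(boundary_candidates(errors))
```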
Discovering “Lexical Classes” • Elman next considered the problem of discovering nouns, verbs, etc. • Network structure: 31 input, 150 hidden, 150 context and 31 output units. • Input: 31-bit vectors representing 29 words. • 10,000 two- and three-word sentences were generated. • Output: a 31-bit vector. • Different types of verbs and nouns were used: transitive/intransitive, perception/sensation, human/animate/inanimate…
Discovering “Lexical Classes” • After 6 epochs the output achieved the desired level of accuracy. • Hierarchical clustering analysis of the hidden-unit activations revealed a nested structure: • Broadest split: words denoting nouns vs. verbs. • Within nouns: words denoting animate vs. inanimate objects. • Within animates: words denoting humans vs. animals. • …
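A sketch of the analysis step, assuming SciPy is available: average each word's hidden-unit activation vectors and cluster the averages hierarchically. The activations below are random stand-ins; in the actual study they come from the trained SRN's hidden layer, and Elman's exact linkage method is not assumed here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
words = ["boy", "girl", "dog", "cat", "rock", "glass", "see", "chase", "break"]
hidden_size = 150

# Placeholder: one averaged 150-unit activation vector per word.
activations = rng.normal(size=(len(words), hidden_size))

# Agglomerative clustering of the word vectors (Ward linkage chosen here;
# not necessarily Elman's choice).
Z = linkage(activations, method="ward")
tree = dendrogram(Z, labels=words, no_plot=True)
print(tree["ivl"])   # leaf order of the resulting hierarchy
```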
Pre-wiring A 2nd Pass on Connectionism
Connectionism: Basic Intuitions • Pre-wiring: Advanced Tinkering • Experiment with embedded sentence structures (hierarchy again!) • “Boys who chase dogs see girls.” • Task: Predict the next word. • Tinkering with the Input: Incremental increase in input complexity. • Tinkering with the Network: Incremental increase in network complexity. • Allow the network to go through maturational changes. • In this case: Increase memory capacity.
Starting Small: Less Is More • Constrain the “solution space” • The learner deals with limited variance, e.g., variance in number, grammatical category, verb type, etc. • “The girl who the dogs that I chased down the block frightened ran away.” • This makes further learning easier (or even possible?).
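A schematic sketch of one way a "starting small" regimen could be implemented: rank sentences by a crude complexity proxy (embedding depth, counted via relative pronouns) and train in stages on progressively larger, harder subsets. The train_on callback and the complexity measure are hypothetical illustrations, not Elman's actual procedure.

```python
# Staged "starting small" training sketch (hypothetical implementation).

def embedding_depth(sentence):
    """Very rough proxy: count relative pronouns as embedding markers."""
    return sum(sentence.split().count(w) for w in ("who", "that", "which"))

def staged_training(corpus, train_on, n_stages=4, epochs_per_stage=5):
    ranked = sorted(corpus, key=embedding_depth)
    for stage in range(1, n_stages + 1):
        # Stage k trains on the easiest k/n_stages fraction of the corpus.
        cutoff = int(len(ranked) * stage / n_stages)
        subset = ranked[:cutoff]
        for _ in range(epochs_per_stage):
            for sentence in subset:
                train_on(sentence)

# Example usage with a dummy training routine that just records presentations.
corpus = [
    "boys see girls",
    "boys who chase dogs see girls",
    "the girl who the dogs that I chased frightened ran away",
]
calls = []
staged_training(corpus, train_on=calls.append, n_stages=3, epochs_per_stage=1)
print(len(calls), "training presentations")   # 6: easy sentences are seen most often
```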
Hidden Unit Space (3 of 70 dimensions) • The “learned” network partitions the state space such that certain spatial dimensions signal • Differences between nouns and verbs • Singular vs. plural • Depth of embedding • …
Starting Large • Rohde and Plaut (1999) • Reported opposite conclusions (!) with a similar task, input, and architecture to Elman’s. • Starting large: they employed “a more naturalistic language…through the addition of semantic constraints.” • Co-occurrence of certain verbs and nouns • Transitive verbs only act on certain objects. • No gradual increase either in complexity or capacity. • Starting small or large?
How To Evaluate? • Rohde and Plaut: • “there was significant advantage for starting with the full language.” • “we do not yet fully understand what led Elman to succeed in these simulations when we failed.” • Elman’s networks were not allowed enough training time. • Elman’s chosen learning parameters resulted in poor performance. • Given appropriate training parameters, an SRN can effectively learn without external preparation.
Nativist vs. Statistical Approaches • Frequency of occurrence or occurrence per se? • Chomsky’s competence/performance distinction. • Connectionists erase the distinction and lose the evaluation criteria with it! • Are connectionist models too idealized and abstract in terms of meaning and context? • Elman’s reply: the case of “man” and “zog”. • Starting Small/Large hypothesis vs. LAD
2 Approaches to Neural Networks • Symbolic Approximation (Implementational Connectionism): • Symbolic theories are roughly to connectionist models what classical mechanics is to quantum mechanics. • The former is a high-level compression of the latter. • Statistical Inference Machines: • Language as a bag of words: LSA.
Latent Semantic Analysis • Landauer, Foltz, and Laham (1998). • LSA provides “a method for determining the similarity of meaning of words and passages by analysis of large text corpora”. • The meaning of a word is “a kind of average of the meaning of all the passages in which it appears,” and the meaning of a passage is “a kind of average of the meaning of all the words it contains”.
LSA Processing Steps (figure): words × documents → word-by-context matrix → rank lowering → reconstruction matrix (‘concepts’)
LSA: Rank Lowering • The low-rank approximation may be preferred because the original matrix • May be too large to compute with • Is presumed noisy • Is overly sparse. • Thus it mitigates • Polysemy: components of polysemous words are added to the components of words that share the meaning. • Synonymy: expected to merge the dimensions of words associated with similar meanings.
LSA: A Bag of Words (figure: an example word-by-context matrix)
LSA: A Bag of Words (figure: the corresponding reconstruction matrix)
LSA: A Bag of Words • Based on constructing matrices that contain information about the correlations among words and passages. • The dot product of two term (row) vectors gives the correlation between those terms across the contexts; the matrix product XXᵀ contains all the term-term correlations (and XᵀX the context-context correlations). • Enter vectorial semantics.
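A tiny numeric illustration of the correlation claim, using a made-up 4×3 word-by-context matrix: X·Xᵀ compares terms (rows) and Xᵀ·X compares documents (columns).

```python
import numpy as np

# Made-up word-by-context counts, purely for illustration.
X = np.array([
    [2, 0, 1],   # "physician"
    [1, 0, 1],   # "patient"
    [0, 3, 0],   # "vodka"
    [0, 2, 1],   # "meat"
], dtype=float)

term_term = X @ X.T    # 4x4: dot products between term vectors
doc_doc = X.T @ X      # 3x3: dot products between document vectors
print(term_term[0, 1]) # "physician" vs. "patient": 2*1 + 0*0 + 1*1 = 3
print(doc_doc.shape)
```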
LSA: SVD • Singular Value Decomposition (SVD): Assume that there exists a decomposition of X such that U and V are orthogonal and Σ is a diagonal matrix: X = UΣVᵀ • (U: left singular vectors; Σ: singular values; V: right singular vectors)
LSA: SVD • When the k largest singular values and their corresponding left and right singular vectors are kept, a rank-k approximation Xₖ = UₖΣₖVₖᵀ of X is obtained. This approximation • has minimal reconstruction error among all rank-k matrices, and • creates a concept space.
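A minimal LSA-style sketch using NumPy's SVD: decompose a toy word-by-context matrix, keep the k largest singular values, form the rank-k reconstruction, and compare words by cosine similarity in the reduced "concept" space. The matrix and the choice of k are illustrative only.

```python
import numpy as np

# Toy word-by-context matrix: 5 words x 4 contexts (values invented).
X = np.array([
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 3, 0, 1],
    [0, 2, 1, 1],
    [1, 1, 0, 2],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2                                           # keep the 2 largest singular values
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # rank-k reconstruction of X
word_vectors = U[:, :k] * s[:k]                 # word coordinates in the concept space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("reconstruction error:", float(np.linalg.norm(X - X_k)))
print("similarity of word 0 and word 1:", cosine(word_vectors[0], word_vectors[1]))
```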
LSA Processing Steps (recap figure): words × documents → word-by-context matrix → reconstruction matrix (‘concepts’)
LSA: Applications • Semantic clustering: “physician”, “bedside” and “patient”. • Finding similar documents across languages. • Finding relations between terms. • Trivia: LSA scored 60% on a multiple-choice psychology comprehension test after educating itself.
Of Nuns, Sex and Content • LSA might be regarded as • A tool for text analysis. • A model for acquisition and representation of knowledge. • Worries about LSA: • Word order? • Context? • Landauer and Dumais (1997): “One might consider LSA’s maximal knowledge of the world to be analogous to a well-read nun’s knowledge of sex, a level of knowledge often deemed a sufficient basis for advising the young.”
Adequacy A 3rd Pass on Connectionism
Adequacy • Levelt (1989) recognizes three components of language production: • Formation: a “message”, a nonverbal representation, is formed. • Formulation: takes the “message” and turns it into linguistic form. • Articulation: movement of the articulatory organs for producing sounds.