Why you will remember this talk and a Neural Network would not...
Or: The Problem (and a Solution to) Catastrophic Interference in Neural Networks
Robert M. French
LEAD-CNRS UMR 5022, Dijon, France
Organization of this talk
• What is catastrophic forgetting and why does it occur?
• A dual-network technique using pseudopattern information transfer for overcoming it in multiple-pattern learning.
• What is “pseudopattern information transfer”?
• What are “reverberated” pseudopatterns?
• What about sequence-learning?
• What about learning multiple sequences?
• Applications, theoretical questions and future directions
The problem: NEURAL NETWORKS FORGET CATASTROPHICALLY
• Learning new information may completely destroy previously learned information. This makes sequential learning (i.e., learning one thing after another, the way humans learn) impossible.
• Can sequential learning be modeled so that:
i) new information does not interfere catastrophically with already-learned information, and
ii) previously learned items do not have to be kept around?
Barnes-Underwood (1959) forgetting paradigm
Subjects learn List A-B (non-word/word pairs) until they have all pairs learned:
pled – table
splog – book
bim – car
milt – bag
etc.
Then they learn List A-C (same non-words, different real words):
pled – rock
splog – cow
bim – square
milt – clock
etc.
The items (the non-words) are the same in both lists; only the target items (the real words) differ.
How artificial neural networks (backprop) do on this task
WHY does this occur?
Answer: Overlap of internal representations.
“Catastrophic forgetting is a direct consequence of the overlap of distributed representations and can be reduced by reducing this overlap.” (French, 1991)
How can we reduce this overlap of internal representations?
Answer:
• Two separate networks in continual interaction:
• one for long-term storage,
• one for immediate processing of new patterns.
(French, 1997; etc.)
Also, this seems to be the solution discovered by the brain: hippocampus and neocortex (McClelland, McNaughton, & O’Reilly, 1995).
Implementation
• We have implemented a “dual-network” system using coupled networks that completely solves this problem (French, 1997; Ans & Rousset, 1997, 2000; Ans, Rousset, French, & Musca, 2002, in press).
• These two separate networks exchange information by means of “reverberated pseudopatterns.”
Pseudopatterns? Reverberated pseudopatterns? What?
Pseudopatterns
• Assume a network-in-a-box learns a series of patterns produced by a function f(x).
• These original patterns are no longer available. How can you approximate f(x)?
Feed a random input into the network, e.g. 1 0 0 1 1, and collect the associated output, here 1 1 0.
This creates a pseudopattern:
1: 1 0 0 1 1 → 1 1 0
A large enough collection of these pseudopatterns:
1: 1 0 0 1 1 → 1 1 0
2: 1 1 0 0 0 → 0 1 1
3: 0 0 0 1 0 → 1 0 0
4: 0 1 1 1 1 → 0 0 0
etc.
will approximate the originally learned function.
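The pseudopattern idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `make_pseudopatterns` and `boxed_net` are hypothetical names, and `boxed_net` is an arbitrary fixed function standing in for the trained network-in-a-box, not a network from the talk.

```python
import random


def make_pseudopatterns(net, n_inputs, count, rng):
    """Probe `net` with random binary inputs and record what it answers."""
    patterns = []
    for _ in range(count):
        x = [rng.randint(0, 1) for _ in range(n_inputs)]  # random input
        patterns.append((x, net(x)))                      # (input, associated output)
    return patterns


def boxed_net(x):
    # Hypothetical stand-in for the trained network: five input bits, three
    # output bits. Chosen so that 1 0 0 1 1 maps to 1 1 0 as in the example
    # above; otherwise arbitrary.
    return [x[0] | x[1], x[3] & x[4], x[1] ^ x[2]]


# Collect a few pseudopatterns; a larger collection approximates the
# function the network has learned without access to the original patterns.
for x, y in make_pseudopatterns(boxed_net, n_inputs=5, count=4, rng=random.Random(0)):
    print(x, "->", y)
```

The only thing the generator needs from the network is the ability to run it forward, which is why the original training patterns are not required.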
Transferring information from Net 1 to Net 2 with pseudopatterns: a random input (1 0 0 1 1) is fed to Net 1, which produces the associated output (1 1 0). Net 2 is then given the same input (1 0 0 1 1), with Net 1's output (1 1 0) as its target.
Learning new information in Net 1 with pseudopatterns from Net 2. New pattern to learn: 0 0 1 1 1 → 0 1 0. Net 2 is given random inputs and generates pseudopatterns, e.g. 1 1 0 1 1 → 1 1 1 and 1 1 0 0 0 → 0 0 1, etc. Net 1 is trained on the new pattern (input 0 0 1 1 1, target 0 1 0) together with these pseudopatterns from Net 2.
This is how information is continually transferred between the two networks by means of pseudopatterns.
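A rough sketch of the two directions of transfer, under stated assumptions: the function names are hypothetical, each network is represented by a plain callable (forward pass) or a `train_step` callback (one weight update), and the actual backprop update is deliberately left out.

```python
import random


def pseudopattern(net, n_inputs, rng):
    """One pseudopattern: a random binary input and the net's response to it."""
    x = [rng.randint(0, 1) for _ in range(n_inputs)]
    return x, net(x)


def mixed_training_set(new_patterns, storage_net, n_inputs, n_pseudo, rng):
    """What Net 1 trains on: the new patterns plus pseudopatterns from Net 2."""
    batch = list(new_patterns)
    batch += [pseudopattern(storage_net, n_inputs, rng) for _ in range(n_pseudo)]
    rng.shuffle(batch)  # interleave new and old information
    return batch


def transfer(source_net, train_step, n_inputs, n_pseudo, rng):
    """Net 1 -> Net 2: train the receiving net on the source net's pseudopatterns."""
    for _ in range(n_pseudo):
        x, target = pseudopattern(source_net, n_inputs, rng)
        train_step(x, target)  # assumed to perform one weight update
```

For example, `mixed_training_set([([0, 0, 1, 1, 1], [0, 1, 0])], net2, 5, 2, rng)` would return the new pattern from the slide above mixed with two pseudopatterns generated by `net2`.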
Sequential Learning using the dual-network approach Sequential learning of 20 patterns – one after the other (French, 1997)
On to reverberated pseudopatterns... Even though the simple dual-network system (i.e., new learning in one network; long-term storage in the other) using simple pseudopatterns does eliminate catastrophic interference, we can do better using “reverberated” pseudopatterns.
Building a network that uses “reverberated” pseudopatterns. Start with a standard backpropagation network. [Figure: input layer, hidden layer, output layer.]
Add an autoassociator. [Figure: the same network with an autoassociative part added; input layer, hidden layer, output layer.]
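A new pattern to be learned, P: Input → Target, will be learned as shown below. [Figure: Input is presented at the input layer; the output side produces Target and, on the autoassociative side, Input.]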
We start with a random input î0, feed it through the network and collect the output on the autoassociative side of the network. This output is fed back into the input layer (“reverberated”) and, again, the output on the autoassociative side is collected. This is done R times.
After R reverberations, we associate the reverberated input and the “target” output. This forms the reverberated pseudopattern: reverberated input → “target” output.
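The reverberation loop can be sketched as follows. This is a minimal illustration, not the implementation from the cited papers. It assumes the network is a callable returning a pair (autoassociative output, target-side output), and it assumes the “target” is read off by one more pass on the reverberated input. `toy_net` is a hypothetical stand-in whose autoassociative side simply pushes each value toward 0 or 1.

```python
def reverberated_pseudopattern(net, x0, reverberations):
    """Feed the autoassociative output back into the input R times."""
    x = list(x0)
    for _ in range(reverberations):
        auto_out, _ = net(x)  # collect the autoassociative side
        x = auto_out          # "reverberate": feed it back in as input
    _, target_out = net(x)    # assumption: target read from the final input
    return x, target_out


def toy_net(x):
    # Hypothetical stand-in for a trained network with an autoassociator.
    auto = [v * v * (3 - 2 * v) for v in x]  # values drift toward 0 or 1
    target = [sum(x) / len(x)]               # arbitrary "target" side
    return auto, target


x_r, t = reverberated_pseudopattern(toy_net, [0.6, 0.3, 0.9], reverberations=10)
```

With `reverberations=0` the function falls back to an ordinary pseudopattern built directly from the starting input.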
Net 1: new-learning network. Net 2: storage network.
This dual-network approach using reverberated pseudopattern information transfer between the two networks effectively overcomes catastrophic interference in multiple-pattern learning.
But what about multiple-sequence learning?
• Elman networks are designed to learn sequences of patterns. But they forget catastrophically when they attempt to learn multiple sequences.
• Can we generalize the dual-network, reverberated pseudopattern technique to dual Elman networks and eliminate catastrophic interference in multiple-sequence learning?
Yes.
The Problem of Multiple-Sequence Learning
• Real cognition requires the ability to learn sequences of patterns (or actions). (This is why SRNs, i.e. Elman networks, were originally developed.)
• But learning sequences really means being able to learn multiple sequences without the most recently learned ones erasing the previously learned ones.
• Catastrophic interference is a serious problem for the sequential learning of individual patterns. It is far worse when multiple sequences of patterns have to be learned consecutively.
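Elman networks (a.k.a. Simple Recurrent Networks). [Figure: the standard input S(t) and the context units H(t-1) feed the hidden layer H(t), which produces the output S(t+1); the context units hold a copy of the hidden-unit activations from the previous time-step.] Learning a sequence S(1), S(2), …, S(n).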
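The copy-back mechanism of an Elman network can be illustrated with a forward pass only. The sketch below rests on several assumptions: the class name `ElmanNet` is hypothetical, weights are random and untrained, biases are omitted for brevity, and the context starts at zero.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ElmanNet:
    """Forward pass only; no learning is shown."""

    def __init__(self, n_in, n_hidden, n_out, rng):
        def matrix(rows, cols):
            return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
        self.n_hidden = n_hidden
        self.w_hidden = matrix(n_hidden, n_in + n_hidden)  # input + context -> hidden
        self.w_out = matrix(n_out, n_hidden)               # hidden -> output
        self.reset()

    def reset(self):
        self.context = [0.0] * self.n_hidden  # H(t-1) before the sequence starts

    def step(self, s_t):
        joint = list(s_t) + self.context      # S(t) together with H(t-1)
        hidden = [sigmoid(sum(w * v for w, v in zip(row, joint))) for row in self.w_hidden]
        output = [sigmoid(sum(w * v for w, v in zip(row, hidden))) for row in self.w_out]
        self.context = hidden                 # copy H(t) for the next time-step
        return output                         # the net's guess at S(t+1)


net = ElmanNet(n_in=3, n_hidden=4, n_out=3, rng=random.Random(0))
predictions = [net.step(s) for s in ([1, 0, 0], [0, 1, 0], [0, 0, 1])]
```

Because the context carries over between calls, presenting the same input twice in a row generally gives two different outputs, which is what lets the network distinguish positions in a sequence.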
A “Reverberated Simple Recurrent Network” (RSRN): an Elman network with an autoassociative part
RSRN technique for sequentially learning two sequences A(t) and B(t).
• Net 1 learns A(t) completely.
• Reverberated pseudopattern transfer to Net 2.
• Net 1 makes one weight-change pass through B(t).
• Net 2 generates a single “static” reverberated pseudopattern.
• Net 1 does one learning epoch on this pseudopattern from Net 2.
• Continue until Net 1 has learned B(t).
• Test how well Net 1 has retained A(t).
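The middle of this procedure (the alternation between sequence B and a single pseudopattern from Net 2) can be sketched as a control loop. Everything here is hypothetical scaffolding: the four callbacks stand for operations on the two networks that are not shown, and `max_epochs` is an assumed safety limit that does not come from the talk. Learning A(t), the initial transfer to Net 2, and the final retention test happen outside this function.

```python
def learn_second_sequence(net1_epoch_on_b, net2_make_pseudopattern,
                          net1_train_on, net1_has_learned_b, max_epochs=10000):
    """Alternate one pass through B(t) with one pseudopattern from Net 2."""
    for epoch in range(1, max_epochs + 1):
        net1_epoch_on_b()                    # one weight-change pass through B(t)
        pattern = net2_make_pseudopattern()  # a single "static" pseudopattern
        net1_train_on(pattern)               # one learning epoch on that pseudopattern
        if net1_has_learned_b():
            return epoch                     # number of epochs that were needed
    return None                              # B(t) not learned within the limit
```

Passing the operations in as callbacks keeps the schedule separate from the network code, so the same loop could drive either plain or reverberated pseudopatterns.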
Two sequences to be learned: A(0), A(1), …, A(10) and B(0), B(1), …, B(10).
Net 1 learns (completely) sequence A(0), A(1), …, A(10).
Transferring the learning to Net 2. Net 1 produces a pseudopattern, e.g. input 010110100110010 with output 1110010011010. The same input is presented to Net 2 (feedforward pass), Net 1's output serves as the teacher, and Net 2's weights are changed by backprop. Repeat for 10,000 pseudopatterns produced by Net 1.
Learning B(0), B(1), …, B(10) by Net 1
1. Net 1 does ONE learning epoch on sequence B(0), B(1), …, B(10).
2. Net 2 generates ONE pseudopattern.
3. Net 1 does one feedforward-backprop (FF-BP) pass on this pseudopattern from Net 2.
Continue until Net 1 has learned B(0), B(1), …, B(10).
Test method
• First, sequence A is completely learned by the network.
• Then sequence B is learned.
• During the course of learning, we monitor at regular intervals how much of sequence A has been forgotten by the network.
Normal Elman networks: Catastrophic forgetting (height of bars equals how much forgetting has occurred).
By 450 epochs sequence B has been completely learned. However, the SRN has, for all intents and purposes, completely forgotten the previously learned sequence A.
Recall performance for Sequence A in the Dual-RSRN model as Sequence B is being learned.
By 400 epochs, the second sequence B has been completely learned. The previously learned sequence A shows virtually no forgetting. Forgetting, not just catastrophic forgetting, of the previously learned sequence A has been completely overcome.
[Figure: % error on Sequence A. Normal Elman network: massive forgetting. Dual RSRN: no forgetting of Sequence A.]
Cognitive/Neurobiological plausibility?
• The brain, somehow, does not forget catastrophically.
• Separating new learning from previously learned information seems necessary.
• McClelland, McNaughton, & O’Reilly (1995) have suggested the hippocampal-neocortical separation may be Nature’s way of solving this problem.
• Pseudopattern transfer is not so far-fetched if we accept results that claim that neocortical memory consolidation is due, at least in part, to REM sleep.
Prediction of the model: “recall rebound.” Empirical data: recall rebound confirmed.
[Figure, two panels (2-network RSRN; Humans): old sequences (% correct, 0 to 100) against number of presentations of the new sequence (0, 1, 5, 10, 20).]
Examples of the "recall rebound" in the real world
• Learning a new language: initial drop in performance for the first language, followed by regaining of initial levels of performance.
• Learning a new piece of music
• Learning new motor activities, etc.
In case you missed it... What is so interesting about the RSRN procedure is that by means of a number of “static” input-output patterns (pseudopatterns), we can transfer sequential information into another network. In other words, a Sequence: A-B-C-D-B-E-C-F-G of actions, words, patterns, etc. can be transferred by means of a set of I/O patterns. OK, cute. But why is this so interesting?