74.419 Artificial Intelligence 2004

74.419 Artificial Intelligence 2004 - Neural Networks - • Neural Networks (NN) • basic processing units • general network architectures • learning • qualities and problems of NNs

Neural Networks – Central Concepts biologically inspired • McCulloch-Pitts Neuron (automata theory), Perceptron basic architecture • units with activation state, • directed weighted connections between units • "activation spreading", output used as input to connected units basic processing in unit • integrated input: sum of weighted outputs of connected “pre-units” • activation of unit = function of integrated input • output depends on input/activation state • activation function or output function often threshold dependent, also sigmoid (differentiable for backprop!) or linear
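The basic processing in a unit can be sketched in a few lines. This is a minimal illustration, assuming a McCulloch-Pitts style threshold unit; the function names and the weight and threshold values are hypothetical and not taken from the slides.

```python
def integrated_input(pre_outputs, weights):
    """Integrated input: sum of the weighted outputs of the connected pre-units."""
    return sum(w * o for w, o in zip(weights, pre_outputs))


def threshold_unit(pre_outputs, weights, threshold):
    """Output 1 if the integrated input reaches the threshold, otherwise 0."""
    return 1 if integrated_input(pre_outputs, weights) >= threshold else 0


# Assumed example: with weights (1, 1) and threshold 2 the unit computes logical AND.
and_table = {
    (a, b): threshold_unit([a, b], [1, 1], threshold=2)
    for a in (0, 1)
    for b in (0, 1)
}
```

Lowering the threshold to 1 would turn the same unit into a logical OR, which shows how the behaviour of a unit is determined by its weights and threshold alone.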

Anatomy of a Neuron

Diagram of an Action Potential From: Ana Adelstein, Introduction to the Nervous System, Part I http://www.ualberta.ca/~anaa/PSYCHO377/PSYCH377Lectures/L02Psych377/

General Neural Network Model • Network of simple processing units (neurons) • Units connected by weighted links (labelled di-graph; connection matrix)

Neuron Model as FSA

NN - Activation Functions Sigmoid Activation Function Threshold Activation Function (Step Function) adapted from Thomas Riga, University of Genoa, Italy http://www.hku.nl/~pieter/EDU/neuro/doc/ThomasRiga/ann.htm#pdp
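The two activation functions named on this slide could be written as follows. This is a sketch using the standard logistic sigmoid; the default threshold of 0 is an assumption. The derivative is included because backpropagation needs it.

```python
import math


def step(x, threshold=0.0):
    """Threshold (step) activation: 1 at or above the threshold, else 0."""
    return 1.0 if x >= threshold else 0.0


def sigmoid(x):
    """Logistic sigmoid: a smooth, differentiable version of the step function."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_derivative(x):
    """Derivative of the sigmoid, g'(x) = g(x) * (1 - g(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)
```

The step function has derivative zero everywhere except at the threshold, where it is undefined, so it gives gradient-based learning nothing to work with. That is why the sigmoid is preferred for backpropagation.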

Parallelism – Competing Rules

NN Architectures + Function Feedforward, layered networks • simple pattern classification, function estimation Recurrent networks • for space/time-variant input (e.g. natural language) Completely connected networks • Boltzmann Machine, Hopfield Network • optimization; constraint satisfaction Self-Organizing Networks • SOMs, Kohonen networks, winner-take-all (WTA) networks • unsupervised development of classification • best-fitting weight vector slowly adapted to input vector

NN Architectures + Function Feedforward networks • layers of uni-directionally connected units • strict forward processing from input to output units • simple pattern classification, function estimating, decoder, control systems Recurrent networks • Feedforward network with internal feedback (context memory) • processing of space/time-variant input, e.g. natural language • e.g. Elman networks
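The difference between strict forward processing and internal feedback can be illustrated with a small Elman-style step. This is a sketch under assumptions: the weights are arbitrary illustrative values, and `layer`, `elman_step` and `run` are hypothetical helper names.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weight_rows):
    """One feedforward layer: each row holds the incoming weights of one unit."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weight_rows]


def elman_step(x, context, w_hidden, w_out):
    """Hidden units see the current input plus the previous hidden state (context)."""
    hidden = layer(x + context, w_hidden)
    output = layer(hidden, w_out)
    return output, hidden  # the new hidden state becomes the next context


# Assumed sizes: 1 input unit, 2 hidden units (and 2 context units), 1 output unit.
w_hidden = [[1.0, 0.5, -0.5], [-1.0, 0.5, 0.5]]
w_out = [[1.0, -1.0]]


def run(sequence):
    context = [0.0, 0.0]
    outputs = []
    for x in sequence:
        out, context = elman_step([x], context, w_hidden, w_out)
        outputs.append(out[0])
    return outputs


after_one = run([1.0, 0.0])   # second input is 0.0, preceded by 1.0
after_zero = run([0.0, 0.0])  # second input is 0.0, preceded by 0.0
```

In both runs the second input is the same, yet the second output differs because the context units carry information about the earlier input. A purely feedforward network would produce identical outputs for identical inputs.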

Feed-forward Network Haykin, Simon: Neural Networks - A Comprehensive Foundation, Prentice-Hall, 1999, p. 22.

NN Architectures + Function Completely connected networks • all units bi-directionally connected • positive weight → positive association between units; units support each other, are compatible • optimization; constraint satisfaction • Boltzmann Machine, Hopfield Network Self-Organizing Networks • SOMs, Kohonen networks, also winner-take-all (WTA) networks • best-fitting weight vector slowly adapts to input vector • unsupervised learning of classification
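A small Hopfield-style network illustrates how positive weights make compatible units support each other. The sketch assumes bipolar units (+1/-1), Hebbian storage of a single pattern, and asynchronous updates; none of these details are specified on the slides, and the function names are hypothetical.

```python
def train_hopfield(patterns):
    """Symmetric weights: units that agree in a stored pattern get a positive weight."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    w[i][j] += p[i] * p[j]
    return w


def recall(w, state, sweeps=5):
    """Update units one at a time until the state settles (fixed number of sweeps)."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            total = sum(w[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if total >= 0 else -1
    return state


stored = [1, -1, 1, -1, 1, -1]
w = train_hopfield([stored])
noisy = [1, -1, 1, -1, -1, -1]  # one unit flipped
restored = recall(w, noisy)
```

The flipped unit is pulled back by the units it is positively associated with, so the network settles into the stored pattern. This is the sense in which such networks perform constraint satisfaction.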

Neural Networks - Learning Learning = changing connection weights. Adjusting the connection weights in the network changes its input-output behaviour, so that it reacts “properly” to input patterns • supervised = network is told about the “correct” answer = teaching input; e.g. backpropagation, reinforcement learning • unsupervised = network has to find the correct output (usually a classification of input patterns) on its own; e.g. competitive learning, winner-take-all networks, self-organizing or Kohonen maps

Backpropagation - Schematic Representation The input is processed in a forward pass. Then the error is determined at the output units and propagated back through the network towards the input units. adapted from Thomas Riga, University of Genoa, Italy http://www.hku.nl/~pieter/EDU/neuro/doc/ThomasRiga/ann.htm#pdp

Backpropagation Learning Backpropagation learning is supervised. The correct input-output relation is known for some pattern samples; take some of these patterns for training: calculate the error between produced output and correct output, propagate the error back from output to input units, and adjust the weights. After training, perform tests with known I/O patterns. Then use the network with unknown input patterns. Idea behind the Backpropagation Rule (next slides): Determine the error for output units (compare produced output with the 'teaching input' = correct or wanted output). Adjust weights based on error, activation state, and current weights. Determine the error for internal units based on the derivative of the activation function. Adjust weights for internal units using the error function, using an adapted delta rule.

NN-Learning as Optimization Learning: adjust the network in order to adapt its input-output behaviour so that it reacts “properly” to input patterns. Learning as optimization process: find the parameter setting for the network (in particular the weights) which determines the network that produces the best-fitting behaviour (input-output relation) → minimize error in I/O behaviour → optimize weight setting w.r.t. the error function → find minimum in the error surface over different weight settings. Backpropagation implements a gradient descent search for the correct weight setting (the method is not optimal: it can get stuck in local minima). Statistical models (which include a stochastic parameter) allow for “jumps” out of local minima (cf. Hopfield Neuron with probabilistic activation function, Thermodynamic Models with temperature parameter, Simulated Annealing). Genetic Algorithms can be used to determine the parameter setting of a Neural Network.

Backpropagation - Delta Rule
The error is calculated as $err_i = t_i - y_i$, where $t_i$ is the teaching input (the correct or wanted output) and $y_i$ is the produced output. Note: In the textbook it is called $(T_i - O_i)$.
Backpropagation or delta rule: $w_{j,i} \leftarrow w_{j,i} + \alpha \cdot a_j \cdot \delta_i$, where $\alpha$ is a constant, the learning rate, $a_j$ is the activation of unit $u_j$, and $\delta_i$ is the backpropagated error.
$\delta_i = err_i \cdot g'$ for units in the output layer
$\delta_j = g'(x_j) \cdot \sum_i w_{j,i} \cdot \delta_i$ for internal (hidden) units
where $g'$ is the derivative of the activation function $g$. Then $w_{k,j} \leftarrow w_{k,j} + \alpha \cdot x_k \cdot \delta_j$
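One update of the rule on this slide can be sketched for a network with a single hidden layer. This is an illustration under assumptions: sigmoid activations, no bias units, and a hypothetical function name, weight layout and initial weights.

```python
import math


def g(x):
    return 1.0 / (1.0 + math.exp(-x))


def g_prime(x):
    s = g(x)
    return s * (1.0 - s)


def backprop_step(x, t, w_hidden, w_out, alpha=0.5):
    """One weight update for one pattern; returns the error before the update.

    w_hidden[k][j]: weight from input k to hidden unit j
    w_out[j][i]:    weight from hidden unit j to output unit i
    """
    n_in, n_hid, n_out = len(x), len(w_out), len(t)
    # Forward pass
    in_hid = [sum(w_hidden[k][j] * x[k] for k in range(n_in)) for j in range(n_hid)]
    a = [g(v) for v in in_hid]
    in_out = [sum(w_out[j][i] * a[j] for j in range(n_hid)) for i in range(n_out)]
    y = [g(v) for v in in_out]
    # Output units: delta_i = err_i * g'(integrated input of unit i)
    delta_out = [(t[i] - y[i]) * g_prime(in_out[i]) for i in range(n_out)]
    # Hidden units: delta_j = g'(in_j) * sum_i w_{j,i} * delta_i (old weights)
    delta_hid = [
        g_prime(in_hid[j]) * sum(w_out[j][i] * delta_out[i] for i in range(n_out))
        for j in range(n_hid)
    ]
    # Weight updates: w <- w + alpha * activation of pre-unit * delta of post-unit
    for j in range(n_hid):
        for i in range(n_out):
            w_out[j][i] += alpha * a[j] * delta_out[i]
    for k in range(n_in):
        for j in range(n_hid):
            w_hidden[k][j] += alpha * x[k] * delta_hid[j]
    return 0.5 * sum((t[i] - y[i]) ** 2 for i in range(n_out))


# Assumed toy setup: 2 inputs, 2 hidden units, 1 output, one training pattern.
w_hidden = [[0.1, -0.2], [0.3, 0.4]]
w_out = [[0.2], [-0.1]]
errors = [backprop_step([1.0, 0.0], [1.0], w_hidden, w_out) for _ in range(50)]
```

Repeating the step on the same pattern drives the error down, so `errors[-1]` is smaller than `errors[0]`. The hidden deltas are computed before the output weights are changed, so both layers are updated from the same forward pass.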

Backpropagation as Error Minimization
Find the minimum of the error function $E = \frac{1}{2} \sum_i (t_i - y_i)^2$
Make the weights explicit by substituting the output term $y_i$ with $g(\sum_j w_{j,i} \cdot a_j)$, the activation function applied to the sum of weighted outputs of the pre-neurons: $E(W) = \frac{1}{2} \sum_i (t_i - g(\sum_j w_{j,i} \cdot a_j))^2$, where $W$ is the complete weight matrix for the net.
Determine the derivative of the error function (the gradient) w.r.t. a single weight $w_{k,j}$: $\partial E / \partial w_{k,j} = -x_k \cdot \delta_j$, where $\delta_j$ is the backpropagated error of unit $j$.
To minimize the error, move against the gradient, i.e. in the direction $+x_k \cdot \delta_j$. With learning rate $\alpha$ this yields the backpropagation or delta rule: $w_{k,j} \leftarrow w_{k,j} + \alpha \cdot x_k \cdot \delta_j$
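For a weight into an output unit, the gradient step can be spelled out with the chain rule. The derivation below is a sketch. It introduces $\mathit{in}_i$ for the integrated input of output unit $i$ (a symbol assumed here, not used on the slides), writes $\alpha$ for the learning rate, and takes $\delta_i = (t_i - y_i)\, g'(\mathit{in}_i)$ as the error term of an output unit.

```latex
\begin{align*}
E(W) &= \tfrac{1}{2}\sum_i \bigl(t_i - g(\mathit{in}_i)\bigr)^2,
  \qquad \mathit{in}_i = \sum_j w_{j,i}\, a_j \\
\frac{\partial E}{\partial w_{j,i}}
  &= -\bigl(t_i - y_i\bigr)\, g'(\mathit{in}_i)\,
     \frac{\partial\, \mathit{in}_i}{\partial w_{j,i}} \\
  &= -\bigl(t_i - y_i\bigr)\, g'(\mathit{in}_i)\, a_j
   \;=\; -a_j\, \delta_i \\
w_{j,i} &\leftarrow w_{j,i} - \alpha\, \frac{\partial E}{\partial w_{j,i}}
   \;=\; w_{j,i} + \alpha\, a_j\, \delta_i
\end{align*}
```

Only the term for output unit $i$ survives in the sum because $w_{j,i}$ feeds into no other output unit. For hidden-unit weights the same argument applies one layer further back, which is where the sum over the output errors comes from.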

Implementation of Backprop-Learning • Choose description of input and output patterns which is suitable for the task. • Determine test set and training set (disjoint sets) • Do – in general thousands of – training runs (with various patterns) until parameters of the NN converge. • The training goes several times through the different pattern classes (outputs), either one class at a time or one pattern from each class at a time. • Measure performance of the network for test data (determine error – wrong vs. right reaction of NN) • re-train if necessary
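The procedure above (disjoint training and test sets, many training runs, then measuring performance on test data) can be sketched as follows. To keep the example short it trains a single sigmoid unit with the delta rule instead of a full backpropagation network. The task, the sample sizes and all names are assumptions made for illustration.

```python
import math
import random


def predict(w, x):
    """Output of a single sigmoid unit."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))


def make_patterns(n, rng):
    """Hypothetical task: class 1 if x1 + x2 > 1, else class 0 (third input is a bias of 1)."""
    patterns = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        patterns.append(([x1, x2, 1.0], 1.0 if x1 + x2 > 1.0 else 0.0))
    return patterns


rng = random.Random(0)
patterns = make_patterns(200, rng)
training_set, test_set = patterns[:150], patterns[150:]  # disjoint sets

w = [0.0, 0.0, 0.0]
alpha = 0.5
for epoch in range(500):  # many training runs over the training patterns
    rng.shuffle(training_set)
    for x, t in training_set:
        y = predict(w, x)
        delta = (t - y) * y * (1.0 - y)  # err * g'(integrated input)
        w = [wi + alpha * xi * delta for wi, xi in zip(w, x)]


def error_rate(data):
    """Fraction of patterns for which the unit's reaction is wrong."""
    wrong = sum(1 for x, t in data if (predict(w, x) > 0.5) != (t > 0.5))
    return wrong / len(data)


training_error = error_rate(training_set)
test_error = error_rate(test_set)
```

Comparing `training_error` with `test_error` shows whether the network generalizes to unseen patterns. A large gap between the two would be a reason to re-train, for example with more or different training patterns.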

Competitive Learning 1 Competitive Learning is unsupervised. Discovers classes in the set of input patterns. Classes are determined by similarity of inputs. Determines (output) unit which responds to all sample inputs of the same class. Unit reacts to patterns which are similar and thus represents this class. Different classes are represented by different units. The system can thus - after learning - be used for classification.

Competitive Learning 2 Units specialize to recognize pattern classes. The unit which responds strongest (among all units) to the current input moves its weight vector towards the input vector (use e.g. Euclidean distance): • reduce weight on inactive lines, raise weight on active lines • all other units keep or reduce their weights (often a Gaussian curve is used to determine which units change their weights and how) Winning units (their weight vectors) represent a prototype of the class they recognize.
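A plain winner-take-all version of this rule can be sketched as follows. Only the winning unit is updated, so the Gaussian neighbourhood mentioned above is left out. The data, initial weights, learning rate and function names are assumptions.

```python
import math
import random


def distance(a, b):
    """Euclidean distance between a weight vector and an input vector."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def winner(units, x):
    """Index of the unit whose weight vector is closest to the input."""
    return min(range(len(units)), key=lambda i: distance(units[i], x))


def train(units, inputs, rate=0.1, epochs=20):
    """Move the winner's weight vector a small step towards each input."""
    for _ in range(epochs):
        for x in inputs:
            i = winner(units, x)
            units[i] = [w + rate * (xi - w) for w, xi in zip(units[i], x)]
    return units


# Assumed data: two clusters of 2-D inputs, around (0, 0) and around (1, 1).
rng = random.Random(1)
cluster_a = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(30)]
cluster_b = [[rng.gauss(1.0, 0.1), rng.gauss(1.0, 0.1)] for _ in range(30)]
inputs = cluster_a + cluster_b
rng.shuffle(inputs)

units = train([[0.4, 0.4], [0.6, 0.6]], inputs)
```

After training, each weight vector lies near the centre of one cluster and acts as the prototype of that class, so `winner(units, x)` can be used to classify new inputs without any teaching input having been given.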

Competitive Learning - Figure from Haykin, Simon: Neural Networks, Prentice-Hall, 1999, p. 60

Example: NetTalk (from 319) • Terry Sejnowski of Johns Hopkins developed a system that can pronounce words of text • The system consists of a backpropagation network with 203 input units (29 text characters, 7 characters at a time), 80 hidden units, and 26 output units • The system was developed over a year • The DEC-talk system consists of hand-coded linguistic rules for speech pronunciation • developed over approximately 10 years • DEC-talk outperforms NETtalk but DEC-talk required significantly more development time

NetTalk (from 319) • "This exemplifies the utility of neural networks; they are easy to construct and can be used even when a problem is not fully understood. However, rule-based algorithms usually out-perform neural networks when enough understanding is available" • Hertz, Krogh & Palmer, Introduction to the Theory of Neural Computation, p. 133

NETtalk - General • Feedforward network architecture • NETtalk used text as input • Text was moved over input units ("window") → split text into fixed length input with some overlap between adjacent text windows • Output represents controls for Speech Generator • Training through backpropagation • Training Patterns from human-made phonetic transcripts
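The moving input window can be sketched as below, using the figures given earlier for NETtalk (7 characters at a time, 29 possible characters, hence 203 input units). The concrete symbol set, the padding with blanks, and the step of one character are assumptions; the function names are hypothetical.

```python
ALPHABET = "abcdefghijklmnopqrstuvwxyz" + " .,"  # 29 symbols (assumed symbol set)
WINDOW = 7


def encode_window(window):
    """One-hot encode each character: 7 characters * 29 units = 203 input values."""
    vector = []
    for ch in window:
        one_hot = [0] * len(ALPHABET)
        one_hot[ALPHABET.index(ch)] = 1
        vector.extend(one_hot)
    return vector


def windows(text, size=WINDOW):
    """Slide a fixed-size window over the text, one character per step."""
    pad = " " * (size // 2)
    padded = pad + text + pad
    return [padded[i:i + size] for i in range(len(text))]


text = "hello world"
text_windows = windows(text)
inputs = [encode_window(w) for w in text_windows]
```

Each window is centred on one character of the text, and adjacent windows overlap in six of their seven characters. The surrounding characters give the network the context it needs to decide how the centre character is pronounced.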

NETtalk - Processing Unit

NETtalk - Network Architecture

NETtalk - Some Articulatory Features (Output)

NN - Caveats 1 often 3 layers necessary • Perceptron, Minsky&Papert’s analysis • linearly separable pattern classes position dependence • visual pattern recognition can depend on position of pattern in input layer / matrix • introduce feature vectors (pre-analysis yields features of patterns; features input to NN) time- and space invariance • patterns may be stretched / squeezed in space / time dimension (visual objects, speech)

NN - Caveats 2 Recursive structures and functions • not directly representable due to fixed architecture (fixed size) • move window of input units over input (which is larger than input window) • store information in hidden units ("context memory") and feedback into input layer • use hybrid model Variable binding and value assignment • simulation possible through simultaneously active, synchronized units (cf. Lokendra Shastri)

Additional References Haykin, Simon: Neural Networks – A Comprehensive Foundation, Prentice-Hall, 1995. Rumelhart, McClelland & The PDP Research Group: Parallel Distributed Processing. Explorations in the Microstructure of Cognition, The MIT Press, 1986.

Neural Networks Web Pages The neuroinformatics Site (incl. Software etc.) http://www.neuroinf.org/ Neural Networks incl. Software Repository at CBIIS (Connectionist-Based Intelligent Information Systems), University of Otago, New Zealand http://divcom.otago.ac.nz/infosci/kel/CBIIS.html Kohonen Feature Map - Demo http://rfhs8012.fh-regensburg.de/~saj39122/begrolu/kohonen.html

Neurophysiology / Neurobiology Web Pages Animated diagram of an Action Potential (Neuroscience for Kids - featuring the giant axon of the squid) http://faculty.washington.edu/chudler/ap.html Adult-level explanation of the processes involved in information transmission at the cell level (with diagrams but no animation) http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/E/ExcitableCells.html Similar to the above but with animation and partly in Spanish http://www.epub.org.br/cm/n10/fundamentos/pot2_i.htm

Neurophysiology / Neurobiology Web Pages Kandel's Nobel Lecture "Molecular Biology of Memory Storage: A Dialogue Between Genes and Synapses," December 8, 2000 http://www.nobel.se/medicine/laureates/2000/kandel-lecture.html The Molecular Sciences Institute, Berkeley http://www.molsci.org/Dispatch The Salk Institute for Biological Studies, San Diego http://www.salk.edu/
