From Biological to Artificial Neural Networks. NPCDS/MITACS Spring School, Montreal, May 23-27, 2006. Helmut Kroger, Laval University. Outline: 1. From biology to artificial NNs 2. Perceptron – supervised learning 3. Hopfield model – associative memory 4. Kohonen map – self organization.
Use of NNs: neural networks are for • Applications: character recognition, optimization, data mining, … • Science: neuroscience, physics, mathematics, computer science, …
Biological neural networks. Nerve cells are called neurons. Many different types exist. Neurons are extremely complex. • Approx. 10^11 neurons in the brain. • Each neuron has about 10^3 connections.
Neural communication Neurons transport information via electrical action potentials. At the synapse the transmission is mediated by chemical macromolecules (neurotransmitter proteins).
Biology of neurons: • Single neurons are highly complex electrochemical devices • Many forms of interneuron communication now known – acting over many different spatial and temporal scales • Local • Gaseous • Volume signaling etc.
The neuron is a computer. (Figure: neuron with labels input, hillock, output.)
From biology of brain to information processing. • (1) Information is distributed. • (2) Brain works in deterministic mode as well as in stochastic mode (generation of nerve signals, synaptic transmission). • (3) Associative memory: Retrieval not by checking bit by bit, but by gradual reproduction of the whole.
(4) Architecture of brain: - optimizes information transmission. - stable against errors - physical constraints: oxygen, blood, energy, cooling. - Small World Architecture. • (5) 1/f frequency scaling in EEG: fractals, self-similarity. Dynamical origin: Model of self-organized criticality (sand pile analogue). Neural avalanches – do they transmit information? Does the brain work at some critical point? • (6) How is neural connectivity generated? Nerve growth: random process, guided by chemical markers, enhanced by stochastic resonance.
Artificial Neural Networks A network with interactions: An attempt to mimic the brain: • Unit elements are artificial neurons (linear or nonlinear input-output unit). • Communications are encoded by weights, measure of how strong neurons affect each other. • Architectures can be feed-forward, feedback or recurrent.
Firing of a group of neurons: biology vs model. (Figure: inputs x1 … xn with weights w1 … wn; w: synaptic strength.)
Biological motivation for multi-layer feed-forward network: Modeled after the visual cortex, which is multi-layered (6 layers) and dominantly feedforward. There is strong lateral inhibition: Neurons in the same layer don’t talk to each other. High connectivity between neurons (10^4 per neuron) provides the basis for massively parallel computing. High redundancy against errors.
The principal building blocks: • Input vector x_k • Weight matrix w_ik • Activation function f • Output vector y_i • Dynamical update rule: y_i = f(Σ_k w_ik x_k) • Goal: Find weights and activation function such that for a given input the output is close to the desired output • (Training supervised by teacher)
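A minimal sketch of one update step for a single layer, assuming the usual form in which each output is the activation applied to the weighted sum of the inputs. The logistic activation, the helper name `layer_output` and the example numbers are illustrative choices, not taken from the slides.

```python
import math

def logistic(a):
    # One possible nonlinear activation function f (an assumption here).
    return 1.0 / (1.0 + math.exp(-a))

def layer_output(weights, inputs, activation):
    """Return y_i = f(sum_k w_ik * x_k) for every output unit i."""
    return [
        activation(sum(w_ik * x_k for w_ik, x_k in zip(row, inputs)))
        for row in weights
    ]

# Two inputs, two outputs; row i of W holds the weights into output i.
W = [[0.5, -0.5],
     [1.0, 1.0]]
x = [1.0, 1.0]
y = layer_output(W, x, logistic)
```

Training then amounts to adjusting `W` until `y` is close to the desired output for each training input.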
The Perceptron (1962): A simple feed-forward network: 1 input layer, 1 output layer. • The simplest possible artificial neural network able to do a calculation consists of two inputs and one output. It can be used to classify patterns or to perform basic logical operations such as AND, OR and NOT.
Truth table for logical AND (inputs x, y; output x & y):
x  y  x & y
0  0  0
0  1  0
1  0  0
1  1  1
(Figure: inputs x and y, each with weight 1, and a constant input −1 with weight 1.5; the sum is x+y−1.5 and the output is the step function of that sum.)
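The AND unit on this slide can be checked directly. The sketch below uses the weights 1 and 1 and the offset 1.5 from the slide; the 0/1 step output and the function names are assumptions made for illustration.

```python
def step(a):
    # Threshold activation: fire (1) only when the sum is positive.
    return 1 if a > 0 else 0

def and_unit(x, y):
    # Weighted sum x + y - 1.5, as in the slide's diagram.
    return step(1 * x + 1 * y - 1.5)

# Rebuild the truth table row by row.
table = [(x, y, and_unit(x, y)) for x in (0, 1) for y in (0, 1)]
for row in table:
    print(row)
```

Only the input (1, 1) pushes the sum above zero, so the unit reproduces logical AND.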
Perceptron learning algorithm. Training any unit consists of adjusting the weight vector and threshold so that the desired classification is performed. Algorithm (inspired by Hebb’s rule): For each training vector x evaluate the output y. Take the difference between the desired output t and the current output y. Update the weights: w_i → w_i + Δw_i, with Δw_i = (t − y) x_i. Repeat until y = t. Rule: If the sum of the weighted inputs exceeds a threshold, output 1, else output −1: output = 1 if Σ_i input_i * weight_i > threshold; output = −1 if Σ_i input_i * weight_i < threshold. (Figure: inputs * weights → sum Σ x_i w_i → output.)
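The learning rule can be sketched as follows, under a few assumptions that are not on the slide: the threshold is folded into the weights as a weight on a constant input 1, the weights start at zero, and training stops after a full pass without errors or after a fixed cap of passes (`max_epochs`, a hypothetical parameter). The training set is logical AND with targets −1 and +1.

```python
def predict(w, x):
    # Output 1 if the weighted sum exceeds zero, else -1.
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if a > 0 else -1

def train_perceptron(samples, max_epochs=100):
    w = [0.0] * len(samples[0][0])
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            y = predict(w, x)
            if y != t:
                # Slide's rule: w_i -> w_i + (t - y) * x_i
                w = [wi + (t - y) * xi for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:  # every pattern satisfies y = t
            break
    return w

# First component is the constant input 1 that carries the threshold.
samples = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
w = train_perceptron(samples)
```

Because AND is linearly separable, the loop ends with all four patterns classified correctly; for a non-separable problem it would simply run until the cap.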
Interpretation of weights. Heaviside function has threshold at x=0. Decision boundary given by: a = w·x + w0 = w0 + w1 x1 + w2 x2 = 0. Thus: x2 = −(w0 + w1 x1)/w2. (Figure: decision line with axis intercepts at 1.5.)
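As a worked instance, take the AND unit with weights 1 and 1 and offset −1.5 (the values from the perceptron slide; treating them as w1, w2 and w0 here is an assumption about how the two slides connect):

```latex
\[
a = w_0 + w_1 x_1 + w_2 x_2 = -1.5 + x_1 + x_2 = 0
\quad\Longrightarrow\quad
x_2 = -\frac{w_0 + w_1 x_1}{w_2} = 1.5 - x_1 .
\]
```

The line crosses each axis at 1.5, so of the four binary inputs only (1, 1) lies on the positive side.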
Linear discriminant function: separation of 2 classes. A linear discriminant function is a mapping which partitions feature space using a linear function (straight line, or hyperplane). In D=2 dimensions the decision boundary is a straight line. Simple form of classifier: “separate two classes using a straight line in feature space”.
Separation of K classes • Perceptron can be used to discriminate between K classes by having K output nodes: • x is in class Cj if yj(x) >= yk(x) for all k • Resulting decision boundaries divide the feature space into convex decision regions. Weight to output j from input k is wjk: yj = g(Σ_k wjk xk + wj0). (Figure: inputs x1 … xd and a constant bias input connected to outputs y1 … yk through weights w11 … wkd and wk0; decision regions C1, C2, C3.)
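The K-class rule can be sketched as an argmax over the output nodes. The example assumes g is monotonically increasing, so comparing the weighted sums gives the same winner as comparing the outputs; the weights and test points are made up for illustration.

```python
def classify(weights, biases, x):
    """Return the index j of the output node with the largest score."""
    scores = [
        sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j
        for row, b_j in zip(weights, biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Three classes in a two-dimensional feature space (hypothetical weights).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]

labels = [classify(W, b, x) for x in [(2.0, 1.0), (0.0, 3.0), (-1.0, -2.0)]]
```

Each region where one linear score dominates is an intersection of half-planes, which is why the decision regions are convex.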
Network training via gradient descent. A set of training data from known classes is used in conjunction with an error function E(w) (e.g. squared difference between target t and response y) which must be minimized. Then: w_new = w_old − η ∇E(w), where ∇E(w) is a vector representing the gradient and η is the learning rate (small, positive). 1. Move downhill in direction −∇E(w) (steepest downhill since ∇E(w) is the direction of steepest increase) 2. Termination is controlled by
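A minimal sketch of gradient descent on a squared error, assuming a single linear unit y = w·x and the error E(w) = ½ Σ (t − y)². The stopping test on the gradient norm (`tol`) and the step cap are assumptions; the slide does not say how termination is decided.

```python
def gradient(w, data):
    # Gradient of E(w) = 1/2 * sum (t - w.x)^2 with respect to w.
    g = [0.0] * len(w)
    for x, t in data:
        y = sum(wi * xi for wi, xi in zip(w, x))
        for i, xi in enumerate(x):
            g[i] -= (t - y) * xi
    return g

def gradient_descent(data, eta=0.1, tol=1e-9, max_steps=10000):
    w = [0.0] * len(data[0][0])
    for _ in range(max_steps):
        g = gradient(w, data)
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break  # gradient is numerically zero: a minimum was reached
        w = [wi - eta * gi for wi, gi in zip(w, g)]  # w_new = w_old - eta * grad
    return w

# Targets follow t = 1 + 2 * x1; the first input component is a constant 1.
data = [((1.0, 0.0), 1.0), ((1.0, 1.0), 3.0), ((1.0, 2.0), 5.0)]
w = gradient_descent(data)
```

With a learning rate that is too large the iteration diverges instead of settling, which is why the rate is kept small and positive.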
Moving along the error function landscape • Equivalent to walking up and down hills • Problem: when to stop? • Local minima: possibility of multiple local minima. Note: for a single-layer perceptron, E(w) has only a single global minimum, so this is no problem. • Gradient descent goes to the closest local minimum. • General solution: random restarts from multiple places in weight space, or simulated annealing.
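The random-restart idea can be sketched on a toy one-dimensional error landscape with one local and one global minimum. The function, the starting interval, the seed and the step counts are all hypothetical choices for illustration, and this is plain random restarting, not simulated annealing.

```python
import random

def error(w):
    # Toy landscape: local minimum near w = 1.13, global minimum near w = -1.30.
    return w**4 - 3 * w**2 + w

def error_gradient(w):
    return 4 * w**3 - 6 * w + 1

def descend(w, eta=0.01, steps=2000):
    # Plain gradient descent from one starting point.
    for _ in range(steps):
        w -= eta * error_gradient(w)
    return w

def random_restarts(n_starts=20, seed=0):
    rng = random.Random(seed)
    finals = [descend(rng.uniform(-2.0, 2.0)) for _ in range(n_starts)]
    # Keep the end point with the lowest error over all restarts.
    return min(finals, key=error)

best_w = random_restarts()
```

A single run started on the right-hand side would stop in the shallower minimum; keeping the best of several runs makes it likely that at least one lands in the deeper basin.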
Training multi-layer NN via the back-propagation algorithm
Perceptron as Classifier. For d-dimensional data the perceptron consists of d weights, a bias and a thresholding activation function. For 2D data we have: a = w0 + w1 x1 + w2 x2, y = g(a) ∈ {−1, +1}; output = class decision. 1. Weighted sum of the inputs. 2. Pass through the Heaviside function: T(a) = −1 if a < 0, T(a) = 1 if a >= 0. View the bias as another weight from an input which is constantly on. If we group the weights as a vector w, the net output y is given by: y = g(w · x + w0). (Figure: inputs x1, x2 with weights w1, w2 and a constant input 1 with weight w0.)
Failure of the Perceptron. (Figure: footballers and academics plotted by success, successful or unsuccessful, against hours in the gym per week, few or many.) In this example, a perceptron would not be able to discriminate between the footballers and the academics, despite the simplicity of their relationship: Academics = Successful XOR Gym (XOR cannot be represented by a single threshold sigma node). This failure caused the majority of researchers to walk away.
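A short derivation shows why no single threshold unit can represent XOR. Assume the unit outputs 1 exactly when w0 + w1 x1 + w2 x2 > 0 (the 0/1 output convention is chosen here for convenience). The four XOR cases then require:

```latex
\begin{align*}
(0,0)\mapsto 0 &: \quad w_0 \le 0, \\
(1,0)\mapsto 1 &: \quad w_0 + w_1 > 0, \\
(0,1)\mapsto 1 &: \quad w_0 + w_2 > 0, \\
(1,1)\mapsto 0 &: \quad w_0 + w_1 + w_2 \le 0.
\end{align*}
Adding the second and third lines gives $2w_0 + w_1 + w_2 > 0$, while adding
the first and fourth gives $2w_0 + w_1 + w_2 \le 0$, a contradiction.
```

No choice of weights satisfies all four conditions, so the two classes cannot be separated by one straight line.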
Classification: Decision Trees. (Figure: decision tree that splits first on color (dark or light), then on #nuclei (1 or 2), then on #tails (1 or 2), with leaves labeled healthy or cancerous.)
Example of Classification from Neural Networks: “Which factors determine if a cell is cancerous?” (Figure: network with inputs Color = dark, # nuclei = 1, …, # tails = 2 and outputs Healthy, Cancerous.)
Feedforward NN. Letter recognition from writing on PC screen. Letter scanner – transformation into ASCII code.
SWN (small-world network) topology: Network of 5 Layers by 8 Neurons. Simulation with 5 neurons per layer and 8 layers. The NN was trained with 40 patterns for 50 different runs. Learning 40 patterns: the regular network almost fails to learn. With a few short-cuts the network learns well. The SWN architecture is better than the regular and random architectures.
Example: Restoring corrupted memory patterns. (Figure panels: original T; 20% of T corrupted; half is corrupted.) Use in search engines (AltaVista): search from incomplete or corrupted items.
Approximate solution of combinatorial optimization problem: Travelling Salesman Problem
Associative memory problem: Store a set of patterns. When presented with a pattern z, the network finds the stored pattern closest to z.
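What the associative memory is asked to compute can be stated as a brute-force search. The sketch below assumes binary ±1 patterns and takes "closest" to mean smallest Hamming distance; both are assumptions, and the stored patterns are made up. A network is meant to reach the same answer through its dynamics rather than by comparing bit by bit.

```python
def hamming(a, b):
    # Number of positions in which two patterns differ.
    return sum(x != y for x, y in zip(a, b))

def closest_pattern(stored, z):
    # Reference answer: the stored pattern with the fewest differing bits.
    return min(stored, key=lambda p: hamming(p, z))

stored = [
    (1, 1, 1, 1, -1, -1, -1, -1),
    (1, -1, 1, -1, 1, -1, 1, -1),
]
z = (-1, 1, 1, 1, -1, -1, -1, -1)  # first pattern with one bit flipped
match = closest_pattern(stored, z)
```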
Hopfield Model • Dynamical variables: neuron i (i=1,..,N) is represented by a variable S_i = ±1. • At any time t the network is determined by the state vector {S_i(t), i=1…N}. • Dynamical update rule: S_i(t+1) = sgn(Σ_j w_ij S_j(t)). • System evolves from a given state to some stable network state (attractor state, fixed point).
Hopfield Nets • Every node is connected to every other node. • Weights are symmetric (w_ij = w_ji) and there are no self-connections (w_ii = 0). • The flow of information is not unidirectional. • The state of the system is given by the node outputs.
Training of Hopfield Nets: Inspired by the biological Hebb’s rule (unsupervised) • Present the components of the patterns to be stored at the outputs of the corresponding nodes of the net. • If two nodes have the same value, then make a small positive increment to the internode weight. If they have opposite values, then make a small decrement to the internode weight.
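Storage and recall can be sketched together, assuming node values of ±1, a Hebbian increment of p_i p_j / N per stored pattern, zero self-connections, and asynchronous updates that set each node to the sign of its weighted input. These are the standard textbook choices rather than details given on the slides, and the two stored patterns are made up.

```python
def hebbian_weights(patterns):
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:  # no self-connections
                    # Same values raise the weight, opposite values lower it.
                    w[i][j] += p[i] * p[j] / n
    return w

def recall(w, state, max_sweeps=100):
    s = list(state)
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):  # update one node at a time
            h = sum(w[i][j] * s[j] for j in range(len(s)))
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i] = new
                changed = True
        if not changed:  # fixed point (attractor) reached
            break
    return s

patterns = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
w = hebbian_weights(patterns)
probe = [-1, 1, 1, 1, -1, -1, -1, -1]  # first pattern with one bit corrupted
restored = recall(w, probe)
```

The corrupted bit is flipped back on the first sweep and the state then stops changing, which is the associative retrieval the model is built for.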
Phase diagram of attractor network. Load: α = number of patterns / number of connections per node.
Character recognition. For a 256 x 256 character we have 65,536 pixels. One input for each pixel is not efficient. Reasons: • Poor generalisation: data set would have to be vast to be able to properly constrain all the parameters. • Takes a long time to train. • Answer: use averages of N^2 pixels for dimensionality reduction; each average could be a feature.
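The averaging step can be sketched as follows, assuming the N^2 pixels of each average form a non-overlapping N x N block and that the image is square with a side divisible by N. The function name and the tiny example image are illustrative.

```python
def block_averages(image, n):
    """Average each non-overlapping n x n block of a square image."""
    size = len(image)
    return [
        [
            sum(image[r][c] for r in range(i, i + n) for c in range(j, j + n))
            / (n * n)
            for j in range(0, size, n)
        ]
        for i in range(0, size, n)
    ]

# A 4 x 4 image reduced to 2 x 2 features with n = 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, 2],
]
features = block_averages(image, 2)
```

For a 256 x 256 character, 16 x 16 blocks would reduce the 65,536 pixel inputs to 256 features.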