Neural Coding What Kind of Information is Represented in a Neural Network and How?
Outline • ANN Basics • Local -vs- Distributed Representations • The Emergence/Learning of Salient Features in Neural Networks • Feed-Forward Neural Networks Embody Mappings • Linearly Separable Mappings • Classification in Spaces that are NOT Linearly Separable • Coding Heuristics • Hopfield Networks • Summary
NeuroPhysiology
[Figure: a neuron, with nucleus, dendrites, axon, and synapses labeled]
• Dense: the human brain has ~10^11 neurons.
• Highly interconnected: human neurons have a fan-in of ~10^4.
• Firing: a neuron sends action potentials (APs) down its axon when it is sufficiently stimulated by the SUM of the incoming APs along its dendrites.
• Neurons can either stimulate or inhibit other neurons.
• Synapses vary in transmission efficiency.
• Development: formation of the basic connection topology.
• Learning: fine-tuning of the topology + major changes in synaptic efficiency.
• The matrix IS the intelligence!
NeuroComputing
[Figure: a network of nodes connected by weighted arcs w_i, w_j, w_k]
• Nodes fire when sum(weighted inputs) > threshold.
• Other node varieties are common: unthresholded linear, sigmoidal, etc.
• Connection topologies vary widely across applications.
• Weights vary in magnitude & sign (stimulate or inhibit).
• Learning = finding the proper topology & weights.
• Learning is a search process in the space of possible topologies & weights.
• Most ANN applications assume a fixed topology.
• The matrix IS the learning machine!
Tasks & Architectures
[Figure: three input-output architectures - a feed-forward net, a clique with both excitatory & inhibitory arcs, and a maxnet (clique with only inhibitory arcs)]
• Supervised Learning: feed-forward networks
• Concept learning: inputs = properties, outputs = classification
• Controller design: inputs = sensor readings, outputs = effector actions
• Prediction: inputs = previous X values, outputs = predicted future X value
• Learn the proper weights via back-propagation
• Unsupervised Learning
• Pattern recognition: Hopfield networks
• Data clustering: competitive networks
Node Types
[Figure: node j receives inputs x_1 ... x_n over weights w_1 ... w_n and outputs x_j = f_T(net_j)]
• Most ANNs use nodes that sum their weighted inputs: net_j = Σ_i w_ji x_i, output x_j = f_T(net_j).
• But many types of transfer function f_T are used:
• Thresholded (discontinuous): step, ramp
• Non-thresholded (continuous, differentiable): linear, sigmoid
Transfer Functions
[Figure: four plots of x_j versus net_j - step, ramp, linear, and sigmoidal transfer functions]
• Step functions are useful in classifier nets, where data partitioning is important.
• Linear & sigmoidal functions are everywhere differentiable, and thus popular for backprop nets.
• The sigmoidal has the most biological plausibility.
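As an illustration (mine, not from the slides), here is a minimal Python sketch of the four transfer functions applied to a node's weighted input sum; the function names and the default threshold of 0 are assumptions.

```python
import math

def net(weights, inputs):
    """Weighted input sum net_j = sum_i w_ji * x_i."""
    return sum(w * x for w, x in zip(weights, inputs))

def step(net_j, theta=0.0):
    """Thresholded, discontinuous: fires +1 above theta, else -1."""
    return 1.0 if net_j > theta else -1.0

def ramp(net_j):
    """Thresholded, piecewise linear: clipped to [-1, 1]."""
    return max(-1.0, min(1.0, net_j))

def linear(net_j):
    """Unthresholded, continuous."""
    return net_j

def sigmoid(net_j):
    """Continuous, differentiable, squashes into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_j))

# Example node: weights (0.5, 0.5), inputs (1, -1) => net_j = 0
n = net([0.5, 0.5], [1.0, -1.0])
print(step(n), ramp(n), linear(n), sigmoid(n))
```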
Learning = Weight Adjustment
[Figure: node i sends x_i over weight w_j,i to node j, which has output x_j and learning signal z_j]
• Generalized Hebbian weight adjustment: the sign of the weight change = the sign of the correlation between x_i and z_j:  Δw_ji ∝ x_i z_j
• where z_j is:
• x_j  (Hopfield networks)
• d_j − x_j  (Perceptrons, d_j = desired output)
• d_j − Σ_i x_i w_ji  (ADALINEs, d_j = desired output)
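A minimal sketch (my own, not from the slides) of the generalized update Δw_ji = η · x_i · z_j with the three choices of learning signal z_j listed above; the learning rate η and the function names are assumptions.

```python
def hebbian_update(w_j, x, z_j, eta=0.1):
    """Generalized Hebbian rule: delta w_ji = eta * x_i * z_j."""
    return [w + eta * x_i * z_j for w, x_i in zip(w_j, x)]

def z_hopfield(x_j, d_j, net_j):
    return x_j            # correlate the input with the node's own state

def z_perceptron(x_j, d_j, net_j):
    return d_j - x_j      # error on the thresholded output

def z_adaline(x_j, d_j, net_j):
    return d_j - net_j    # error on the raw weighted sum

# Example: one node with 3 incoming weights, desired output d_j = 1
w_j, x = [0.2, -0.4, 0.1], [1.0, -1.0, 1.0]
net_j = sum(w * xi for w, xi in zip(w_j, x))
x_j = 1.0 if net_j > 0 else -1.0
print(hebbian_update(w_j, x, z_perceptron(x_j, 1.0, net_j)))
```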
Cellular Automata
[Figure: a grid at step N and step N+1, with the update rule: if a cell has exactly 2 red neighbors, it changes to red; else it changes to green]
• Distributed representations: picture copying.
• Update rule: if an odd number of neighbors are on, turn on; else turn off.
• In CAs and ANNs, you need to learn to think differently about representation!
Local -vs- Distributed Representations
• Assume examples/concepts have 3 features:
• Age: {Young, Middle, Old}
• Sex: {Male, Female}
• Marital Status: {Single, Samboer (cohabiting), Married}
[Figure: groups of neurons coding concepts such as "Young, Married Female!" and "Old, Female Samboer!" under the three schemes]
• Local: one neuron represents an entire conjunctive concept.
• Semi-Local: together the neurons represent a conjunctive concept, and each neuron represents one or a few conjuncts, i.e. the concept is broken into clean pieces.
• Distributed: together the neurons represent a conjunctive concept, but the individual conjuncts cannot necessarily be localized to single neurons.
Local -vs- Distributed (2)
• Size requirements to represent the whole set of 18 3-feature concepts, assuming binary (on/off) neurons:
• Local: 3 x 3 x 2 = 18 neurons. An instance is EXACTLY 1 of the 18 neurons being on.
• Semi-Local: 3 + 3 + 2 = 8 neurons (assuming one feature value per neuron). An instance is EXACTLY 3 of the 8 neurons being on.
• Distributed: ⌈log2 18⌉ = 5 neurons. An instance is any combination of on/off neurons.
• Add 1 bit and you DOUBLE the representational capacity, so each concept can be represented by 2 different codes (redundancy).
• The same neural network (artificial or real) may have different types of coding in different regions of the network.
[Figure: a semi-local code (neurons for Young, Old, Single, Married, Male, Female) feeding, via weights such as +5, +1, +3, a local neuron for "Young, Married Female!" - Semi-Local => Local]
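The three size figures can be checked with a few lines of Python (a sketch of my own; the feature definitions mirror the slide):

```python
import math

features = {"Age": 3, "Sex": 2, "Marital Status": 3}    # number of values per feature

n_concepts = math.prod(features.values())                # 3 * 3 * 2 = 18 conjunctive concepts

local       = n_concepts                                 # one neuron per whole concept -> 18
semi_local  = sum(features.values())                     # one neuron per feature value -> 8
distributed = math.ceil(math.log2(n_concepts))           # binary code over 5 neurons

print(local, semi_local, distributed)                    # 18 8 5
```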
Representational Hierarchies
• In the brain, neurons involved in early processing are often semi-local, while neurons occurring later along the processing path (i.e. higher-level neurons) are often local.
• In simpler animals, there appears to be a lot of local coding. In humans, it is still debatable.
[Figure: a processing hierarchy - dark dot @ {3°, 28°} → line tilted 45° @ {3°, 28°} → human face → "Grandma!!"]
Vector Coding
• An organism's sensory apparatus uses vector coding as a representation of its inputs.
• This is semi-local coding, since the components of a conjunctive concept are localized to individual neurons.
• A particular color, flavor, sound, etc. = a vector of receptor states (not a single receptor state).
• Combinatorics: n^k possible vector states, where k = # receptors and n = # possible states per receptor. Note: n > 2 in many cases.
• The fact that humans are much better at discriminating sensory inputs than actually describing them illustrates the relative density of sensory vector space -vs- the sparseness of language.
[Figure: a tongue tasting "Tyrkisk Peber", coded as a vector of receptor activations (sour, salty, bitter, sweet), e.g. (0.1, 0.8, 0.2, 0.9)]
Comparison of Coding Forms
• Compact representation: Local (NO!), Distributed (YES!)
• Graceful degradation (the code still works when a few neurons are faulty): Local (NO!), Distributed (YES - due to redundancy).
• Binding problem (how to represent two concepts that occur simultaneously): Local (EASY! - two active nodes), Distributed (HARD - but may be possible by quick shifts back and forth between the 2 activation patterns).
• E.g. "Where's Waldo": It is easy to pick out a human face among a bunch of round objects, or your mother's face among a bunch of other faces, indicating that we probably have relatively local codes for these all-important concepts. But it is VERY HARD to find Waldo (i.e. a generic-faced cartoon man with a red-and-white striped shirt) in a crowd of several hundred generic cartoon characters wearing all sorts of colors & patterns. Why? "Red-and-white stripes" is probably not locally coded in the human brain and hence not quickly/effortlessly detected; it probably shares neurons with concepts such as "stripe", "red", "white", etc.
• In more complex animals, all 3 coding forms are probably present, with local coding for the concepts most salient to that organism.
Species-Specific Saliency
• The key stimuli for an organism are often locally or semi-locally encoded, with direct connections from the detector neuron(s) to a motor (action-inducing) neuron.
• [Figure] The movement of this simple pattern resembles a hawk and scares small chicks. The movement of the reverse pattern resembles a goose and elicits no response from the chicks.
Fish Dinner
• Three-spined sticklebacks respond to these simple stimuli: [figure: crude models with red bellies]
• But not these: [figure: realistic models without the red belly]
• Salient feature: the red belly!
Toad Turn-ons
[Figure: two plots against stimulus length - the toad's # turns and the firing rate of T5(2) neurons - for worm, anti-worm, and square stimuli]
• The behavioral response (i.e. the number of times the toad turns around per minute) as a function of the length of the stimulus is mirrored by the firing rates of neurons in the T5(2) region of its brain.
Emergent Salience • Animal bodies and brains have evolved to maximize the odds of survival and reproduction (i.e., fitness). Both are tailored to the survival task at hand. • Hence salient features will emerge (via evolution and learning) as the activating conditions for various neurons. When fired, those neurons will then help to initiate the proper (motor) response to a salient input. • Similarly, if an ANN is given a task and the ability to adapt (i.e. learn and/or evolve), the salient features of that task will emerge as the activating conditions for hidden-layer and output neurons. • Salient features can then be read off the input weights to those neurons. • So, the only features that need to be given to the ANN are the very primitive ones at the input layer. The rest are discovered!
Face Recognition
• Animals differ in their abilities to discriminate sounds, tastes, smells, colors, etc.
• Humans are very good at discriminating faces, at least faces of the type that they grow up around.
• Hypothesized # of dimensions in face-coding space = 20 (Churchland).
[Figures (Churchland, pp. 28 & 34): face space and face morphing - choose evenly-spaced points along the vector that connects the source & target faces]
ANN for Face Recognition
• Garrison Cottrell et al. (1991)
• Feed-forward net with backprop learning
[Figure: network architecture (Churchland, p. 40)]
Training & Testing
• Training: 64 photos of 11 different faces + 13 non-face photos.
• Performance criteria: classify each picture as to:
• Face or non-face?
• Male or female?
• Name?
• Results:
• Training accuracy: 100%
• Test with the same faces but new pictures: 98%
• Test with new faces: 100% (face vs non-face), 81% (male vs female)
• Test with a known face but with 20% of the picture erased:
• Vector completion: the firing patterns of the middle-layer neurons are very similar to the patterns produced when the non-erased image is presented. Hence, in its understanding of the pictures, the ANN fills in the missing parts.
• Generally good performance, but erased foreheads caused problems (71% recognition).
• Holons: middle-layer nodes represent generic combi-faces instead of individual features.
Combi-Faces (Holons) at Hidden Nodes
[Figure: hidden nodes A, B, C with incoming weights such as +2, +6, +1, +7, -3, +5 from regions of the input image]
The incoming weights to a node indicate what it "prefers":
• Likes eyes at the positions shown
• Has a slight preference for noses right below and between the eyes
• Prefers smiles over frowns
• "Turned on" by sexy movie-star cheek moles
Node B's Dream Face
[Figure: node B's preferred stimulus; darker color => higher preference]
• Similar methods can be used for interpreting the concepts represented by ANN nodes.
Facial Holons
• Preferred stimuli: by looking at the signs of the input weights to a hidden node, we can construct a prototypical input vector that the node would fire on. E.g. if w_ji > 0, then x_i > 0 is desired, and if w_ji < 0, then x_i < 0 is desired.
• Doing this for each of the 80 hidden nodes of the face net yields an interesting set of hybrid faces as preferred stimuli (Churchland, p. 48).
• Enhanced robustness: since recognition of particular features is now spread over many hidden nodes/holons, the network can still successfully recognize faces if a node or two are inoperable. Each input case satisfies a subset of the 80 holons, i.e. each input case is a combination of holons.
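A small sketch (mine, not Cottrell's code) of reading a hidden node's preferred stimulus off its incoming weights, as described above; inputs are assumed to lie in [-1, 1], and the weight values are hypothetical.

```python
def preferred_stimulus(incoming_weights):
    """Construct a prototypical input the node would fire strongly on:
    x_i = +1 where w_ji > 0, x_i = -1 where w_ji < 0, 0 where the weight is 0."""
    return [1.0 if w > 0 else (-1.0 if w < 0 else 0.0) for w in incoming_weights]

def activation(incoming_weights, x):
    """The node's weighted input sum for a given input vector x."""
    return sum(w * xi for w, xi in zip(incoming_weights, x))

# Example hidden node with 6 incoming weights (hypothetical values)
w = [0.7, -0.2, 0.0, 1.3, -0.9, 0.4]
proto = preferred_stimulus(w)
print(proto)                    # [1.0, -1.0, 0.0, 1.0, -1.0, 1.0]
print(activation(w, proto))     # maximal sum for inputs bounded by |x_i| <= 1
```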
How Realistic is it?
• Anatomical:
• In the brain, 5 levels of synapses connect the retina to the (known) region of face coding.
• But those 5 levels perform many other tasks too.
• Functional:
• ANNs trained with many more Asian than Caucasian faces were much better at discriminating the latter than the former.
• "They all look alike" is a result of past experiences and their effects upon the observer's neural development, not of any objective differences in homogeneity within the different races.
• Similar ANNs were also trained to recognize emotional states in the faces.
• Results were promising (~80% accuracy in the test phase), but the acting ability of the student subjects was very poor, so better results can be expected.
• Emotion recognition is a VERY important aspect of human social behavior.
Neural Nets as Mappings
• The main application of feed-forward ANNs is to learn a general function (mapping) F between a particular domain (D) and range (R) when given a set of examples {(d, r): d in D, r in R}.
• D and R may contain vectors or scalars.
[Figure: domain elements d1-d4 mapped by F to range elements r1-r3; example set = {(d1,r3), (d2,r1), (d3,r2), (d4,r2)}]
• Goal: build an ANN that can take ANY element d of D on its input layer and produce F(d) on its output layer.
• Problem: the example set normally represents a very small fraction of the complete mapping set (which may be infinite).
Sensorimotor Coordination: Mapping Sensations to Actions
[Figure: the senses produce an input vector, the brain maps it to an output vector of desired muscle activation levels, which drives the muscles]
• Intelligent physical behavior: performance of the proper motor movements in response to the current sensory stimuli.
• "A large and well-defined brain is just evolution's latest and highest achievement in sensorimotor coordination, not its earliest or only example..." (Churchland, pp. 95-6)
• Vector processing: transformation of sensory input vectors into motor output vectors.
• Coordinated behavior: the proper sequence of muscle activations = a proper trajectory in output-vector space (e.g. XC skiing, walking, running).
ANN for Crab Control
• Simple feed-forward net that maps points in visual space to points in claw-angle space.
• 93% accurate.
• Simple, one-shot movement - assumes the muscles snap into the proper position.
[Figure: Churchland, p. 94]
Classification = Mapping
• M: Features => Classes
[Figure: a feed-forward net with input features (weight, hibernate?, habitat, max speed, coat type), a hidden layer, and output classes (bear, sheep, horse)]
Classification
[Table/figure: five example cases, each with inputs x and y, the perceptron's weighted sum, and its class (+1 or -1), classified by a perceptron with weights Wx = -1, Wy = 1, Wz = -5]
• The perceptron should compute the proper class for each input x-y pair.
• For a single perceptron, this is only possible when the input vectors are linearly separable.
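A minimal sketch of how such a perceptron classifies x-y pairs. The weights Wx = -1, Wy = 1, Wz = -5 come from the slide; treating Wz as a bias on a constant input of 1 and taking the class as the sign of the weighted sum are my assumptions.

```python
def perceptron_class(x, y, wx=-1.0, wy=1.0, wz=-5.0):
    """Threshold unit: class = +1 if wx*x + wy*y + wz > 0, else -1."""
    s = wx * x + wy * y + wz * 1.0    # wz weights a constant bias input of 1
    return (s, +1 if s > 0 else -1)

# A few sample x-y pairs (hypothetical, not the slide's table)
for x, y in [(1, 3), (3, 9), (5, 5)]:
    print((x, y), perceptron_class(x, y))
```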
Simple Boolean Functions
• Encoding: True = +1, False = -1
[Figure: single perceptrons implementing AND, OR, ~AND, ~OR, and NOT, using input weights of magnitude 0.5 (1 for NOT) and small threshold/bias terms (0, ±0.3, ±0.8)]
Linear Separability of Booleans
• OR:  0.5x + 0.5y + 0.3 > 0  <=>  x + y > -0.6  <=>  y > -x - 0.6
• AND: 0.5x + 0.5y - 0.8 > 0  <=>  x + y > 1.6   <=>  y > -x + 1.6
[Figure: the four corners of the ±1 square plotted for AND (only (+1,+1) positive) and OR (only (-1,-1) negative), each separated by the corresponding line]
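These two inequalities can be checked directly over the four ±1 input corners (a quick sketch of my own):

```python
corners = [(-1, -1), (-1, 1), (1, -1), (1, 1)]

for x, y in corners:
    fires_or  = 0.5 * x + 0.5 * y + 0.3 > 0   # OR: false only at (-1, -1)
    fires_and = 0.5 * x + 0.5 * y - 0.8 > 0   # AND: true only at (1, 1)
    print((x, y), "OR:", fires_or, "AND:", fires_and)
```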
XOR
[Figure: XOR on the ±1 square - positive at (1,-1) and (-1,1) - and a two-layer net of AND and OR units (weights ±0.5, bias terms -0.8 and 0.3) that computes it]
• XOR is NOT linearly separable => more than 1 perceptron is needed.
• All boolean functions can be represented by a feed-forward ANN with 2 layers or less.
• Proof: all boolean functions can be expressed as a conjunction of disjunctions (CNF) => the disjuncts form layer 1 & the conjunct forms layer 2.
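Following the CNF proof sketch above - XOR = (x OR y) AND (NOT x OR NOT y) - here is a minimal two-layer version over ±1 inputs (my own weights, in the spirit of the AND/OR units on the earlier slide):

```python
def step(s):
    """Threshold unit over the ±1 encoding: +1 (true) if s > 0, else -1 (false)."""
    return 1 if s > 0 else -1

def xor(x, y):
    # Layer 1 (the disjuncts of the CNF (x OR y) AND (NOT x OR NOT y)):
    h1 = step(0.5 * x + 0.5 * y + 0.3)       # x OR y
    h2 = step(-0.5 * x - 0.5 * y + 0.3)      # (NOT x) OR (NOT y)
    # Layer 2 (the conjunct): AND of the two hidden units
    return step(0.5 * h1 + 0.5 * h2 - 0.8)

for x, y in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print((x, y), "->", xor(x, y))           # -1, +1, +1, -1
```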
Linear Separability of Reals
[Figure: positive and negative real-valued instances plotted in the x-y plane (roughly -10..10 on both axes), separated by the line y = x - 3]
• Positives satisfy: y < x - 3  <=>  y - x + 3 < 0  <=>  x - y - 3 > 0
• A single perceptron f(x,y) with weights 1 (for x), -1 (for y) and threshold/bias term -3 outputs a 1 for all positive instances, and a -1 for all negative instances.
• When one hyperplane separates all positive from negative examples, a single perceptron can be the classifier.
Separable by N Hyperplanes
[Figure: positive and negative instances in the x-y plane, separable only by a combination of the three lines L1, L2, L3]
• L1: y = x    L2: y = -x + 5    L3: y = -4x + 30
• Classification of positive instances:
• C1: Above L1 & Below L2, OR
• C2: Above L1 & Above L3, OR
• C3: Below L1 & Above L2 & Below L3
ANN Component Nodes
[Figure: one perceptron per condition, with the weights and thresholds implied by the inequalities below]
• 1a. Above L1: y > x        <=>  y - x > 0
• 1b. Below L1: y < x        <=>  x - y > 0
• 2a. Above L2: y > -x + 5   <=>  y + x - 5 > 0
• 2b. Below L2: y < -x + 5   <=>  -x - y + 5 > 0
• 3a. Above L3: y > -4x + 30 <=>  y + 4x - 30 > 0
• 3b. Below L3: y < -4x + 30 <=>  -4x - y + 30 > 0
The Complete ANN
[Figure: a three-layer network]
• Hyperplane layer: nodes 1a (threshold 0), 1b (0), 2a (5), 2b (-5), 3a (30), 3b (-30), each weighting x and y as on the previous slide.
• AND layer: C1 (threshold 1.5), C2 (1.5), C3 (2.5), each combining its hyperplane nodes with weights of 1.
• OR layer: f(x,y) (threshold -1.5), combining C1, C2, C3 with weights of 1.
A Simpler ANN
• Assume: "Below Li" is replaced by "Not-Above-Li".
[Figure: the same three-layer network, but with only the Above nodes 1a (0), 2a (5), 3a (30) in the hyperplane layer; the AND nodes C1 (1.5), C2 (1.5), C3 (2.5) use a weight of -1 (instead of +1) on the arcs where "Below" is required, and the OR node f(x,y) (-1.5) is unchanged]
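Here is a runnable sketch of the simpler network (my own code; the thresholds and weights are read off the figures above, with each node's bias written as a subtracted threshold):

```python
def step(s):
    return 1 if s > 0 else -1

def above(x, y, a, b, threshold):
    """Hyperplane node: fires (+1) when a*x + b*y > threshold, i.e. above the line."""
    return step(a * x + b * y - threshold)

def classify(x, y):
    # Hyperplane layer: "above" detectors for L1: y = x, L2: y = -x + 5, L3: y = -4x + 30
    h1 = above(x, y, -1.0, 1.0, 0.0)     # 1a: y - x > 0
    h2 = above(x, y, 1.0, 1.0, 5.0)      # 2a: y + x - 5 > 0
    h3 = above(x, y, 4.0, 1.0, 30.0)     # 3a: y + 4x - 30 > 0
    # AND layer: a weight of -1 encodes "Not-Above" (i.e. Below)
    c1 = step(1.0 * h1 - 1.0 * h2 - 1.5)              # Above L1 & Below L2
    c2 = step(1.0 * h1 + 1.0 * h3 - 1.5)              # Above L1 & Above L3
    c3 = step(-1.0 * h1 + 1.0 * h2 - 1.0 * h3 - 2.5)  # Below L1 & Above L2 & Below L3
    # OR layer (threshold -1.5)
    return step(1.0 * c1 + 1.0 * c2 + 1.0 * c3 + 1.5)

print(classify(1.0, 3.0), classify(9.0, 10.0), classify(6.0, -2.0))  # +1 +1 -1 under these region rules
```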
Sigmoidals & Linear Separability
[Figure: the sigmoidal transfer function x_j = S(net_j), and the same x-y scatter of positive/negative points as before]
• Using a sigmoidal transfer function (which is non-linear) does not drastically change the nature of linear-separability analysis.
• It just introduces a wider linear separator (a "gray area"), which only creates problems when points lie within it.
• So if a set of points is not linearly separable using linear threshold transfer functions, then adding nonlinear sigmoidal transfer functions will not help!
• Linear: x - y - 3 > 0.  Sigmoidal: S(x - y - 3). Given an x, the sigmoidal unit outputs higher values for lower values of y.
Hidden Layer Design Decisions
• Number of hidden layers & nodes:
• Too few => can't partition the data properly.
• Too many => the partitions are too detailed => over-specialized for the training set => can't generalize to handle new cases.
• Layer-by-layer view: Inputs = points in space; step-function nodes = hyperplanes; AND nodes = convex regions; OR nodes = groups of regions.
Input Encoding for Feed-Forward Networks
• Reals => scaled values in [0 1] or [-1 1]
• Colors => pixel intensities => scaled values in [0 1]
• Symbols => integers => scaled values in [0 1] or [-1 1], e.g. (small, medium, large) => (.2 .5 .8)
• Number of input nodes per input vector element:
• One node per element, or
• One node per discrete subrange of the element's possible values
[Figure: a single input node x for Age, scaled to [-1 1], connected to node y by a single weight w_yx]
• No matter how we choose w_yx, node y is forced to treat old age inversely to the way it treats youth. In fact, it must treat all ages in a linear fashion, since there is only 1 weight relating all ages to y.
Input Encodings (2)
[Figure: Age encoded by three input nodes x1 (Young), x2 (Middle), x3 (Old), each in [-1 1] and connected to node y by its own weight w_yx1, w_yx2, w_yx3]
• With discrete classes for an input element, nodes in the next layer are free to treat different ranges of the input in different (possibly non-linear) ways, since the incoming arcs from each input class can have different weights.
• So if w_yx1 = 0, w_yx2 = 5 and w_yx3 = 1, node y is very sensitive to middle age, mildly sensitive to old age, and insensitive to youth. This would be a useful discrimination to make when diagnosing job-related stress, for example.
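A small sketch (mine) contrasting the two encodings of Age described above. The weight values (0, 5, 1) are the slide's; the age range, the subrange cutoffs, and the use of a simple 0/1 indicator for the discrete case are my assumptions.

```python
def encode_scalar(age, lo=0, hi=100):
    """One node per element: Age scaled linearly into [-1, 1] (range is an assumption)."""
    return 2.0 * (age - lo) / (hi - lo) - 1.0

def encode_discrete(age):
    """One node per subrange (Young, Middle, Old); a 0/1 indicator is used here for
    readability, though the slide's figure scales each node into [-1, 1]."""
    if age < 35:
        return [1, 0, 0]      # Young (cutoffs are assumptions)
    elif age < 60:
        return [0, 1, 0]      # Middle
    else:
        return [0, 0, 1]      # Old

w_scalar = 2.0                      # a single weight must treat all ages linearly
w_discrete = [0.0, 5.0, 1.0]        # w_yx1, w_yx2, w_yx3 from the slide

for age in (25, 45, 70):
    net_scalar = w_scalar * encode_scalar(age)
    net_discrete = sum(w * x for w, x in zip(w_discrete, encode_discrete(age)))
    print(age, round(net_scalar, 2), net_discrete)   # discrete: 0 for young, 5 middle, 1 old
```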
[0 1] -vs- [-1 1]
[Figure: inputs I and C feed node L over weights W_LI and W_LC]
Example:
• I = yearly income (scaled to [0 1] or [-1 1]).
• C = credit history (scaled the same way); the low end denotes bad (untrustworthy), the high end good.
• L = should the person be given a loan? Yes = 1, No = 0 or -1.
• Assume L fires (and outputs a 1) if its weighted sum of inputs is at least 1.
• Assume a customer has a bad credit history (i.e. has not paid back a few loans).
• Assume W_LC = W_LI = +1, which makes intuitive sense, since both should contribute positively to the loan decision.
• If bad credit => C = 0, then L can still fire if I = 1.
• If bad credit => C = -1, then L cannot fire.
• So by using -1 (instead of 0) as the lower bound, the left end of the scale can have a strong influence on the excitation (if the connecting weight is negative) or inhibition (if that weight is positive) of the downstream node. In short, both ends of the scale have similar (but opposite) effects upon the downstream node.
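The arithmetic of the loan example as a quick sketch (the "fires at a weighted sum of at least 1" rule is the slide's assumption; the function name is mine):

```python
def loan_node(income, credit, w_li=1.0, w_lc=1.0, threshold=1.0):
    """Node L fires (outputs 1) if its weighted input sum reaches the threshold."""
    return 1 if w_li * income + w_lc * credit >= threshold else 0

# Bad credit encoded as 0 (on a [0, 1] scale): high income can still trigger the loan.
print(loan_node(income=1.0, credit=0.0))    # 1 -> fires

# Bad credit encoded as -1 (on a [-1, 1] scale): it actively inhibits node L.
print(loan_node(income=1.0, credit=-1.0))   # 0 -> cannot fire
```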
Output Encodings
• Similar to input encodings.
• Whether to use a 1-of-n encoding is a key issue:
• More weights to train,
• But greater discriminability.
• Take account of the range of f_T of the output nodes. E.g. sigmoids output values in (0 1).
Mapping Thoughts to Actions in the Brain
• The cerebellum, which controls a good deal of motor activity, has a feed-forward structure with few backward (i.e., recurrent) connections.
• The cerebrum sends commands to initiate action, which are fed forward from mossy fibers to granule cells to parallel fibers to Purkinje cells and out to motor neurons.
[Figure: signal flow from the cerebral neocortex ("thought") via mossy fibers, granule cells, parallel fibers, and Purkinje cells to the motor cortex ("action!"), with climbing fibers arriving from the inferior olive; arrows denote signal direction]
Distributed Coding in the Motor Cortex
[Figure: pyramidal cells A and B projecting to motor neurons, and a plot of their firing rates as a function of motion angle]
• Cortical area #4 = the motor cortex (M1).
• Pyramidal cells in M1 get inputs from the cortex & thalamus; they send outputs to motor neurons.
• But pyramidals => motor neurons is a many-to-many (N-to-N) mapping.
• So during any particular movement, MANY pyramidal and motor neurons are firing. I.e. movement coding is DISTRIBUTED across the pyramidal cells.
Associative-Memory Networks
• Input: a pattern (often noisy/corrupted). Output: the corresponding pattern (complete / relatively noise-free).
• Process:
• Load the input pattern onto a core group of highly-interconnected neurons.
• Run the core neurons until they reach a steady state.
• Read the output off of the states of the core neurons.
• Example: Input (1 0 1 -1 -1) => Output (1 -1 1 -1 -1)
Distributed Information Storage & Processing
[Figure: a highly interconnected network with weights w_i, w_j, w_k]
• Information is stored in the weights, with:
• Concepts/patterns spread over many weights and nodes.
• Individual weights holding info for many different concepts.
Hebb's Rule: Connection Weights ~ Correlations
• "When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell." (Hebb, 1949)
• In an associative neural net, if we compare two pattern components (e.g. pixels) within many patterns and find that they are frequently in:
• a) the same state, then the arc weight between their NN nodes should be positive;
• b) different states, then the arc weight between their NN nodes should be negative.
• Matrix memory: the weights must store the average correlations between all pattern components across all patterns. A net presented with a partial pattern can then use the correlations to recreate the entire pattern.
Correlated Field Components
• Each component is a small portion of the pattern field (e.g. a pixel).
• In the associative neural network, each node represents one field component.
• For every pair of components, their values are compared in each of several patterns.
• Set the weight on the arc between the NN nodes for the 2 components ~ their average correlation.
[Figure: components a and b compared across several patterns; their average correlation determines the weight w_ab between nodes a and b]
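Tying the last few slides together, here is a minimal associative-memory (Hopfield-style) sketch: weights are set to the average correlations over the stored patterns, and a noisy pattern is run to a steady state. This is my own illustration under the usual conventions (±1 states, zero self-weights, asynchronous updates), not code from the lecture.

```python
import random

def train(patterns):
    """Hebbian/correlation weights: w_ab = average over patterns of a*b (no self-weights)."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for a in range(n):
            for b in range(n):
                if a != b:
                    w[a][b] += p[a] * p[b] / len(patterns)
    return w

def recall(w, state, steps=20):
    """Run the core neurons until they (hopefully) reach a steady state."""
    state = list(state)
    n = len(state)
    for _ in range(steps):
        for j in random.sample(range(n), n):         # update nodes in random order
            net = sum(w[j][i] * state[i] for i in range(n))
            state[j] = 1 if net >= 0 else -1
    return state

patterns = [[1, -1, 1, -1, -1],
            [-1, 1, 1, 1, -1]]
w = train(patterns)
noisy = [1, 0, 1, -1, -1]                            # corrupted version of the first pattern
print(recall(w, noisy))                              # expected: [1, -1, 1, -1, -1]
```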