Modular Neural Networks II Presented by: David Brydon Karl Martens David Pereira CPSC 533 - Artificial Intelligence Winter 2000 Instructor: C. Jacob Date: 16-March-2000
Presentation Agenda • A Reiteration Of Modular Neural Networks • Hybrid Neural Networks • Maximum Entropy • Counterpropagation Networks • Spline Networks • Radial Basis Functions • Note: The information contained in this presentation has been obtained from Neural Networks: A Systematic Introduction by R. Rojas.
A Reiteration of Modular Neural Networks There are many different types of neural networks - linear, recurrent, supervised, unsupervised, self-organizing, etc. Each of these neural networks has a different theoretical and practical approach. However, these different models can be combined. How? Each of the aforementioned neural networks can be transformed into a module that can be freely intermixed with modules of other types of neural networks. Thus, we have Modular Neural Networks.
A Reiteration of Modular Neural Networks • But WHY do we have Modular Neural Network Systems ? • To Reduce Model Complexity • To Incorporate Knowledge • To Fuse Data and Predict Averages • To Combine Techniques • To Learn Different Tasks Simultaneously • To Incrementally Increase Robustness • To Emulate Its Biological Counterpart
Hybrid Neural Networks • A very well-known and promising family of architectures was developed by Stephen Grossberg. • It is called ART - Adaptive Resonance Theory. • It is closer to the biological paradigm than feed-forward networks or standard associative memories. • The dynamics of the networks resembles learning in humans. • One-shot learning can be recreated with this model. • There are three different architectures in this family: • ART-1: Uses Boolean values • ART-2: Uses real values • ART-3: Uses differential equations
Hybrid Neural Networks Each category in the input space is represented by a vector. The ART networks classify a stochastic series of vectors into clusters. All vectors located inside the cone around a weight vector are considered members of that cluster. Each unit fires only for vectors located inside its associated ‘cone’ of radius ‘r’. The value ‘r’ is inversely proportional to the attention parameter of the unit. A small ‘r’ (high attention) means the classification of the input space is fine. A large ‘r’ (low attention) means the classification of the input space is coarse.
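A minimal sketch of the cone test described above, assuming unit-length input and weight vectors; the function name in_cone and the thresholded scalar product are illustrative assumptions, not taken from Rojas.

```python
import numpy as np

def in_cone(x, w, attention):
    """Return True if input x lies inside the cone around weight vector w.

    For unit-length vectors the cone test reduces to a threshold on the
    scalar product: a high attention value means a narrow cone (small r)
    and therefore a fine clustering; a low value means a wide cone.
    """
    return np.dot(x, w) >= attention

# One weight vector per cluster; classify a normalized input vector
weights = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
x = np.array([0.9, 0.436])
x = x / np.linalg.norm(x)
matches = [i for i, w in enumerate(weights) if in_cone(x, w, attention=0.8)]
print(matches)   # indices of clusters whose cone contains x (may be empty)
```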
Hybrid Neural Networks Fig. 1. Vector clusters and attention parameters
Hybrid Neural Networks • Once the weight vectors have been found, the network computes whether new data can or cannot be classified by the existing clusters. • If not, a new cluster is created with a new associated weight vector. • ART networks have two major advantages: • Plasticity: the network can always react to unknown inputs (by creating a new cluster with a new weight vector, if the given input cannot be classified by existing clusters). • Stability: existing clusters are not deleted by the introduction of new inputs (new clusters are simply created in addition to the old ones). • However, enough potential weight vectors must be provided.
Hybrid Neural Networks Fig. 2. The ART-1 Architecture
Hybrid Neural Networks The Structure of ART-1 (Part 1 of 2): There are two basic layers of computing units. Layer F1 receives binary input vectors from the input sites. As soon as an input vector arrives, it is passed to layer F1 and from there to layer F2. Layer F2 contains elements which fire according to the “winner-takes-all” method (only the element receiving the maximal scalar product of its weight vector and the input vector fires). When a unit in layer F2 has fired, the negative weight turns off the attention unit. The winning unit in layer F2 also sends back a 1 through the connections between layers F2 and F1. Now each unit in layer F1 receives as input the corresponding component of the input vector x and of the weight vector w.
Hybrid Neural Networks The Structure of ART-1 (Part 2 of 2): The i-th F1 unit compares xi with wi and outputs the product xiwi. The reset unit receives this information and also the components of x, weighted by p, the attention parameter, so that its own computation is p(x1+x2+…+xn) − x·w > 0, which is the same as (x·w)/(x1+x2+…+xn) < p. The reset unit fires only if the input lies outside the attention cone of the winning unit. A reset signal is sent to layer F2, but only the winning unit is inhibited. This in turn activates the attention unit and a new round of computation begins. Hence, there is resonance.
Hybrid Neural Networks The Structure of ART-1 (Some Final Details): The weight vectors in layer F2 are initialized with all components equal to 1, and p is selected to satisfy 0 < p < 1. This ensures that eventually an unused vector will be recruited to represent a new cluster. The selected weight vector w is updated by pulling it in the direction of x. This is done in ART-1 by turning off all components of w which are zero in x. The purpose of the reset signal is to inhibit all units that do not resonate with the input. A unit in layer F2 which is still unused can then be selected for the new cluster containing x. In this way, sufficiently different input data can create a new cluster. By modifying the value of the attention parameter p, we can control the number of clusters and how wide they are.
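The ART-1 cycle described on the last three slides can be summarized in a few lines of code. This is a hedged sketch rather than a reference implementation: winner selection uses the plain scalar product mentioned above, the resonance test uses (x·w)/(x1+…+xn) ≥ p, and the weight update switches off the components of w that are zero in x. Function and variable names are illustrative.

```python
import numpy as np

def art1_step(x, weights, p):
    """Present one binary vector x to an ART-1 style network.

    weights : list of binary weight vectors (unused units start as all ones)
    p       : attention (vigilance) parameter, 0 < p < 1
    Returns the index of the cluster that resonates with x.
    """
    candidates = list(range(len(weights)))
    while candidates:
        # F2: winner-takes-all over the units that have not been reset yet
        j = max(candidates, key=lambda i: np.dot(weights[i], x))
        w = weights[j]
        # Resonance test: (x . w) / (x1 + ... + xn) >= p
        if np.dot(w, x) / max(x.sum(), 1) >= p:
            weights[j] = w * x   # pull w toward x: switch off mismatched components
            return j             # resonance reached
        candidates.remove(j)     # reset signal: inhibit the winner, try the next unit
    raise RuntimeError("no unused weight vector left to recruit")

# Example: three potential units, all initialized to ones
weights = [np.ones(4, dtype=int) for _ in range(3)]
for x in [np.array([1, 1, 0, 0]), np.array([0, 0, 1, 1])]:
    print(art1_step(x, weights, p=0.7))
```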
Hybrid Neural Networks The Structure of ART-2 and ART-3 ART-2 uses vectors that have real-valued components instead of Boolean components. The dynamics of the ART-2 and ART-3 models are governed by differential equations. However, computer simulations of these equations consume too much time. Consequently, implementations using analog hardware or a combination of optical and electronic elements are better suited to this kind of model.
Hybrid Neural Networks Maximum entropy So what’s the problem with ART? It tries to build clusters of the same size, independently of the distribution of the data. So, is there a better solution? Yes: allow the clusters to have varying radii with a technique called the “Maximum Entropy Method”. What is “entropy”? The entropy H of a data set of N points assigned to k different clusters c1, c2, c3, …, ck is given by H = −( p(c1)log(p(c1)) + p(c2)log(p(c2)) + … + p(ck)log(p(ck)) ), where p(ci) denotes the probability of hitting the i-th cluster when an element of the data set is picked at random. Since the probabilities add up to 1, the clustering that maximizes the entropy is the one for which all cluster probabilities are identical. This means that the clusters will tend to cover the same number of points.
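As a small illustration of the entropy formula above, the following sketch computes H from a list of cluster assignments; the function name is illustrative.

```python
import numpy as np

def clustering_entropy(labels):
    """Entropy H = -sum_i p(c_i) log p(c_i) of a cluster assignment.

    labels : sequence of cluster indices, one per data point.
    H is maximal when all clusters receive the same share of the points.
    """
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

print(clustering_entropy([0, 0, 1, 1]))   # balanced clusters: log 2 ~ 0.693
print(clustering_entropy([0, 0, 0, 1]))   # unbalanced clusters: ~ 0.562
```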
Hybrid Neural Networks Maximum entropy However, there is still a problem whenever the number of elements of each class in the data set is different. Consider the case of unlabeled speech data: some phonemes are more frequent than others, and if a maximum entropy method is used, the boundaries between clusters will deviate from the natural solution and classify some data erroneously. So how do we solve this problem? With the “Bootstrapped Iterative Algorithm”: cluster: Compute a maximum entropy clustering with the training data. Label the original data according to this clustering. select: Build a new training set by selecting from each class the same number of points (random selection with replacement). Go to the previous step.
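A sketch of the bootstrapped iterative algorithm just described, assuming a clustering object with fit/predict methods standing in for a maximum entropy clustering routine; the interface and parameter names are illustrative assumptions.

```python
import numpy as np

def bootstrapped_clustering(data, clusterer, rounds=10, per_class=100, rng=None):
    """Alternate clustering and balanced resampling.

    data      : (N, d) array of unlabeled points
    clusterer : object with fit(points) and predict(points) -> labels,
                standing in for a maximum entropy clustering method
    per_class : points drawn with replacement from each class per round
    """
    rng = np.random.default_rng(rng)
    training = data
    for _ in range(rounds):
        # cluster step: fit on the current training set, label the original data
        clusterer.fit(training)
        labels = clusterer.predict(data)
        # select step: same number of points from each class, with replacement
        parts = []
        for c in np.unique(labels):
            members = data[labels == c]
            idx = rng.integers(0, len(members), size=per_class)
            parts.append(members[idx])
        training = np.concatenate(parts)
    return labels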
Hybrid Neural Networks Counterpropagation network Are there any other hybrid network models? Yes, the counterpropagation network, as proposed by Hecht-Nielsen. So what are counterpropagation networks designed for? To approximate a continuous mapping f and its inverse f-1. A counterpropagation network consists of an n-dimensional input vector which is fed to a hidden layer consisting of h cluster vectors. The output is generated by a single linear associator unit. The weights in the network are adjusted using supervised learning. The above network can successfully approximate functions of the form f: Rn -> R.
Hybrid Neural Networks Fig. 3 Simplified counterpropagation network
Hybrid Neural Networks • Counterpropagation network • The training phase is completed in two parts (see the sketch below): • Training of the hidden layer into a clustering of the input space that corresponds to an n-dimensional Voronoi tiling. The hidden layer's output needs to be controlled so that only the element with the highest activation fires. • The zi weights are then adjusted to represent the value of the approximation for the cluster region. • This network can be extended to handle multiple output units.
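A simplified illustration of the two training parts, not Hecht-Nielsen's original procedure: part 1 moves the cluster vectors toward the inputs they win (a Voronoi-like tiling), and part 2 sets each zi weight to the average target value in its region. Names such as train_counterprop are illustrative.

```python
import numpy as np

def train_counterprop(X, y, h=10, epochs=50, lr=0.1, rng=None):
    """Minimal counterpropagation-style training for f: R^n -> R."""
    rng = np.random.default_rng(rng)
    W = X[rng.choice(len(X), size=h, replace=False)].copy()  # cluster vectors
    z = np.zeros(h)                                          # output weights
    # Part 1: cluster the input space (Voronoi-like tiling, winner-takes-all)
    for _ in range(epochs):
        for x in X:
            j = np.argmin(np.linalg.norm(W - x, axis=1))
            W[j] += lr * (x - W[j])
    # Part 2: set z_j to the average function value over each cluster region
    winners = np.array([np.argmin(np.linalg.norm(W - x, axis=1)) for x in X])
    for j in range(h):
        if np.any(winners == j):
            z[j] = y[winners == j].mean()
    return W, z

def predict(W, z, x):
    """Output of the single linear associator: the winning cluster's z value."""
    return z[np.argmin(np.linalg.norm(W - x, axis=1))]
```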
Hybrid Neural Networks • Fig. 4 Function approximation with a counterpropagation network.
Hybrid Neural Networks • Spline networks • Can the approximation created by a counterpropagation network be improved on? Yes. • In the counterpropagation network the Voronoi tiling is composed of a series of horizontal tiles, each of which represents an average of the function in that region. • The spline network solves this problem by extending the hidden layer of the counterpropagation network: each cluster unit is paired with a linear associator, and the cluster unit is used to inhibit or activate its linear associator, which is connected to all inputs. • This modification allows the resulting set of tiles to be oriented differently with respect to each other, creating an approximation with a smaller quadratic error and a better solution to the problem. • Training proceeds as before, except that the newly added linear associators are trained using backpropagation (a sketch follows below).
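A hedged sketch of the spline-network refinement described above: each tile's linear associator is trained only when its cluster unit wins, here with a simple delta-rule update standing in for backpropagation on the active associator. All names are illustrative.

```python
import numpy as np

def train_spline_net(X, y, W, epochs=100, lr=0.05):
    """Piecewise-linear refinement of a counterpropagation tiling.

    W : cluster vectors found beforehand (e.g. as in the counterpropagation
        sketch). Each cluster j gets its own linear associator (a_j, b_j);
        the winning cluster unit activates it, and only that associator is
        updated on the current example.
    """
    h, n = W.shape
    A = np.zeros((h, n))   # linear associator weights, one row per tile
    b = np.zeros(h)        # per-tile bias
    for _ in range(epochs):
        for x, t in zip(X, y):
            j = np.argmin(np.linalg.norm(W - x, axis=1))   # winning tile
            out = A[j] @ x + b[j]                          # active associator only
            err = t - out
            A[j] += lr * err * x                           # delta-rule update
            b[j] += lr * err
    return A, b
```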
Hybrid Neural Networks • Fig. 5 Function approximation with linear associators
Hybrid Neural Networks • Radial basis functions • Radial basis function networks have a similar structure to the counterpropagation network. The difference is that the activation function used for each unit is Gaussian instead of sigmoidal. • The Gaussian approach uses locally concentrated functions. • The sigmoidal approach uses a smooth step function. • Which is better depends on the specific problem at hand. If the target function is a smooth step, the Gaussian approach will require more units, whereas if the target function is Gaussian-like, the sigmoidal approach will require more units.
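For contrast, a minimal sketch of the two unit types mentioned above; parameter names are illustrative.

```python
import numpy as np

def gaussian_unit(x, center, sigma=1.0):
    """Locally concentrated (radial basis) activation around a center."""
    return np.exp(-np.sum((x - center) ** 2) / (2 * sigma ** 2))

def sigmoid_unit(x, w, b=0.0):
    """Smooth-step (sigmoidal) activation along the direction w."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))
```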