Heterogeneous adaptive systems Włodzisław Duch & Krzysztof Grąbczewski Department of Informatics, Nicholas Copernicus University, Torun, Poland. http://www.is.umk.pl
Why is this important? MLPs are universal approximators, but are they the best choice? Wrong bias => poor results, complex networks. No single method may achieve the best results for all datasets. 2-class problems, two situations: (1) Class 1 inside the sphere, Class 2 outside. MLP: at least N+1 hyperplanes, O(N^2) parameters. RBF: 1 Gaussian, O(N) parameters. (2) Class 1 in the corner defined by the (1,1 ... 1) hyperplane, Class 2 outside. MLP: 1 hyperplane, O(N) parameters. RBF: many Gaussians, O(N^2) parameters, poor approximation. Combination: needs both hyperplane and hypersphere!
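A minimal sketch of the first situation, under stated assumptions: the toy data (1000 uniform points in 5D, unit sphere at the origin) are hypothetical, the radial unit is simply given the true center and radius, and the single hyperplane is fitted by least squares rather than by MLP training. It only illustrates that one radial unit with N+1 parameters matches the spherical class exactly, while one hyperplane cannot.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
X = rng.uniform(-1.0, 1.0, size=(1000, N))

# Hypothetical toy problem: class 1 inside a sphere of radius 1, class 2 outside.
inside = np.linalg.norm(X, axis=1) < 1.0

# One radial unit: N center coordinates + 1 radius = N+1 parameters.
center, radius = np.zeros(N), 1.0
radial_pred = np.linalg.norm(X - center, axis=1) < radius
radial_acc = np.mean(radial_pred == inside)

# One hyperplane: N weights + 1 bias, fitted by least squares to +/-1 targets
# (an assumption made for brevity; this is not backpropagation training).
A = np.hstack([X, np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(A, np.where(inside, 1.0, -1.0), rcond=None)
plane_pred = A @ w > 0
plane_acc = np.mean(plane_pred == inside)

print(f"radial unit: {radial_acc:.3f}, single hyperplane: {plane_acc:.3f}")
```

Swapping the roles (class 1 in a corner cut off by a hyperplane) would favor the single hyperplane in the same way.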
Inspirations Logical rule: IF x1>0 & x2>0 THEN Class1 ELSE Class2 is properly represented neither by MLP nor by RBF! Result: decision trees and logical rules perform significantly better than MLPs on some datasets (cf. hypothyroid)! Speed of learning and network complexity depend on the transfer functions (TF). Fast learning requires flexible "brain modules" - TF. • Biological inspirations: sigmoidal neurons are a crude approximation to the lowest neural level. • Interesting brain functions are done by interacting minicolumns, implementing complex functions. • Human categorization: never so simple. • Modular networks: networks of networks. • First step beyond single neurons: transfer functions providing flexible decision borders.
Heterogeneous systems Homogeneous systems: one type of “building blocks”, same type of decision borders. Ex: neural networks, SVMs, decision trees, kNNs ... Committees combine many models together, but lead to complex models that are difficult to understand. Discovering the simplest class structures and the appropriate inductive bias requires heterogeneous adaptive systems (HAS). Ockham's razor: simpler systems are better. HAS examples: NN with many types of neuron transfer functions; k-NN with different distance functions; DT with different types of test criteria.
TF in Neural Networks Choices with selection of optimal functions: • Homogeneous NN: select the best TF, try several types. Ex: RBF networks; SVM kernels (may give 50=>80% change). • Heterogeneous NN: one network, several types of TF. Ex: Adaptive Subspace SOM (Kohonen 1995), linear subspaces. Projections on a space of various basis functions. • Input enhancement: adding f_i(X) to achieve separability. Ex: functional link networks (Pao 1989), tensor products of inputs; D-MLP model. Heterogeneous: 1. Start from a large network with different TF, use regularization to prune. 2. Construct the network by adding nodes selected from a pool of candidates. 3. Use very flexible TF, force them to specialize.
Most flexible TFs Conical functions: mixed activations. Lorentzian: mixed activations. Bicentral: separable functions.
Optimal Transfer Function network OTF-NN, based on the IncNet ontogenic network architecture (N. Jankowski): statistical criteria for pruning/growth + Kalman filter learning. XOR solutions found with: 2 Gaussian functions; 1 Gaussian + 1 sigmoidal function; 2 sigmoidal functions; 1 Gaussian with G(W·X) activation.
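Two of the XOR solutions listed above can be checked by hand. The sketch below is not the OTF-NN or IncNet procedure; the centers, weights, width sigma=0.5 and output threshold 0.5 are hand-picked assumptions that only show such solutions exist.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR targets

def gaussian(r2, sigma=0.5):
    """Gaussian of a squared activation r2; sigma is a hand-picked width."""
    return np.exp(-r2 / (2.0 * sigma**2))

def xor_two_gaussians(X):
    # Two radial units centered on the two class-1 points (assumed centers).
    centers = np.array([[0.0, 1.0], [1.0, 0.0]])
    r2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return (gaussian(r2).sum(axis=1) > 0.5).astype(int)

def xor_one_gaussian_wx(X):
    # One unit with a Gaussian output applied to the projection W.X:
    # W = (1, 1), shifted so that the line x1 + x2 = 1 gives maximum output.
    w, shift = np.array([1.0, 1.0]), 1.0
    return (gaussian((X @ w - shift) ** 2) > 0.5).astype(int)

print(xor_two_gaussians(X), xor_one_gaussian_wx(X))
```

The single G(W·X) unit works because both class-1 points lie on the line x1 + x2 = 1, so a Gaussian of the projected activation forms a ridge along that line.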
OTF for half sphere/subspace 2D and 10D problems considered, 2000 points. OTF starts with 3 Gaussian + 3 sigmoidal functions. 2-3 neuron solutions found, 97.5-99% accuracy. Simplest solution: 1 Gaussian + 1 sigmoid. 3 sigmoidal functions: acceptable solution.
Heterogeneous FSM Feature Space Mapping: neurofuzzy ontogenic network, selects a separable localized transfer function from a pool of several types of functions. Test problem: rotated halfspace + Gauss. Simplest solution found: 1 Gaussian + 1 rectangular function. In 5D and 10D it needs many points.
Similarity-based HAS Local distance functions optimized differently in different regions of feature space. Weighted Minkowski distance functions (ex: α=20) and other types of functions, including probabilistic functions, changing piecewise linear decision borders. RBF networks with different transfer functions; LVQ with different local functions.
HAS decision trees Decision trees select the best feature/threshold value for univariate and multivariate trees. Decision borders: hyperplanes. Introducing tests based on the L_α Minkowski metric: for L_2 spherical decision borders are produced; for L_∞ rectangular borders are produced. Many choices, for example Fisher Linear Discrimination decision trees.
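A minimal sketch of a distance-based test, assuming the usual Minkowski form D(X,R) = (Σ_i (s_i |X_i - R_i|)^α)^(1/α). The function names, the placement of the scaling factors s_i inside the power, and the example point are assumptions made for illustration.

```python
import numpy as np

def minkowski(x, r, alpha=2.0, scale=None):
    """Weighted Minkowski distance between x and reference r.

    alpha=2 gives the Euclidean distance; alpha=np.inf gives the maximum norm.
    scale holds optional per-feature factors s_i (assumed to act inside the power).
    """
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(r, dtype=float))
    if scale is not None:
        diff = diff * np.asarray(scale, dtype=float)
    if np.isinf(alpha):
        return diff.max(axis=-1)
    return (diff ** alpha).sum(axis=-1) ** (1.0 / alpha)

def distance_test(x, r, threshold, alpha=2.0):
    """Tree test: True when x lies within `threshold` of the reference r."""
    return minkowski(x, r, alpha) < threshold

origin = [0.0, 0.0]
corner_point = [0.9, 0.9]  # outside the unit circle, inside the unit square
for alpha in (2.0, 20.0, np.inf):
    print(alpha, distance_test(corner_point, origin, 1.0, alpha))
```

The same point fails the L_2 test (spherical border) but passes for α=20 and L_∞, whose borders are nearly or exactly rectangular.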
SSV HAS DT Define left and right areas for test T with threshold s: Count how many pairs of vectors from different classes are separated and how many vectors from the same class are separated.
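A minimal sketch of a separability count of this kind. The slide's formula is not preserved in this text, so the weighting used here (twice the number of separated different-class pairs, minus, for each class, the size of the smaller part when that class is split) is an assumption, as are the function name `ssv` and the convention that values below s go to the left area.

```python
import numpy as np

def ssv(values, labels, s):
    """Separability of a split at threshold s (assumed form, see lead-in)."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    left = values < s          # left area of the test
    right = ~left              # right area of the test
    separated_pairs = 0        # pairs from different classes on opposite sides
    same_class_split = 0       # vectors of one class cut off from their class
    for c in np.unique(labels):
        in_c = labels == c
        left_c = int(np.sum(left & in_c))
        right_c = int(np.sum(right & in_c))
        right_other = int(np.sum(right & ~in_c))
        separated_pairs += left_c * right_other
        same_class_split += min(left_c, right_c)
    return 2 * separated_pairs - same_class_split

values = [1.0, 2.0, 3.0, 4.0]
labels = [0, 0, 1, 1]
print(ssv(values, labels, 2.5))  # clean split between the classes
print(ssv(values, labels, 1.5))  # split that cuts class 0 in two
```

The clean split scores 8 (four separated pairs, nothing split), while the split at 1.5 scores 3 (two separated pairs, one vector of class 0 cut off).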
SSV HAS algorithm Compromise between complexity/flexibility: • Use training vectors for reference R. • Calculate T_R(X)=D(X,R) for all data vectors, i.e. the distance matrix. • Use T_R(X) as additional test conditions. • Calculate SSV(s) for each condition and select the best split. Different distance functions lead to different decision borders. Several distance functions are used simultaneously. Test: 2000 points, noisy 10D plane, rotated 45°, + half-sphere centered on the plane. Standard SSV tree: 44 rules, 99.7%. HAS SSV tree (Euclidean): 15 rules, 99.9%.
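The four steps above can be sketched as a single split search. This is an illustration under assumptions, not the authors' implementation: the `ssv` form is the assumed one (twice the separated different-class pairs minus split same-class vectors), candidate thresholds are midpoints between consecutive feature values, only the Euclidean distance is used, and the 7x7 grid with a central class is hypothetical data.

```python
import numpy as np

def ssv(values, labels, s):
    """Assumed separability criterion for the test `values < s`."""
    left = values < s
    score = 0
    for c in np.unique(labels):
        in_c = labels == c
        left_c = int(np.sum(left & in_c))
        right_c = int(np.sum(~left & in_c))
        right_other = int(np.sum(~left & ~in_c))
        score += 2 * left_c * right_other - min(left_c, right_c)
    return score

def best_split(features, labels):
    """Return (score, column, threshold) of the best single test."""
    best = (-np.inf, None, None)
    for col in range(features.shape[1]):
        v = np.unique(features[:, col])
        for s in (v[:-1] + v[1:]) / 2.0:   # midpoints between sorted values
            score = ssv(features[:, col], labels, s)
            if score > best[0]:
                best = (score, col, s)
    return best

# Hypothetical toy data: 7x7 grid, class 1 = the 9 points nearest the origin.
grid = np.arange(-3, 4) / 3.0
X = np.array([[a, b] for a in grid for b in grid])
y = (np.linalg.norm(X, axis=1) < 0.5).astype(int)

# Steps 1-2: every training vector is a reference R; T_R(X) = D(X, R).
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
# Step 3: distance columns are added to the original features as test conditions.
features = np.hstack([X, D])
# Step 4: evaluate SSV for every condition and threshold, keep the best.
score, col, s = best_split(features, y)
axis_score, _, _ = best_split(X, y)
print(score, col, round(float(s), 3), axis_score)
```

On this toy problem the best test is a distance to the grid point at the origin, which separates the classes in one split, while the best axis-parallel test scores much lower.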
SSV HAS Iris Iris data: 3 classes, 50 samples/class. SSV solution with the usual conditions (6 errors, 96%), or with distance tests using vectors from a given node only:
1. if petal length < 2.45 then class 1
2. if petal length > 2.45 and petal width < 1.65 then class 2
3. if petal length > 2.45 and petal width > 1.65 then class 3
SSV with Euclidean distance tests using all training vectors as reference (5 errors, 96.7%):
1. if petal length < 2.45 then class 1
2. if petal length > 2.45 and ||X-R15|| < 4.02 then class 2
3. if petal length > 2.45 and ||X-R15|| > 4.02 then class 3
||X-R15|| is the Euclidean distance to the vector R15.
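The first rule set (usual conditions) translates directly into code. In this sketch the function name is hypothetical, values exactly on a threshold are sent to the upper branch (the rules leave this case open), and the three measurements are illustrative values, not rows quoted from the dataset. The distance-based rules are not reproduced because the vector R15 is not given here.

```python
def classify_iris(petal_length, petal_width):
    """Apply the three SSV rules; measurements in cm, returns class 1, 2 or 3."""
    if petal_length < 2.45:
        return 1
    if petal_width < 1.65:
        return 2
    return 3

# Illustrative measurements (assumed, not taken from the data file).
print(classify_iris(1.4, 0.2), classify_iris(4.5, 1.5), classify_iris(5.5, 2.0))

# Accuracy figures quoted above, recomputed from the error counts.
n_samples = 3 * 50
acc_usual = 1 - 6 / n_samples      # 6 errors
acc_distance = 1 - 5 / n_samples   # 5 errors
print(f"{acc_usual:.1%}, {acc_distance:.1%}")
```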
SSV HAS Wisconsin Wisconsin breast cancer dataset (UCI): 699 cases, 9 features (cell parameters, 1..10). Classes: benign 458 (65.5%) & malignant 241 (34.5%). Single rule gives the simplest known description of this data: IF ||X-R303|| < 20.27 THEN malignant ELSE benign; 18 errors, 97.4% accuracy. Good prototype for malignant! Simple thresholds, that's what MDs like the most! Best leave-one-out accuracy 98.3% (FSM), best 10CV around 97.5% (Naïve Bayes + kernel, SVM). C4.5 gives 94.7±2.0%. SSV without distances: 96.4±2.1%. Several simple rules of similar accuracy are created in CV tests.
Conclusions Heterogeneous systems are worth investigating. Good biological justification of the HAS approach. Better learning cannot repair the wrong bias of a model. StatLog report: large differences between RBF and MLP on many datasets. Networks, trees, kNN should select/optimize their functions. Radial and sigmoidal functions in NN are not the only choice. Simple solutions may be discovered by HAS. Open questions: How to train heterogeneous systems? How to find the optimal balance between complexity and flexibility, e.g. complexity of nodes vs. interactions (weights)? Hierarchical, modular networks: nodes that are networks themselves.
The End? Perhaps still the beginning ...