k-separability

k-separability Włodzisław Duch Department of InformaticsNicolaus Copernicus University, Torun, Poland School of Computer Engineering, Nanyang Technological University, Singapore Google: Duch ICANN 2006, Athens

Plan • What is the greatest challenge of CI? Learning all that can be learned! • Surprise: our learning methods are not able to learn most (approximate) logical functions! • SVMs, kernels and all that. • Neural networks. • Learning dynamics. • New goal of learning: k-separability • How to learn any function?

GhostMiner Philosophy GhostMiner, data mining tools from our lab + Fujitsu: http://www.fqspl.com.pl/ghostminer/ • Separate the process of model building (hackers) and knowledge discovery, from model use (lamers) => GhostMiner Developer & GhostMiner Analyzer • There is no free lunch – provide different type of tools for knowledge discovery: decision tree, neural, neurofuzzy, similarity-based, SVM, committees. • Provide tools for visualization of data. • Support the process of knowledge discovery/model building and evaluating, organizing it into projects. • We are building completely new tools ! Surprise! Almost nothing can be learned using such tools!

Easy problems • Linearly separable in the original feature space: linear discrimination is sufficient (always worth trying!). • Simple topological deformation of decision borders is sufficient – linear separation is then possible in extended (higher dimensional) or transformed spaces. • This is frequently sufficient for pattern recognition problems. • RBF/MLP networks with one hidden layer also solve such problems easily, but convergence/generalization for anything more complex than XOR is problematic. SVM adds new features to “flatten” the decision border: achieving larger margins/separability in the X+Z space.

Difficult problems Complex logic, disjoint “sandwiched” clusters. Networks with localized functions (RBF) need exponentially large number of nodes and do not generalize. Networks with non-local functions (MLP) may also need large number of nodes and will have problems with convergence. Continuous deformation of decision borders is not sufficient to achieve separability (or near-separability). This is typical in logical problems, AI reasoning, real perhaps in perception, object recognition, text analysis, bioinformatics ... Boolean functions: for n bits there are K=2n binary vectors Bi that can be represented as vertices of n-dimensional hypercube. Each Boolean function is identified by K bits. BoolF(Bi) = 0 or 1 for i=1..K, for 2KBoolean functions. Ex: n=2 functions, vectors {00,01,10,11}, Boolean functions {0000, 0001 ... 1111}, are identified by decimal numbers 0 to 15.

Lattice projection of a hypercube is defined in W1=[1,1,..1] direction and orthogonal direction W2 that maximizes separation of the projected points with fixed number of 1 bits. Lattice projection for n=3, 4 Projection on 111 ... 111 gives clusters with 0, 1, 2 ... n bits.

Boolean functions n=2, 16 functions, 12 separable, 4 not separable. n=3, 256 f, 104 separable (41%), 152 not separable. n=4, 64K=65536, only 1880 separable (3%) n=5, 4G, but only << 1% separable ... bad news! Existing methods may learn some non-separable functions, but most functions cannot be learned ! Example: n-bit parity problem; all nearest neighbors are from the opposite class! Many papers on learning parity in top journals. For all parity problems SVM is below base rate! No off-the-shelf systems are able to solve such problems. Such problems are solved only by special neural architectures or special classifiers – if the type of function is known. Ex: parity problems are solved by

What feedforward NN really do? Vector mappings from the input space to hidden space(s) and to the output space. Hidden-Output mapping done by perceptrons. A single hidden layer case is analyzed below. T= {Xi}training data, N-dimensional. H = {hj(Xi)}Ximage in the hidden space, j=1 .. NH-dim. Y = {yk{h(Xi)}Ximage in the output space, k=1 .. NC-dim. ANN goal: scatterograms of T in the hidden space should be linearly separable; internal representations will determine network generalization capabilities and other properties.

What happens inside? Many types of internal representations may look identical from outside, but generalization depends on them. • Classify different types of internal representations. • Take permutational invariance into account: equivalent internal representations may be obtained by re-numbering hidden nodes. • Good internal representations should form compact clusters in the internal space. • Check if the representations form separable clusters. • Discover poor representations and stop training. • Analyze adaptive capacity of networks. • .....

RBF for XOR Is RBF solution with 2 hidden Gaussians nodes possible? Typical architecture: 2 input – 2 Gauss – 2 linear. Perfect separation, but not a linear separation! 50% errors. Single Gaussian output node solves the problem. Output weights provide reference hyperplanes (red and green lines), not the separating hyperplanes like in case of MLP. Output codes (ECOC): 10 or 01 for green, and 00 for red.

3-bit parity For RBF parity problems are difficult; 8 nodes solution: 1) Output activity; 2) reduced output, summing activity of 4 nodes. 3) Hidden 8D space activity, near ends of coordinate versors. 4) Parallel coordinate representation. 8 nodes solution has zero generalization, 50% errors in tests.

3-bit parity in 2D and 3D Output is mixed, errors are at base level (50%), but in the hidden space ... quite easy to separate, but not linearly. Conclusion: separability is perhaps too much to desire ... inspection of clusters is sufficient for perfect classification; add second Gaussian layer to capture this activity; just train second RBF on this data (stacking)!

Goal of learning Linear separation is a good goal only if simple topological deformation of decision borders is sufficient; then neural networks or SVMs may solve the problem. Difficult problems: disjoint clusters, complex logic. Linear separation is difficult, set an easier goal. Linear separation: projection on 2 half-lines in the kernel or hidden activity space: line y=WX, with y<0 for class – and y>0 for class +. Simplest extension: separation into k-intervals instead of 2; other extensions: projection on a chessboard or other pattern. For parity: find direction W with minimum # of intervals, y=W.X

Some remarks Biological justification: • neuron is not a good model for elementary processor; • cortical columns may learn to stimuli with complex logic react resonating in different way; • the second column will learn without problems that such different reactions have the same meaning. Algorithms: • Some results on perceptron capacity for multistep neurons are known and projection on a line may be achieved using such neurons, but each step is a different class, known targets. • k-rep learning is not a multistep output neuron, targets are not known, same class vectors may appear in different intervals! • We need to learn how to find intervals and how to assign them to classes; new algorithms are needed to learn it!

k-sep learning Try to find lowest k with good solution, start from k=2. • Assume k=2 (linear separability), try to find good solution; • if k=2 is not sufficient, try k=3; two possibilities are C+,C-,C+ and C-, C+, C-this requires only one interval for the middle class; • if k<4 is not sufficient, try k=4; two possibilities are C+, C-, C+, C-and C-, C+, C-, C+this requires one closed and one open interval. Network solution is equivalent to optimization of specific cost function. Simple backpropagation solved almost all n=4 problems for k=2-5 finding lowest k with such architecture!

s(W.X+q1) X1 +1 y=W.X +1 X2 s(W.X+q2) +1 -1 X3 +1 +1 +1 X4 -1 s(W.X+q4) Neural k-separability Can one learn all Boolean functions? Problems may be classified as 2-separable (linear separability); non separable problems may be broken into k-separable, k>2. Neural architecture for k=4 intervals. Will BP learn? Blue: sigmoidal neurons with threshold, brown – linear neurons.

Even better solution? What is needed to learn Boolean functions? • cluster non-local areas in the X space, use W.X • capture local clusters after transformation, use G(W.X-q) SVM cannot solve this problem! Number of directions W that should be considered grows exponentially with size of the problem n. Constructive neural network solution: • Train the first neuron using G(W.X-q) transfer function on whole data T, capture the largest pure cluster TC . • Train next neuron on reduced data T 1=T-TC • Repeat until all data is handled; they creates transform. X=>H • Use linear transformation H => Y for classification. Tests: ongoing, but look well!

Summary • Difficult learning problems arise when non-connected clusters are assigned to the same class. • No off-shelf classifiers are able to learn difficult Boolean functions. • Brain have no problem to learn such mapping, ex. different character representation to the same speech sound: a, A , あ, ア, a, Զ, א, அ and others ... • Visualization of activity of the hidden neurons shows that frequently perfect but non-separable solutions are found despite base-rate outputs. • Linear separability is not the best goal of learning. • Other targets that allow for easy handling of final non-linearity should be defined.

Summary cont. • Simplest extension is to isolate non-linearity in form of k intervals. • k-separability allows to break non-separable problems into well defined classes; quite different than VC theory. • For Boolean problems k-separability finds simplest data model with linear projection and k parameters defining intervals. • k-separability may be used in kernel space; many algorithms should be explored. • Tests with simplest backpropagation optimization learned difficult Boolean functions. Prospects for systems that will learn all Boolean functions are good!

Thank youfor lending your ears ... Google: Duch => Papers

k-separability

k-separability

Presentation Transcript

Linear Separability

Separability

Beyond Linear Separability

Separability —Section 10.5

K K K

Neural Separability

K K K K

K k

Introducing the Separability Matrix for ECOC coding

Separability and Topology Control of Quasi Unit Disk Graphs

K,k,

Separability and entanglement: what symmetries and geometry can say

K k

K k

K k

SECTION 7 - SEPARABILITY

K&K Bakery