Integrated Learning in Multi-net Systems Neural Computing Group Department of Computing University of Surrey Matthew Casey 6th February 2004 http://www.computing.surrey.ac.uk/personal/st/M.Casey/
Introduction • In-situ learning in multi-net systems • Classification • Parallel combination of networks: ensemble • Sequential combination of networks: modular • Simulation • Parallel and sequential combination of networks • Quantification and addition • Formal framework and algorithm for multi-net systems
Introduction • Learning: • “process which leads to the modification of behaviour”1 • Biological motivation • Hebb’s neurophysiological postulate2 • Learning across cell assemblies: neural integration • Functional specialism: analogy to multi-net systems • Theoretical motivation • Generalisation improvements with multi-net systems • Ensemble and modular • Learning in collaboration with modularisation
Single-net Systems • Systems of one or more artificial neurons combined together in a single network • Parallel distributed processing3 • Learning to generalise • Learning algorithms • Supervised: delta4,5,6, backpropagation7,8,9 • Unsupervised: Hebbian2, SOM10
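The delta (Widrow-Hoff) rule cited above can be illustrated with a minimal sketch; the task (OR in ±1 encoding), learning rate and epoch count are illustrative choices, not taken from the talk:

```python
def delta_train(samples, lr=0.1, epochs=50):
    """Train one linear unit with the delta (Widrow-Hoff) rule:
    weights follow the error measured on the raw linear output."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            y = w[0] * x[0] + w[1] * x[1] + b   # linear activation
            err = target - y                    # delta-rule error term
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

# OR in +/-1 encoding is linearly separable, so the rule converges
data = [((1, 1), 1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]
w, b = delta_train(data)
predict = lambda x: 1 if w[0] * x[0] + w[1] * x[1] + b >= 0 else -1
```

Thresholding the trained linear output then classifies all four OR patterns correctly; the same single unit cannot do this for XOR, which motivates the next slide.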
Single-nets as Multi-nets? • XOR: True (1), False (-1) • [Figure: XOR truth table over inputs in {-1, 1}, and the combination of linear decision boundaries that separates the two classes]
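The slide's point, that XOR becomes solvable once linear decision boundaries are combined inside one network, can be made concrete with hand-fixed weights; the specific weight values below are an illustrative choice:

```python
def sign(v):
    """Threshold activation over the +/-1 encoding used on the slide."""
    return 1 if v >= 0 else -1

def xor_net(x1, x2):
    # Hidden layer: two linear decision boundaries
    h1 = sign(x1 + x2 + 1)    # OR: false only when both inputs are -1
    h2 = sign(-x1 - x2 + 1)   # NAND: false only when both inputs are +1
    # Output unit combines the two half-planes: AND(h1, h2)
    return sign(h1 + h2 - 1)
```

Each hidden unit alone is a single linear boundary; only their combination separates the XOR classes.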
From Single-nets to Multi-nets • Multi-net systems appear to be a development of the parallel processing paradigm • Can multi-net systems improve generalisation? • Modularisation with simpler networks? • Limited theoretical and empirical evidence • Generalisation: • Balance prior knowledge and training • VC Dimension11,12,13 • Bias/variance dilemma14
Multi-net Systems:Ensemble or Modular? • Ensemble systems: • Parallel combination • Each network performs the same task • Simple ensemble • AdaBoost15 • Modular systems: • Each network performs a different (sub-)task • Mixture-of-experts16 (top-down parallel competitive) • Min-max17 (bottom-up static parallel/sequential)
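The simple ensemble's parallel combination reduces to averaging the member outputs; a sketch, where the member "networks" are stand-in callables rather than trained models:

```python
def ensemble_output(members, x):
    """Simple ensemble: every member performs the same task, and the
    combined output is the mean of the member outputs."""
    outputs = [net(x) for net in members]
    return sum(outputs) / len(outputs)

# Three hypothetical members that agree imperfectly on an input
members = [lambda x: 0.9, lambda x: 0.6, lambda x: 0.9]
combined = ensemble_output(members, None)  # mean of member outputs, ~0.8
```

Modular systems differ in exactly this step: mixture-of-experts replaces the fixed mean with a gating network's input-dependent weighting.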
Categorising Multi-net Systems • Sharkey’s18,19 combination strategies: • Parallel: co-operative or competitive top-down or bottom-up static or dynamic • Sequential • Supervisory • Component networks may be20: • Pre-trained (independent) • Incrementally trained (iterative) • In-situ trained (simultaneous)
Multi-net Systems • Categorisation schemes appear not to support the generalisation of multi-net system properties beyond specific examples • Ensemble: bias, variance and diversity21 • Modular: bias and variance • What about measures such as the VC Dimension? • Some use of in-situ learning • ME and HME22 • Negative correlation learning23
Research • Multi-net systems seem to offer a way in which generalisation performance and learning speed can be improved: • Yet limited theoretical and empirical evidence • Focus on parallel systems • Limited use of in-situ learning despite motivation • Existing results show improvement • Can the approach be generalised? • No general framework for multi-net systems • Difficult to generalise properties from categorisation
Research • Explore in-situ learning in multi-net systems: • Parallel: in-situ learning in the simple ensemble • Sequential: combining networks with in-situ learning • Does in-situ learning provide improved generalisation? • Can we combine ‘simple’ networks to solve ‘complex’ problems: ‘superordinate’ systems with faster learning? • Propose a formal framework for multi-net systems • A method to describe the architecture and learning algorithm for a general multi-net system
Multi-net System Framework • Previous work: • Framework for the co-operation of learning algorithms24 • Stochastic model25 • Importance Sampled Learning Ensembles (ISLE)26 • Focus on supervised learning and specific architectures • Jordan and Xu’s (1995) definition of HME27: • Generalisation of HME • Abstraction of architecture from algorithm • Theoretical results for convergence
Multi-net System Framework • Propose a modification to Jordan and Xu’s definition of HME to provide a generalised multi-net system framework • HME combines the output of the expert networks through a weighting generated by a gating network • Replace the weighting by the (optional) operation of a network • Can be used for parallel, sequential and supervisory systems
Definition • A multi-net system consists of an ordered tree of depth r, defined by a set of nodes, with the root of the tree associated with the output of the system, such that:
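The mathematical symbols in this definition were lost in extraction; a hedged reconstruction of the intended recursive form, generalising Jordan and Xu's HME as described on the previous slide (the notation v, y, f, c is assumed, not from the original):

```latex
% Nodes v_i form an ordered tree of depth r; each node has an output y_i.
% A leaf node is a component network applied to the system input x; an
% internal node (optionally) applies its own network f_i to the outputs
% of its c_i children -- replacing HME's gating-generated weighting.
\[
y_i =
\begin{cases}
f_i(x) & \text{if } v_i \text{ is a leaf,}\\[4pt]
f_i\!\left(y_{i_1}, \ldots, y_{i_{c_i}}\right) & \text{otherwise,}
\end{cases}
\qquad y = y_1 \ \text{(output at the root).}
\]
```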
Multi-net System Framework • Learning algorithm operates by modifying the state of the system, as defined by the parameters associated with each node • Includes: • Pre-training • In-situ training • Incremental training (through pre- or in-situ training) • Examples demonstrate how the framework can be used to describe existing types of multi-net system • However, it does not rely upon categorisation schemes
In-situ Learning • Evaluation of in-situ learning in multi-net systems • Explore parallel and sequential in-situ learning with definitions using the proposed framework • Simple learning ensemble (SLE) • Sequential learning modules (SLM) • Benchmark classification tasks28: • XOR • MONK’s problems (MONK 1, 2, 3)29 • Wisconsin Breast Cancer Database (WBCD)30 • Proben1 Thyroid (thyroid1 data set)31
Simple Learning Ensemble • Ensembles: • Train each network individually • Parallel combination of network outputs: mean output • Pre-training: how can we understand or control the combined performance of the ensemble? • Incremental: AdaBoost15 • In-situ: negative correlation23 • In-situ learning: • Train in-situ and assess combined performance during training using early stopping • Generalisation loss early stopping criterion32
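The generalisation loss criterion cited (Prechelt, 1996) can be sketched as follows; in SLE it would be applied per epoch to the ensemble's combined validation error, and the threshold value here is illustrative:

```python
def generalisation_loss(val_errors):
    """Prechelt's (1996) generalisation loss at the current epoch:
    GL(t) = 100 * (E_va(t) / E_opt(t) - 1), where E_opt is the lowest
    validation error seen so far."""
    e_opt = min(val_errors)
    return 100.0 * (val_errors[-1] / e_opt - 1.0)

def should_stop(val_errors, alpha=5.0):
    """Stop once validation error rises more than alpha percent above
    its best value (alpha = 5 is an illustrative threshold)."""
    return generalisation_loss(val_errors) > alpha

# Validation error falls, then rises: GL = 100*(0.22/0.20 - 1) = 10 > 5
history = [0.50, 0.30, 0.20, 0.22]
```

Monitoring the combined error, rather than each member's error, is what lets the ensemble's joint performance control when training ends.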
Benchmark Results • Compare SLE results with: • Simple ensemble (SE): all networks pre-trained using early stopping • Single-net: MLP with backpropagation – with and without early stopping • For SLE and SE: • From 2 to 20 MLP networks • 100 trials per configuration: mean response
Benchmark Results • XOR • SLE and SE equivalent training (no early stopping) • SLE uses fewer epochs to give equivalent responses • MONK’s problems/WBCD/Thyroid • SLE improves upon SE, SE improves upon single-net • SLE trains for longer: combined performance • Adding more networks gives better generalisation • With more networks, more trials achieve the desired performance • MONK 1/2 • SLE improves upon non-early-stopping single-net
Sequential Learning Modules • Sequential systems: • Can a combination of (simpler) networks give good generalisation and learning speed? • Typically pre-trained and for specific processing • Sequential in-situ learning: • How can in-situ learning be achieved with sequential networks: target error/output? • Use unsupervised networks • Last network has target output and hence can be supervised
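A minimal sketch of the sequential idea: an unsupervised SOM clusters the input, and a supervised single-layer (delta-rule) output maps the winning unit to a class. This is a simplification of SLM — the neighbourhood function is omitted, the map uses a common linear (grid) initialisation, and the map size, rates and epochs are illustrative:

```python
class TinySOM:
    """Minimal 2-D SOM sketch: units on a grid, initialised across the
    input space; only the winner is updated (neighbourhood omitted)."""
    def __init__(self, rows, cols):
        self.n = rows * cols
        # Unit (r, c) starts at (c - cols//2, r - rows//2): a 3x3 map
        # spans {-1, 0, 1} x {-1, 0, 1}
        self.w = [[float(c - cols // 2), float(r - rows // 2)]
                  for r in range(rows) for c in range(cols)]

    def bmu(self, x):
        """Index of the best-matching unit (closest weight vector)."""
        dists = [(w[0] - x[0]) ** 2 + (w[1] - x[1]) ** 2 for w in self.w]
        return dists.index(min(dists))

    def train_step(self, x, lr=0.3):
        b = self.bmu(x)
        self.w[b] = [wi + lr * (xi - wi) for wi, xi in zip(self.w[b], x)]

def one_hot(i, n):
    v = [0.0] * n
    v[i] = 1.0
    return v

# In-situ training: SOM and delta-rule output layer updated together.
# The last (supervised) layer sees only the SOM's winning unit.
som = TinySOM(3, 3)
out_w = [0.0] * som.n
data = [((1, 1), -1), ((1, -1), 1), ((-1, 1), 1), ((-1, -1), -1)]  # XOR
for _ in range(100):
    for x, t in data:
        som.train_step(x)                      # unsupervised step
        h = one_hot(som.bmu(x), som.n)
        y = sum(wi * hi for wi, hi in zip(out_w, h))
        out_w = [wi + 0.2 * (t - y) * hi       # supervised delta step
                 for wi, hi in zip(out_w, h)]

predict = lambda x: 1 if sum(
    wi * hi for wi, hi in zip(out_w, one_hot(som.bmu(x), som.n))) >= 0 else -1
```

Because the SOM's one-hot code places each XOR input on a different unit, the final linear layer can classify a problem neither module could solve alone, which is the point made on the XOR results slide.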
Benchmark Results • Compare SLM results with: • Single-net: single layer network with delta learning • Single-net: MLP with backpropagation – without early stopping • Networks: • Combine a SOM with a single layer network with delta learning • Varying map sizes of SOM used to see effect on classification performance • 100 trials per configuration: mean response
Benchmark Results • XOR • Cannot be solved by SOM or single layer network • Can be solved by SLM with 3x3 map • Faster learning times than MLP with backpropagation • MONK’s problems/WBCD • SLM can learn classification of training examples • Generalisation: improves upon SE with early stopping • MONK 2/3 • SLM improves upon MLP without early stopping • MONK 3 improves upon SLE
Benchmark Results • MONK 1/WBCD • Generalisation: similar, but slightly worse • Thyroid • Poor learning of training examples • Can SOM perform sufficient pre-processing for single layer network? • Results seem to depend upon problem type and map size
In-situ Learning • In-situ learning in multi-net systems: • Can give better training and generalisation performance • Comparison with SE (early stopping) and single-net systems (with and without early stopping) • Computational effort? • Ensemble systems: • Effect of training times on bias, variance and diversity? • Sequential systems: • Encouraging empirical results: theoretical results? • Automatic classification of unsupervised clusters
In-situ Learning and Simulation • Biological motivation for in-situ learning • Hebb’s ‘neural integration’ • Functional specialism in cognitive systems • Use in-situ learning in modular multi-net systems to simulate numerical abilities: • Quantification: subitization and counting • Addition: fact retrieval and ‘count all’ • Combine ME and SLM to allow abilities to ‘compete’
Multi-net Simulation of Quantification • Single-net and multi-net systems • Trained on scenes consisting of ‘objects’ • Logarithmic probability model • Subitization SOM: • Ordering of numbers with compressive scheme • Limit: training data, object frequency and map size • Counting MLP with backpropagation: • Correct responses to training data • Conventional, stable non-conventional and nonstable
Multi-net Simulation of Quantification • MNQ: • Subitization: SOM (pre-trained) • Single layer network with delta learning • Counting: MLP with backpropagation • Successfully learnt to quantify • To subitize or count • Decision based upon input • Subitization limit attributable to interaction of modules • Lower numbers subitized, higher numbers counted
Multi-net Simulation ofAddition • Single-net and multi-net systems • Trained on scenes consisting of two sets of ‘objects’ • Equal probability model • Fact retrieval SOM: • Each addend associated with a map axis • Some overlap of problems • Commutative information used • ‘Count all’ MLP with backpropagation: • Correct responses to training data • Some relationship to observed human errors
Multi-net Simulation ofAddition • MNA: • Fact retrieval: SOM • Single layer network with delta learning • ‘Count all’: MLP with backpropagation • Successfully learnt to add • To count or from facts • Decision based upon input • Predominant use of ‘count all’, rather than facts • Does demonstrate change in strategy
In-situ Learning and Simulation • In-situ learning in SLS: • Simulate developmental progression • At least as capable as equivalent monolithic solutions • Competition between supervised and unsupervised learning paradigms • In-situ learning and simulation of the interaction between multiple abilities: • Interaction between different abilities: simulating psychological phenomena • Integrated learning
Contribution • Multi-net systems: • Seem to offer empirical and theoretical improvement to generalisation • Properties under-explored • Framework for multi-net systems • Generalisation of multi-net systems beyond categorisation schemes • Foundation to explore multi-net properties • In-situ learning • Biological and theoretical motivation
Contribution • Compared different multi-net techniques and the use of in-situ learning • Demonstrated that in-situ learning can lead to improved training and generalisation • SLE • Assessment of combined performance • SLM • Combining ‘simple’ networks to solve ‘complex’ tasks • ‘Superordinate’? • Also shows automatic classification using SOM
Contribution • Integrated learning • Simulations using modular parallel and sequential networks • In-situ learning used to explore the interaction of modules during learning • Demonstrated simulation of abilities and observed psychological phenomena
Future Work • Framework: • Modify learning algorithm – recursive using tree • Explore properties such as bias, variance, diversity and VC Dimension • In-situ learning: • Further comparison • SLE: does learning promote diversity? • SLM: expand and explore limitations • Simulations: • Explore further the effect of integrated learning
References • Simpson, J.A. & Weiner, E.S.C. (Ed) (1989). Oxford English Dictionary, 2nd Ed. Oxford, UK: Clarendon Press. • Hebb, D.O. (1949). The Organization of Behavior: A Neuropsychological Theory. New York: John Wiley & Sons. • McClelland, J.L. & Rumelhart, D.E. (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 2: Psychological and Biological Models. Cambridge, MA.: A Bradford Book, MIT Press. • Rosenblatt, F. (1958). The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain. Psychological Review, vol. 65(6), pp. 386-408. • Widrow, B. & Hoff, M.E., Jr. (1960). Adaptive Switching Circuits. IRE WESCON Convention Record, pp. 96-104. • Minsky, M.L. & Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry, Expanded Ed. Cambridge, MA.: MIT Press. • Werbos, P.J. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. Unpublished doctoral thesis. Cambridge, MA.: Harvard University.
References • Rumelhart, D.E., Hinton, G.E. & Williams, R.J. (1986). Learning Internal Representations by Error Propagation. In Rumelhart, D. E. & McClelland, J. L. (Ed), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations, pp. 318-362. Cambridge, MA.: MIT Press. • Elman, J.L. (1990). Finding Structure in Time. Cognitive Science, vol. 14, pp. 179-211. • Kohonen, T. (1982). Self-Organized Formation of Topologically Correct Feature Maps. Biological Cybernetics, vol. 43, pp. 59-69. • Vapnik, V.N. & Chervonenkis, A.Ya. (1971). On the Uniform Convergence of Relative Frequencies of Events to their Probabilities. Theory of Probability and Its Applications, vol. XVI(2), pp. 264-280. • Baum, E.B. & Haussler, D. (1989). What Size Net Gives Valid Generalisation? Neural Computation, vol. 1(1), pp. 151-160. • Koiran, P. & Sontag, E.D. (1997). Neural Networks With Quadratic VC Dimension. Journal of Computer and System Sciences, vol. 54(1), pp. 190-198. • Geman, S., Bienenstock, E. & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, vol. 4(1), pp. 1-58.
References • Freund, Y. & Schapire, R.E. (1996). Experiments with a New Boosting Algorithm. Machine Learning: Proceedings of the 13th International Conference, pp. 148-156. Morgan Kaufmann. • Jacobs, R.A., Jordan, M.I. & Barto, A.G. (1991). Task Decomposition through Competition in a Modular Connectionist Architecture: The What and Where Vision Tasks. Cognitive Science, vol. 15, pp. 219-250. • Lu, B. & Ito, M. (1999). Task Decomposition and Module Combination Based on Class Relations: A Modular Neural Network for Pattern Classification. IEEE Transactions on Neural Networks, vol. 10(5), pp. 1244-1256. • Sharkey, A.J.C. (1999). Multi-Net Systems. In Sharkey, A. J. C. (Ed), Combining Artificial Neural Nets: Ensemble and Modular Multi-Net Systems, pp. 1-30. London: Springer-Verlag. • Sharkey, A.J.C. (2002). Types of Multinet System. In Roli, F. & Kittler, J. (Ed), Proceedings of the Third International Workshop on Multiple Classifier Systems (MCS 2002), pp. 108-117. Berlin, Heidelberg, New York: Springer-Verlag. • Liu, Y., Yao, X., Zhao, Q. & Higuchi, T. (2002). An Experimental Comparison of Neural Network Ensemble Learning Methods on Decision Boundaries. Proceedings of the 2002 International Joint Conference on Neural Networks (IJCNN'02), vol. 1, pp. 221-226. Los Alamitos, CA.: IEEE Computer Society Press.
References • Kuncheva, L.I. (2002). Switching Between Selection and Fusion in Combining Classifiers: An Experiment. IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 32(2), pp. 146-156. • Jordan, M.I. & Jacobs, R.A. (1994). Hierarchical Mixtures of Experts and the EM Algorithm. Neural Computation, vol. 6(2), pp. 181-214. • Liu, Y. & Yao, X. (1999a). Ensemble Learning via Negative Correlation. Neural Networks, vol. 12(10), pp. 1399-1404. • Bottou, L. & Gallinari, P. (1991). A Framework for the Cooperation of Learning Algorithms. In Lippmann, R.P., Moody, J.E. & Touretzky, D.S. (Ed), Advances in Neural Information Processing Systems, vol. 3, pp. 781-788. • Amari, S.-I. (1995). Information Geometry of the EM and em Algorithms for Neural Networks. Neural Networks, vol. 8(9), pp. 1379-1408. • Friedman, J.H. & Popescu, B. (2003). Importance Sampling: An Alternative View of Ensemble Learning. Presented at the 4th International Workshop on Multiple Classifier Systems (MCS 2003). Guildford, UK. • Jordan, M.I. & Xu, L. (1995). Convergence Results for the EM Approach to Mixtures of Experts Architectures. Neural Networks, vol. 8, pp. 1409-1431.
References • Blake, C.L. & Merz, C.J. (1998). UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html. Irvine, CA.: University of California, Irvine, Department of Information and Computer Sciences. • Thrun, S.B., Bala, J., Bloedorn, E., Bratko, I., Cestnik, B., Cheng, J., De Jong, K., Dzeroski, S., Fahlman, S.E., Fisher, D., Hamann, R., Kaufman, K., Keller, S., Kononenko, I., Kreuziger, J., Michalski, R.S., Mitchell, T., Pachowicz, P., Reich, Y., Vafaie, H., van de Welde, W., Wenzel, W., Wnek, J. & Zhang, J. (1991). The MONK's Problems: A Performance Comparison of Different Learning Algorithms. Technical Report CMU-CS-91-197. Pittsburgh, PA.: Carnegie-Mellon University, Computer Science Department. • Wolberg, W.H. & Mangasarian, O.L. (1990). Multisurface Method of Pattern Separation for Medical Diagnosis Applied to Breast Cytology. Proceedings of the National Academy of Sciences, USA, vol. 87(23), pp. 9193-9196. • Prechelt, L. (1994). Proben1: A Set of Neural Network Benchmark Problems and Benchmarking Rules. Technical Report 21/94. Karlsruhe, Germany: University of Karlsruhe. • Prechelt, L. (1996). Early Stopping - But When? In Orr, G. B. & Müller, K-R. (Ed), Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, vol. 1524, pp. 55-69. Berlin, Heidelberg, New York: Springer-Verlag.