Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery Authors: Alahakoon and Halgamuge Advisor: Dr. Hsu Graduate: Yu-Wei, Su Intelligent Database System Lab, IDSL
Outline • Motivation • Objective • Introduction • Self-generating feature maps for data mining • GSOM algorithm • Advantages of GSOM over others • Knowledge discovery by hierarchical clustering of GSOM • Experiment • Conclusion • Opinion
Motivation • Predetermining the size and number of nodes in a SOM significantly limits the final mapping • Because of this limitation, the user cannot tell in advance whether the chosen structure will represent the data well
Objective • To determine the shape as well as the size of the network during the training of the network
Introduction • In theory, the SOM in its original form does not provide complete topology preservation • Only at the completion of the simulation does it become clear that a different-sized network would have been more appropriate for the application • Having to predetermine the size and number of nodes is a significant limitation of the SOM
Introduction (cont.) • Goal: determine the shape as well as the size of the network during training • The need for a measure for controlling the growth of the GSOM is highlighted
Self-generating feature maps for data mining • Growing Cell Structures (GCS) [Fritzke, 1991] • Neural Gas Algorithm [Martinetz and Schulten, 1991] • Incremental Grid Growing (IGG) [Blackmore, 1995]
GSOM algorithm • The GSOM is initialized with four nodes and grows new nodes to represent the input data • The weight values of the nodes self-organize by a method similar to that of the SOM
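The four-node start can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-D grid layout, the dictionary-based map, the function name `init_gsom`, and the assumption that inputs are scaled to [0, 1) are all choices made here for the example.

```python
import numpy as np

def init_gsom(dim, growth_threshold, seed=0):
    """Create the four starting nodes of a GSOM on a 2-D grid (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Grid position -> weight vector. Random weights in [0, 1) assume
    # that the input data have been scaled to the same range.
    weights = {(x, y): rng.random(dim) for x in (0, 1) for y in (0, 1)}
    # Accumulated error per node starts at zero.
    errors = {pos: 0.0 for pos in weights}
    return {"weights": weights, "errors": errors, "gt": growth_threshold}

gsom = init_gsom(dim=3, growth_threshold=1.5)
```

A dictionary keyed by grid position is used because the map has no fixed shape: new positions can be added later without resizing an array.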
GSOM algorithm (cont.) • Initialization phase • Initialize the weight vectors of the nodes with random numbers • Calculate the growth threshold (GT) according to the user requirements • Growing phase
GSOM algorithm (cont.) • Increase the error value of the winner • When the total error TE_i of winner i exceeds GT, grow nodes if i is a boundary node • IF [condition lost] THEN [update lost] ELSE remains unchanged
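The growth trigger can be sketched as below. The slide's error formula did not survive, so this example assumes the winner's error grows by the squared Euclidean distance to the input, and it treats a node as a boundary node when at least one of its four immediate grid positions is free. The function names are hypothetical.

```python
import numpy as np

NEIGHBOR_OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def find_winner(weights, x):
    """Return the grid position whose weight vector is closest to x."""
    return min(weights, key=lambda pos: np.sum((weights[pos] - x) ** 2))

def free_positions(weights, pos):
    """Immediate grid positions around pos that hold no node yet."""
    px, py = pos
    return [(px + dx, py + dy) for dx, dy in NEIGHBOR_OFFSETS
            if (px + dx, py + dy) not in weights]

def accumulate_error(weights, errors, x, gt):
    """Add the winner's error for input x and report whether growth should trigger."""
    winner = find_winner(weights, x)
    # Assumption: the error increment is the squared Euclidean distance.
    errors[winner] += float(np.sum((weights[winner] - x) ** 2))
    is_boundary = len(free_positions(weights, winner)) > 0
    return winner, bool(errors[winner] > gt and is_boundary)

weights = {(0, 0): np.array([0.0, 0.0]), (0, 1): np.array([0.0, 1.0]),
           (1, 0): np.array([1.0, 0.0]), (1, 1): np.array([1.0, 1.0])}
errors = {pos: 0.0 for pos in weights}
winner, grow = accumulate_error(weights, errors, np.array([2.0, 2.0]), gt=1.5)
```

In this example the input lies outside the initial square, so the nearest corner node accumulates enough error to exceed GT and, being a boundary node, is selected for growth.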
GSOM algorithm (cont.) • New node generation • If a node is selected for growth, new nodes are grown in all of its free neighboring positions • Weight initialization of new nodes • IF [condition lost] THEN [formula lost] • IF [condition lost] THEN [formula lost]
GSOM algorithm (cont.) • IF [condition lost] THEN [formula lost] • IF [condition lost] THEN [formula lost] • w_new = m, where m = (r1 + r2)/2 and r1, r2 are the lower and upper values of the range of the weight vector distribution
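The conditions and formulas on these slides were lost, so the sketch below is only one plausible reading, not the slides' content. It assumes three cases: extrapolating from two older nodes on one side (w_new = 2 x w1 - w2), averaging when the new node lies between two older nodes, and falling back to the mid-range value m = (r1 + r2)/2 stated above when no second node is available. The function name and parameters are hypothetical.

```python
import numpy as np

def init_new_weight(w1, w2=None, between=False, value_range=(0.0, 1.0)):
    """Initialize the weight vector of a newly grown node.

    w1: weight of the older node adjacent to the new node.
    w2: weight of a second older node, or None if there is none.
    between: True when the new node lies between the nodes of w1 and w2.
    """
    w1 = np.asarray(w1, dtype=float)
    r1, r2 = value_range
    if w2 is None:
        # No second node to use: take the mid-range value m = (r1 + r2) / 2.
        return np.full_like(w1, (r1 + r2) / 2.0)
    w2 = np.asarray(w2, dtype=float)
    if between:
        # Assumed case: new node between two older nodes, so interpolate.
        return (w1 + w2) / 2.0
    # Assumed case: continue the trend from w2 through w1 (linear extrapolation).
    return 2.0 * w1 - w2

extrapolated = init_new_weight([0.4, 0.4], [0.6, 0.2])
interpolated = init_new_weight([0.4, 0.4], [0.6, 0.2], between=True)
fallback = init_new_weight([0.4, 0.4])
```

The intent in every case is that a new node starts with weights that fit in with its neighborhood instead of random values.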
GSOM algorithm (cont.) • Distribute weights to neighbors if i is a nonboundary node
GSOM algorithm (cont.) • Distribute the error to the neighboring nodes, where γ is the factor of distribution (FD), 0 < FD < 1 • Initialize the learning rate (LR) to its starting value • Repeat the steps until all inputs have been presented and node growth is reduced to a minimum level • Smoothing phase • Occurs after the growing phase
GSOM algorithm (cont.) • The purpose is to smooth out any existing quantization error • No new nodes are added during this phase • The LR in this phase is lower than in the growing phase, since the weight values should not fluctuate too much
GSOM algorithm (cont.) • The neighborhood is constrained to the immediate neighborhood only • No node growth
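A smoothing pass along these lines might look like the sketch below. It is an illustration under assumptions: the winner and its immediate grid neighbors are adapted with the same small learning rate, the rate decays by a factor `alpha` per epoch, and the function name and default values are chosen here for the example.

```python
import numpy as np

NEIGHBOR_OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def smooth_phase(weights, data, lr=0.05, alpha=0.9, epochs=5):
    """Fine-tune existing weights with a small learning rate; never adds nodes."""
    for _ in range(epochs):
        for x in data:
            winner = min(weights, key=lambda p: np.sum((weights[p] - x) ** 2))
            wx, wy = winner
            # Only the winner and its immediate neighborhood are adapted.
            targets = [winner] + [(wx + dx, wy + dy) for dx, dy in NEIGHBOR_OFFSETS]
            for pos in targets:
                if pos in weights:  # free positions are skipped, so no growth
                    weights[pos] += lr * (x - weights[pos])
        lr *= alpha  # assumed decay schedule
    return weights

rng = np.random.default_rng(0)
weights = {(x, y): rng.random(2) for x in (0, 1) for y in (0, 1)}
data = rng.random((20, 2))
before = set(weights)
smooth_phase(weights, data)
```

Because each update moves a weight a fraction of the way toward an input, the weights stay inside the range spanned by the initial weights and the data.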
Advantages of GSOM over others • Learning rate adaptation • In the GSOM, the small number of nodes at the beginning causes a problem for learning rate reduction • The problem can be reduced when the input data are ordered, but ordering is not available in unsupervised learning • The solution: replace LR(t+1) = α x LR(t), 0 < α < 1, with LR(t+1) = α x ψ(n) x LR(t), where ψ(n) = 1 - R/n(t) can be used
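The modified rule is easy to check numerically. In this sketch n(t) is taken to be the current number of nodes, and the values α = 0.9 and R = 3.8 are assumptions made for the example, not values given on the slide.

```python
def next_learning_rate(lr, n_nodes, alpha=0.9, r=3.8):
    """LR(t+1) = alpha * psi(n) * LR(t), with psi(n) = 1 - R / n(t)."""
    psi = 1.0 - r / n_nodes
    return alpha * psi * lr

lr_small_map = next_learning_rate(0.1, n_nodes=4)    # few nodes: psi = 0.05
lr_large_map = next_learning_rate(0.1, n_nodes=100)  # many nodes: psi = 0.962
```

With only four nodes, ψ(n) is close to zero, so the learning rate falls much faster than it does once the map has grown and ψ(n) approaches 1.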
Advantages of GSOM over others (cont.) • Localized neighborhood weight adaptation • During SOM training, the neighborhood starts large and shrinks linearly to one node • The GSOM does not require this, since the weights of new nodes are initialized to fit in with the existing neighborhood weights • The GSOM requires only a small neighborhood • Therefore, during the growing phase, the GSOM initializes the LR and Nk to their starting values at each new input
Advantages of GSOM over others (cont.) • Error distribution of nonboundary nodes • The distribution spreads the error outwards from the high-error node • This ripple effect travels outwards and causes a boundary node to increase its error value
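One way to picture the distribution step is the sketch below. The update rule is an assumption, since the slides do not state it: the nonboundary winner's error is reset to GT/2 and each immediate neighbor's error is increased by the fraction FD of its own current value. The function name is hypothetical.

```python
NEIGHBOR_OFFSETS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def distribute_error(errors, winner, gt, fd=0.5):
    """Spread error from a nonboundary winner to its immediate neighbors."""
    # Assumption: the winner's accumulated error drops to half the threshold.
    errors[winner] = gt / 2.0
    wx, wy = winner
    for dx, dy in NEIGHBOR_OFFSETS:
        pos = (wx + dx, wy + dy)
        if pos in errors:
            # Assumption: each neighbor gains FD times its own current error.
            errors[pos] += fd * errors[pos]
    return errors

errors = {(1, 1): 2.0, (0, 1): 1.0, (2, 1): 0.4, (1, 0): 0.0, (1, 2): 0.2}
distribute_error(errors, winner=(1, 1), gt=1.5, fd=0.5)
```

Repeated over many inputs, neighbors that keep receiving error eventually exceed GT themselves, which is how the error can ripple outwards until a boundary node grows.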
Knowledge discovery by hierarchical clustering of GSOM • The spread-out factor (SF) • GT is the threshold that decides when to initiate new node growth • A large GT results in a map with fewer nodes, which gives an abstract picture of the data; a small GT results in a map with more nodes
Knowledge discovery by hierarchical clustering of GSOM (cont.) • But TE is sensitive to the dimension of the data and the number of hits, so it is hard to choose GT • The SF can be used to control and calculate the GT: GT = D x f(SF), 0 < SF < 1, where D is the dimension of the data
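The slide leaves f(SF) unspecified. The sketch below assumes f(SF) = -ln(SF), a choice made here for illustration because it gives a large GT for a small SF and a GT near zero as SF approaches 1; the function name is hypothetical.

```python
import math

def growth_threshold(dim, sf):
    """GT = D * f(SF), assuming f(SF) = -ln(SF) for 0 < SF < 1."""
    if not 0.0 < sf < 1.0:
        raise ValueError("SF must lie strictly between 0 and 1")
    return -dim * math.log(sf)

gt_abstract = growth_threshold(dim=18, sf=0.1)  # low SF: high GT, fewer nodes
gt_detailed = growth_threshold(dim=18, sf=0.9)  # high SF: low GT, more nodes
```

Expressing the threshold through SF lets the user request a coarser or finer map with one number in (0, 1) that does not depend on the data dimension.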
Experiment • The spread of the GSOM with increasing SF values • Zoo data set with 18 attributes and 99 tuples
Experiment (cont.) • Figure label: insects
Experiment (cont.) • Figure labels: fish, airborne bird, nonpredatory mammal, nondomestic mammal, non-airborne bird
Experiment (cont.) • Hierarchical clustering of interesting clusters • Figure labels: small size, without tail
Experiment (cont.) • The GSOM for a high-dimensional human genetic data set • The data have 43 dimensions • Genetic information is derived from blood samples • Genetic distance between populations is measured with Fst • Fst uses a form of normalization to account for frequencies that are not normally distributed
Conclusion • The shape of the GSOM represents the grouping in the data and draws attention to areas that merit further investigation • The GSOM requires fewer nodes than the SOM, which results in faster processing • Hierarchical clustering enables more detailed investigation of the data
Opinion • Quantization error is a widely used measure in SOMs • This is a classic paper • Does error distribution at nonboundary nodes destroy topology preservation?