Extensions of vector quantization for incremental clustering. Edwin Lughofer, PR, Vol. 41, 2008, pp. 995–1011. Presenter: Wei-Shen Tai, 2011/1/19
Outline • Introduction • Vector quantization • Extensions of vector quantization • Evaluation • Conclusion and outlook • Comments
Motivation • Incremental clustering processes • Online measurements are often recorded continuously, producing data streams in various applications. • Clustering must run online so that queries reflect up-to-date data and results are returned with a small time delay.
Objective • An incremental and evolving vector quantization • Processes data streams in an on-line clustering scheme. • Omits pre-definition of the number of clusters and improves the quality of cluster partitions with several strategies.
Vector quantization 1. Choose initial values for the C cluster centers. 2. Fetch the next data sample from the data set. 3. Calculate the distance of the selected data point to all cluster centers. 4. Select the cluster center that is closest to the data point. 5. Update the p components of the winning cluster center by moving it towards the selected point. 6. If the data set contains data points that have not been processed through steps 2–5, go to step 2. 7. If any cluster center moved significantly in the last iteration, say by more than a given threshold, reset the pointer to the beginning of the data buffer and go to step 2; otherwise stop.
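The batch procedure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function name `vq`, the learning gain `eta`, the movement threshold `eps` and the safety cap `max_epochs` are assumptions introduced here, and the Euclidean distance is assumed for picking the winner.

```python
import math

def vq(data, centers, eta=0.1, eps=1e-3, max_epochs=100):
    """Batch vector quantization (illustrative sketch).

    eta, eps and max_epochs are hypothetical parameters chosen for this
    example; they are not values taken from the slides.
    """
    centers = [list(c) for c in centers]            # initial centers, copied
    for _ in range(max_epochs):
        before = [list(c) for c in centers]
        for x in data:                              # sweep the whole data buffer
            # distance to all centers, then pick the closest one (the winner)
            win = min(range(len(centers)),
                      key=lambda i: math.dist(x, centers[i]))
            # move the winning center towards the sample
            centers[win] = [c + eta * (xj - c)
                            for c, xj in zip(centers[win], x)]
        # stop once no center moved by more than eps during the last sweep
        moved = max(math.dist(b, c) for b, c in zip(before, centers))
        if moved <= eps:
            break
    return centers
```

Calling `vq(samples, initial_centers)` returns the adapted centers and leaves both inputs unchanged. The number of clusters is fixed by the length of `initial_centers`, which is the limitation the incremental variant removes.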
Vector quantization in incremental mode • Stability/plasticity dilemma in ART-2 • A vigilance parameter ρ controls the tradeoff between adapting already learned clusters (stability) and generating new clusters (plasticity). • Differences between VQ and VQ-INC • The starting number of clusters is zero. • If the distance between the incoming input x and the closest cluster center cwin is larger than ρ and x is not faulty, a new cluster is created. Otherwise, cwin is updated to move towards x. • The ranges of all p variables are updated if x is not faulty. In addition, the learning gain η decreases monotonically with the number of data points belonging to each cluster.
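A minimal single-pass sketch of the vigilance rule described on this slide follows. The class name `IncrementalVQ` is hypothetical, and the gain `eta = 1 / k` (with k the number of samples in the winning cluster) is only one possible monotonically decreasing choice, assumed here for illustration. The check for faulty samples and the update of the variable ranges are left out to keep the sketch short.

```python
import math

class IncrementalVQ:
    """Single-pass VQ with a vigilance threshold rho (illustrative sketch)."""

    def __init__(self, rho):
        self.rho = rho
        self.centers = []   # the model starts with zero clusters
        self.counts = []    # number of samples assigned to each cluster

    def update(self, x):
        """Process one sample and return the index of its cluster."""
        x = list(x)
        if self.centers:
            win = min(range(len(self.centers)),
                      key=lambda i: math.dist(x, self.centers[i]))
            if math.dist(x, self.centers[win]) <= self.rho:
                self.counts[win] += 1
                # assumed gain: decreases as the cluster collects samples
                eta = 1.0 / self.counts[win]
                self.centers[win] = [c + eta * (xj - c)
                                     for c, xj in zip(self.centers[win], x)]
                return win
        # no cluster yet, or the closest one is farther away than rho
        self.centers.append(x)
        self.counts.append(1)
        return len(self.centers) - 1
```

With this particular gain each center is the running mean of its samples. A larger `rho` favors stability (fewer, wider clusters), a smaller `rho` favors plasticity (more new clusters).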
An alternative distance strategy • Both ‘over-clustering’ and incorrect partitioning of the input space can occur in VQ-INC. • Instead of the classic Euclidean distance to the cluster centers, VQ-INC-EXT takes the ranges of influence of all clusters into account by measuring the distance to each cluster's surface along the direction towards the cluster center.
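One way to picture the surface distance is the sketch below. It assumes that each cluster is an axis-parallel ellipsoid whose semi-axes are its ranges of influence; the slide does not state the shape explicitly, so this is an assumption of the example. The function name `surface_distance` and the choice to return zero for samples inside the ellipsoid are also illustrative.

```python
import math

def surface_distance(x, center, ranges):
    """Distance from x to the cluster surface along the ray towards the center.

    Assumes an axis-parallel ellipsoid with semi-axes `ranges` (all > 0).
    Samples inside the ellipsoid get distance 0.0 (a choice made for this
    sketch).
    """
    diff = [xj - cj for xj, cj in zip(x, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    # scaled == 1 exactly on the surface, < 1 inside, > 1 outside
    scaled = math.sqrt(sum((d / r) ** 2 for d, r in zip(diff, ranges)))
    if scaled <= 1.0:
        return 0.0
    # the ray from the center to x crosses the surface at a fraction
    # 1 / scaled of its length, so the rest of the ray lies outside
    return norm * (1.0 - 1.0 / scaled)
```

For a cluster centered at the origin with ranges (1, 2), the samples (3, 0) and (0, 4) both lie 2 units from the surface, although their Euclidean distances to the center differ.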
Satellite deletion • Cluster satellites • Undesirable tiny clusters that lie very close to significantly bigger ones. • Identify outliers and satellites • If ki/N < 1%, cluster i is regarded as an outlier cluster. • If ki/N < low_mass and ci lies inside the range of influence of any other cluster, select the closest center cwin. • Calculate the distance of ci to the surface of all other clusters.
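The two mass tests on this slide can be sketched as a labeling pass over the clusters. Several things here are assumptions made for illustration: ki is read as the number of samples in cluster i and N as the total number of samples, `low_mass=0.05` is an arbitrary default, "inside the range of influence" is tested against an axis-parallel ellipsoid, and the closest host is picked by plain Euclidean distance between centers as a stand-in for the surface distance. The function name `classify_clusters` is hypothetical.

```python
import math

def classify_clusters(centers, ranges, counts, low_mass=0.05, outlier_mass=0.01):
    """Label each cluster as 'outlier', 'satellite' or 'regular'.

    Returns a list of (label, host) pairs; host is the index of the closest
    cluster containing a satellite's center, otherwise None.
    """
    def inside(p, c, r):
        # p lies within the axis-parallel ellipsoid around c with semi-axes r
        return sum(((pj - cj) / rj) ** 2 for pj, cj, rj in zip(p, c, r)) <= 1.0

    n_total = sum(counts)
    if n_total == 0:
        return []
    labels = []
    for i, (c_i, k_i) in enumerate(zip(centers, counts)):
        mass = k_i / n_total
        if mass < outlier_mass:
            labels.append(("outlier", None))
            continue
        hosts = [j for j in range(len(centers))
                 if j != i and inside(c_i, centers[j], ranges[j])]
        if mass < low_mass and hosts:
            win = min(hosts, key=lambda j: math.dist(c_i, centers[j]))
            labels.append(("satellite", win))
        else:
            labels.append(("regular", None))
    return labels
```

What happens to a detected satellite afterwards (for example, removing it or folding it into its host) is not spelled out on the slide and is therefore left to the caller.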
A split-and-merge strategy • Parameter ρ • Cannot be known in advance, and a bad setting may cause an incorrect cluster structure. • Non-optimal clustering • Prevented by merging clusters that have grown together or by splitting big clusters that contain more than one distinct data cloud. • Calculate the quality of the cluster partition in three phases: before the split, after the split (p results), and after the merge. Then pick the best cluster partition to replace the existing one.
Conclusion and outlook • A new extended vector quantization (VQ-INC-EXT) • Can be applied to data streams in fast online applications or to huge databases. • Provides an incremental learning scheme and incorporates a new distance measure, satellite deletion and an online split-and-merge strategy. • Outlook • The split-and-merge strategy may suffer from slow computation. • Reacting to drifts or shifts in the data: drifts change the distribution of the underlying data smoothly over time; shifts trigger abrupt and sudden changes of the data characteristics.
Comments • Advantage • The proposed method extends VQ to an incremental learning VQ and simultaneously adds several strategies that improve the quality of the cluster partition. • Data streams can be processed effectively by this on-line learning VQ. • Drawback • In Algorithm 3, whenever the new distance strategy is applied, the vector of the winning cluster is updated by Eq. (1) according to the Manhattan distance between the winning cluster and the input. • Application • On-line learning from data streams.