Very large data sets. Clustering methods: Part 10. Pasi Fränti, 5.5.2014. Speech and Image Processing Unit, School of Computing, University of Eastern Finland.
Methods for large data sets • BIRCH • CLARANS • On-line EM • Scalable EM • GMG (the method studied here; no material for the others)
Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]
Goal of the GMG algorithm [Figure: two panels labeled GMG and EM]
Model update • New data points are mapped to the model immediately as they are input. • Points that are too far from every model remain in the buffer. • Buffered points are re-tested when new models are created. [Figure: before update / after update]
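The update rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the paper: each component is reduced to a centroid and a point count, "too far" is taken to mean a Euclidean distance above a hypothetical threshold `max_dist`, and all function names (`nearest`, `absorb`, `process_point`, `retest_buffer`) are invented for this example.

```python
import math

def nearest(point, components):
    """Return (index, distance) of the closest component centroid."""
    best_i, best_d = -1, math.inf
    for i, comp in enumerate(components):
        d = math.dist(point, comp["mean"])
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d

def absorb(comp, point):
    """Fold one point into a component with a running-mean update."""
    comp["n"] += 1
    comp["mean"] = [m + (x - m) / comp["n"] for m, x in zip(comp["mean"], point)]

def process_point(point, components, buffer, max_dist):
    """Map the point to a component if it is close enough, else buffer it.

    Assumption: closeness is a plain Euclidean threshold (max_dist).
    """
    i, d = nearest(point, components)
    if i >= 0 and d <= max_dist:
        absorb(components[i], point)
        return True
    buffer.append(point)
    return False

def retest_buffer(components, buffer, max_dist):
    """Re-test every buffered point, e.g. after new components were created."""
    pending = list(buffer)
    buffer.clear()
    for p in pending:
        process_point(p, components, buffer, max_dist)
```

A point is processed once and then discarded or buffered, so memory use depends on the number of components and the buffer size rather than on the size of the data set.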
Generating new components • When the buffer is full, selected points are used to generate new components. • The most compact k-neighborhood is selected as the seed for a new component. [Figure: data in buffer / selected points and a new component]
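One possible reading of the seed selection is sketched below. The slide does not define compactness, so this sketch assumes it is the sum of Euclidean distances from a point to its k nearest buffered neighbors; the new component is summarized by the mean of the selected points, and the selected points are removed from the buffer. These choices and the names `most_compact_neighborhood` and `seed_component` are assumptions made for illustration.

```python
import math

def most_compact_neighborhood(buffer, k):
    """Return indices of the point and its k nearest neighbors with the
    smallest total distance (assumed compactness measure), or None."""
    best, best_cost = None, math.inf
    for i, p in enumerate(buffer):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(buffer) if j != i)
        neighbors = dists[:k]
        if len(neighbors) < k:
            continue  # not enough buffered points for a k-neighborhood
        cost = sum(d for d, _ in neighbors)
        if cost < best_cost:
            best_cost = cost
            best = [i] + [j for _, j in neighbors]
    return best

def seed_component(buffer, k):
    """Create a component from the most compact k-neighborhood and
    remove its points from the buffer (removal is an assumption)."""
    idx = most_compact_neighborhood(buffer, k)
    if idx is None:
        return None
    pts = [buffer[i] for i in idx]
    dim = len(pts[0])
    mean = [sum(p[d] for p in pts) / len(pts) for d in range(dim)]
    for i in sorted(idx, reverse=True):
        del buffer[i]
    return {"mean": mean, "n": len(pts)}
```

The search is quadratic in the buffer size, which is acceptable here because the buffer is bounded and much smaller than the full data set.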
Post-processing [Figures: model before processing; updated model; updated model + data]
Literature • I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition, 40 (3), 784-795, March 2007. • P. Bradley, U. Fayyad and C. Reina, "Clustering very large databases using EM mixture models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 76-80, 2000. • R. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining", IEEE Trans. Knowledge & Data Engineering, 14 (5), 1003-1016, 2002. • M. Sato and S. Ishii, "On-line EM algorithm for the normalized Gaussian network", Neural Computation, 12 (2), 407-432, 2000. • T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 141-182, 1997.