
Very large data sets

Very large data sets. Speech and Image Processing Unit, School of Computing, University of Eastern Finland. Clustering methods: Part 10. Pasi Fränti, 5.5.2014. Methods for large data sets: BIRCH, CLARANS, On-line EM, Scalable EM, GMG.


Presentation Transcript


  1. Very large data sets. Speech and Image Processing Unit, School of Computing, University of Eastern Finland. Clustering methods: Part 10. Pasi Fränti, 5.5.2014

  2. Methods for large data sets • BIRCH • CLARANS • On-line EM • Scalable EM • GMG (let's study this one; no material for the others)

  3. Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]

  4. Goal of the GMG algorithm [figure panels: GMG, EM]

  5. Contours of probability density distributions [figure panels: GMG, EM]

  6. Model update • New data points are mapped immediately when input. • Points too far from every model remain in the buffer. • Buffered points are re-tested when new models are created. [figure panels: before update, after update]
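The update rule on slide 6 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the implementation from the GMG paper: the names `Component`, `Model` and `max_dist` are hypothetical, each component is simplified to a spherical Gaussian with a fixed variance, and "too far" is taken to mean a standard-deviation-scaled distance above `max_dist`.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Component:
    """Spherical Gaussian kept as running statistics (a simplifying assumption)."""
    mean: list
    var: float        # shared per-dimension variance, kept fixed in this sketch
    count: int = 1    # number of points summarised so far

    def distance(self, x):
        # Euclidean distance scaled by the component's standard deviation.
        sq = sum((a - m) ** 2 for a, m in zip(x, self.mean))
        return math.sqrt(sq / self.var)

    def absorb(self, x):
        # Incremental mean update; the point itself is not stored.
        self.count += 1
        self.mean = [m + (a - m) / self.count for m, a in zip(self.mean, x)]


@dataclass
class Model:
    max_dist: float = 3.0   # hypothetical "too far" threshold
    components: list = field(default_factory=list)
    buffer: list = field(default_factory=list)

    def _try_map(self, x):
        # Map x to its nearest component if that component is close enough.
        if not self.components:
            return False
        best = min(self.components, key=lambda c: c.distance(x))
        if best.distance(x) > self.max_dist:
            return False
        best.absorb(x)
        return True

    def add_point(self, x):
        # New points are mapped immediately; distant ones wait in the buffer.
        if not self._try_map(x):
            self.buffer.append(x)

    def add_component(self, comp):
        # A new component triggers a re-test of every buffered point.
        self.components.append(comp)
        still_far = []
        for x in self.buffer:
            if not self._try_map(x):
                still_far.append(x)
        self.buffer = still_far


model = Model(max_dist=3.0)
model.add_component(Component(mean=[0.0, 0.0], var=1.0))
model.add_point([1.0, 0.0])      # close enough: absorbed by the first component
model.add_point([10.0, 10.0])    # too far: stays in the buffer
buffered_before = len(model.buffer)
model.add_component(Component(mean=[10.0, 9.0], var=1.0))  # buffer is re-tested
```

Because a mapped point only updates the running statistics and is then discarded, memory use in this sketch depends on the number of components and the buffer size rather than on the number of points seen, which is the property a single-pass method needs.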

  7. Generating new components • When the buffer is full, selected points are used to generate new components. • The most compact k-neighborhood is selected as the seed for a new component. [figure panels: data in buffer; selected points and a new component]
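One way to read "most compact k-neighborhood" on slide 7 is sketched below. The compactness measure (the sum of squared distances from a point to its k nearest neighbours), the brute-force search, and the function names `most_compact_neighborhood` and `seed_component` are all assumptions made for illustration; the slides do not specify them.

```python
import math


def most_compact_neighborhood(points, k):
    """Return the point and its k nearest neighbours with the smallest
    total squared distance (assumed compactness measure)."""
    if len(points) <= k:
        raise ValueError("need more than k points in the buffer")
    best_cost, best_group = math.inf, None
    for i, p in enumerate(points):
        # Squared distances from p to every other buffered point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        nearest = dists[:k]
        cost = sum(d for d, _ in nearest)
        if cost < best_cost:
            best_cost = cost
            best_group = [p] + [points[j] for _, j in nearest]
    return best_group


def seed_component(group):
    """Mean and pooled per-dimension variance of the selected points."""
    n, dim = len(group), len(group[0])
    mean = [sum(p[d] for p in group) / n for d in range(dim)]
    var = sum((p[d] - mean[d]) ** 2 for p in group for d in range(dim)) / (n * dim)
    return mean, var


buffer = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [20.0, -5.0], [-15.0, 8.0]]
group = most_compact_neighborhood(buffer, k=2)
mean, var = seed_component(group)
```

The brute-force search costs O(n²) distance computations per call, which is acceptable here only because the buffer is bounded in size.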

  8-13. Example [six figure-only slides]

  14. Post-processing [figure: model before processing]

  15. Post-processing [figures: model before processing; updated model]

  16. Post-processing [figures: model before processing; updated model + data]

  17. Literature
     • I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition, 40 (3), 784-795, March 2007.
     • P. Bradley, U. Fayyad and C. Reina, "Clustering very large databases using EM mixture models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 76-80, 2000.
     • R. Ng and J. Han, "CLARANS: A method for clustering objects for spatial data mining", IEEE Trans. Knowledge & Data Engineering, 14 (5), 1003-1016, 2002.
     • M. Sato and S. Ishii, "On-line EM algorithm for the normalized Gaussian network", Neural Computation, 12 (2), 407-432, 2000.
     • T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A new data clustering algorithm and its applications", Data Mining and Knowledge Discovery, 1 (2), 141-182, 1997.
