1 / 30

Interactive Clustering Overview & Tools for Data Analysis

Explore methods like agglomerative clustering, k-means, and VERI for grouping data interactively. Learn concepts like partition space structuring and explore software modeling tools for visualizing clusters. Enhance your data analysis skills!

johnbadams
Download Presentation

Interactive Clustering Overview & Tools for Data Analysis

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Interactive ClusteringOverview and Tools Wayne Oldford University of Waterloo CASI 2001

  2. Overview 1. Finding groups in data 2. Interactive data analysis 3. Enlarging the problem 4. Putting it together 5. Software modelling (illustration) 6. Summary CASI 2001

  3. 1.Finding groups in data • Objects to be grouped together • locations • pairwise (dis)similarities • Web documents as objects to be grouped • Applications: • Building groups to use later as classification • Building groups to serve as templates • Building groups to understand/model CASI 2001

  4. group definition is a problem • Group definition (like with like) • homogeneous vs heterogeneous • part of pattern CASI 2001

  5. Clustering approaches • Agglomerative (near points/clusters are joined) • Single linkage • Complete linkage • Average linkage • Recursive splitting • e.g. minimal spanning tree CASI 2001

  6. Cluster hierarchies • Clusters are nested • Often represented as a tree (dendrogram) • Join/split history and ‘strength’ preserved CASI 2001

  7. Other approaches • k-means • assign points to k groups • re-assign to improve objective function • model-based • likelihood/Bayesian; model search/averaging • density estimation • groups = high-density regions • classification to cluster • visually motivated methods CASI 2001

  8. Visual Empirical Regions of Influence (VERI) CASI 2001

  9. Notes • many choices • between and within methods • built-in biases for shapes • computationally costly • O(n2) ... Conceptual model: algorithmic, run to completion CASI 2001

  10. typical software • resources dedicated to numerical computation • teletype interaction • runs to completion • graphical “output” Compare to interactive data analysis CASI 2001

  11. 2. Interactive data analysis CASI 2001

  12. Interactive data analysis • exploratory, tentative • graphical • non-algorithmic • varied granularity • integrated • deep interaction CASI 2001

  13. 3. Enlarging the problem Mutually exclusive and exhaustive groups g1, g2, …, gk form a partition P = {g1, g2, …, gk} of the set of data objects. Goal: Explore the space of possible partitions. CASI 2001

  14. Structuring the partition space PA={g1, g2, …, ga} and PB={h1, h2, …, hb} • When a > b, PA call a finerpartition than PB. • PA is called a refinement of PB (or PB a reduction of PA) • PA is nested in PB only if a > b and every gi is a subset of a single hj - write PA } PB or PB { PA • When a = b, PA is called a reassignment of PB CASI 2001

  15. Reduction P1={g1, ..., g6} -> P2={h1, …, h4} -> P3={m1, m2, m3} • nesting: P1 } P2 • hi = gi i = 1, 2 ; h3= join (g3, g4) ; h4= join (g5, g6) • disperse elements of h4 over hi i = 1, 2, 3 to give mi for i = 1, 2, 3. • split (h4) = {h1*, h2*, h3*}; mi = join (hi*, hi ) • P2 } P3 is false CASI 2001

  16. Reduction decisions/options • join operations: which groups? • e.g. inner, outer, centres, … • distance measures to use … • dispersal operations: • selecting group(s) • Max volume, eigen-value, MST… • determining partitional method • random, VERI, MST, … • choosing join … CASI 2001

  17. Refinement P2={h1, …, h4} ---> P1={g1, ..., g6} nesting: P2 { P1 • gi = hi i = 1, 2 ; split (h3) -> g3, g4 split (h4) -> g5, g6 CASI 2001

  18. Refinement decisions/options • which groups to split? • e.g. inner, outer, directions, … • distance measures to use … • how to split? • MST, outlying points, reassignment, ... CASI 2001

  19. Reassignment P1={g1, ..., gk} -> P2={h1, …, hk} • objective function d(P) to be minimized. P <- P1 • for each object o in gi, assign it to one of gj (j != i) forming a new partition Pij and find largest Dij(o) = d(P) - d(Pij) • repeat for all i, j. If max Dij > 0 move o from gi, to gj • Repeat until D max <=0 CASI 2001

  20. Reassignment decisions/options • Objective function • distances, centres, … • within vs between/within, ... • variates/directions • Iteration strategy • single-pass, k-means, complete looping (greedy), start, … CASI 2001

  21. 4. Putting it together Series of moves in partition space: 1. Refine (P) -- > Pnew 2. Reduce (P) -- > Pnew 3. Reassign (P) -- > Pnew CASI 2001

  22. Additional ops on partitions • Unary: • Subset (P) • Operate any of R (subset (P)) • Manual (P) … change P according to manual intervention (e.g. colouring) CASI 2001

  23. n-ary operators • resolve (P1, ..., Pm) --> Pnew • dissimilarity (Pi, Pj) --> di,j • display (P1, ..., Pm) • dendrogram if P1 { …{ Pm • mds plot of all clusters in P1, …, Pm • mds plot of all partitions P1, …, Pm CASI 2001

  24. 5. Software modelling • Principal control panel: • current partition and list of saved partitions • refine, reduce, re-assign, re-start buttons • cluster plot button (mds plot) • random select button • subset focus and join toggle • operation on partitions button • manual button (form partition from point colours) CASI 2001

  25. Secondary panels • Refine: • performs refine, offers access to arguments • Reduce • performs reduce, offers access to arguments • Reassign • performs reassign, offers access to arguments • Each will operate on only those points highlighted or on all if none selected. CASI 2001

  26. Secondary panels (continued) • Operate on partitions • saved partitions list • resolve selected partition • plot selected partitions using selected dissimilarity • dendrogram of selected partitions (if nested) • cluster-plot for clusters of selected parttitions (esp. for non-nested) CASI 2001

  27. Software modelling (details) • Objects: • Point-symbols, case-objects (existing in Quail) • Cluster-points • Clusters • Partitions • Methods • Reduce, refine, reassign, ... CASI 2001

  28. Software illustration • Two prototype displays (buggy) • Single-window • Separate windows • Integration with existing Quail graphics • Manual, dendrogram, cluster plots, … • VERI clustering CASI 2001

  29. 6. Summary • Cluster analysis is naturally exploratory and needs integration with modern interactive data analysis. • Enlarging the problem to partitions: • simplifies and gives structure • encourages exploratory approach • integrates naturally • introduces new possibilities (analysis and research) CASI 2001

  30. Acknowledgements • Erin McLeish, several undergraduates and graduate students in statistical computing course at Waterloo • Quail: Quantitative Analysis in Lisp http:/www.stats.uwaterloo.ca/Quail CASI 2001

More Related